Efficient Query Processing Over Inconsistent Databases

by

Ariel Damián Fuxman

A thesis submitted in conformity with the requirements

for the degree of Ph.D. in Computer Science

Graduate Department of Computer Science

University of Toronto

Copyright © 2007 by Ariel Damián Fuxman

Abstract

Efficient Query Processing Over Inconsistent Databases

Ariel Damián Fuxman

Ph.D. in Computer Science

Graduate Department of Computer Science

University of Toronto

2007

Although integrity constraints have long been used to maintain data consistency, there

are situations in which they may not be enforced or satisfied. In this thesis, we present

ConQuer, a system for efficient and scalable answering of SQL queries on databases

that may violate a set of constraints. ConQuer permits users to postulate a set of key

constraints together with their queries. The system rewrites the queries to retrieve all

(and only) data that is consistent with respect to the constraints. The rewriting is into

SQL, so the rewritten queries can be efficiently optimized and executed by commercial

database systems.

The problem of obtaining consistent answers for primary key constraints and

Select-Project-Join (SPJ) queries is known to be intractable in general. However, we identify

a large and practical class of SPJ queries for which the problem is tractable. For this

class of queries, we provide a query rewriting algorithm that can be executed in linear

time in the size of the query. We consider SPJ queries that may have either set or bag

semantics. For the latter case, the queries may also have grouping and aggregation. We

show the maximality of the class of queries, in the sense that minimal relaxations of its

conditions may lead to intractability. Finally, we study the efficiency and scalability of the

query rewritings on a commercial database system. The study shows that the overhead

of the rewritings is reasonable, when we consider the original (non-rewritten) queries

as a baseline. The experiments use representative queries from TPC-H (the standard

benchmark for decision support systems) and databases of up to 20 GB.

To my parents, Silvia and Miguel

Acknowledgements

First and foremost, I would like to thank my supervisor, Renee J. Miller, for her constant

encouragement and support. During these years, I have benefited tremendously from her

remarkable vision and experience. She has been the greatest mentor, always available for

discussion and guidance. I will always be grateful for the endless hours she devoted to

reading and correcting my drafts, and for the numerous times she stayed at the university

until very late to help me out before conference deadlines.

I am grateful to the members of my committee (John Mylopoulos, Mariano Consens,

and Thodoros Topaloglou) for thoroughly reading my thesis and for their valuable

feedback. I also thank Leopoldo Bertossi for serving as the external reviewer of the thesis, and

for coming to Canada during his sabbatical in Chile with the sole purpose of attending

my thesis defense.

I am indebted to Alberto Mendelzon, who sadly passed away the year before I completed

my thesis. Alberto was not only an outstanding researcher, but also the warmest

and most generous person. At the beginning of my stay in Canada, I needed a job

offer in order to obtain permanent resident status. Alberto hardly knew me at that time

(I was then not even a member of the Database Group), but as soon as he heard about

my situation, he offered me a position as Research Associate in his group.

In 2004, I had the opportunity of visiting Phokion Kolaitis and Wang-Chiew Tan at the

University of California at Santa Cruz. It was a joy to work with both of them. They

were also wonderful hosts, and I thank them for their hospitality. During the summer of

2005, I did an internship with the Clio group at IBM Almaden, working with Mauricio

Hernandez, Lucian Popa, and Howard Ho. I very much enjoyed my time at Almaden,

where I had an opportunity to learn how research is done at an industrial lab. Special

thanks go to Mauricio for his unwavering support during the internship.

For the implementation of the ConQuer system, I received invaluable help from my

brother Diego. I “convinced” him to do his final undergraduate project on the topic of

consistent query answering, and his contribution was fundamental for the demo that we

gave at VLDB in Trondheim. Diego, I am proud of your work! I also thank Jiang Du for

his help in building up the experimental framework used in Chapter 7.

Many people helped to make these years in Toronto a very enjoyable experience. I

especially thank the “Latin American gang” (Sebastian Sardina, Andres Lagar-Cavilla,

Carlos Hurtado, Blas Melissari, Flavio Rizzolo, Pablo Sala, and many others) for their

friendship. I will always remember our long, heated debates at the Graduate Lounge,

which gained us the reputation of being the loudest group of people in the Department.

I am also grateful to Patricia Rodriguez Gianolli for her support during the last year of

my Ph.D.

And last, but definitely not least, I would like to thank my parents, Silvia and Miguel,

and my brothers, Adrian and Diego, for always being there, despite the distance: without

their love and support none of this would have ever been possible.

Contents

1 Introduction
1.1 Motivation
1.2 Consistent Query Answering
1.3 Contributions
1.4 Organization of the Document

2 Formal Framework
2.1 Repairs
2.2 Query Answering Semantics
2.3 Query Rewritings
2.4 Constraints
2.5 Related Work

3 Rewritings for Conjunctive Queries
3.1 A Broad Class of First-Order Rewritable Queries
3.1.1 Notation for Conjunctive Queries
3.1.2 Join Graph
3.1.3 The Class C_forest of First-Order Rewritable Queries
3.2 Query Rewriting Algorithm
3.3 Correctness of the Algorithm
3.3.1 Properties of Repairs
3.3.2 A Structural Property of C_forest
3.3.3 A “Pessimistic” Repair
3.3.4 Correctness of RewriteLocal
3.3.5 Correctness of RewriteTree
3.3.6 Correctness of RewriteForest
3.4 Related Work

4 Rewritings for Queries with Grouping and Aggregation
4.1 Formal Language
4.2 Algorithms
4.2.1 Queries with Bag Semantics
4.2.2 Queries with the sum, min, and max Functions
4.3 Correctness of the Algorithms
4.3.1 Building Upon First-Order Rewritings
4.3.2 An “Optimistic” Repair
4.3.3 Sound Ranges
4.3.4 Tight Ranges
4.3.5 Putting It All Together
4.4 Related Work

5 Complexity-Theoretic Analysis
5.1 Minimal Relaxations of C_forest
5.2 A Dichotomy Result
5.2.1 The Class C∗
5.2.2 Basic Intractable Cases
5.2.3 Generalizing the Basic Cases
5.3 Related Work

6 ConQuer: System Implementation and SQL Rewritings
6.1 System Implementation
6.2 ConQuer Rewritings for Queries without Aggregation
6.2.1 Rewriting Algorithm
6.2.2 Examples
6.3 ConQuer Rewritings for SPJ Queries with Aggregation
6.3.1 Rewriting Algorithm
6.3.2 Examples
6.4 Exploiting Precomputed Annotations
6.5 Related Work

7 Experimental Analysis
7.1 Experimental Framework
7.1.1 System and Database Manager Configuration
7.1.2 Inconsistent Database Instances
7.1.3 Workload
7.2 Experimental Results
7.2.1 Scalability
7.2.2 Effect of Degree of Inconsistency

8 Conclusions and Future Work

Bibliography

A TPC-H Queries and their Rewritings

B Design Advisor Indices

Chapter 1

Introduction

1.1 Motivation

The presence of inconsistent data is known to be a major problem in enterprises. However,

data analysts often make business decisions based on inconsistent data; and their

database systems rarely give any warning or indication about this situation. In fact,

current database management systems are largely unable to give such a warning because

they rely upon the fundamental assumption that the underlying data is consistent. In

this thesis, we tackle this problem by providing a set of tools that enable users to obtain

meaningful answers from databases even if they are partially inconsistent.

Integrity constraints have long been used by database management systems in order

to maintain data consistency. The typical data design process focuses on developing a set

of constraints that ensure that every possible database reflects a valid, consistent state

of the world. However, integrity constraints may not always be enforced or satisfied for

a number of reasons. For example, when data is integrated from multiple sources, each

source may satisfy a constraint (for example, a key constraint), but the merged data may

not (for example, if the same key value exists in multiple sources). More generally, when

data is exchanged between independently designed sources with different constraints, the

exchanged data may not satisfy the constraints of the destination schema. As another

example, in some environments, checking the consistency of constraints may be too

expensive, particularly for workloads with high update rates. Hence, the database may

become inconsistent with respect to the (unenforced) integrity constraints. In addition

to these long-standing problems, the trend toward autonomous computing is making the

need to manage inconsistent data more acute. In autonomous environments, we can no

longer assume that data are married with a single set of constraints that define their

semantics. As constraints are used in an increasing number of roles (from modelling

the query capabilities of a system, to defining mappings between independent sources),

there is an increasing number of applications in which data must be used with a set

of independently designed constraints. In such applications, a static approach where

consistency (with respect to a fixed set of constraints) is enforced on the database may

not be appropriate. Rather, a dynamic approach in which inconsistent data is tolerated,

but consistency is taken into account at query time, permits the constraints to evolve

independently from the data.

One strategy for managing inconsistent databases is data cleaning [DJ03]. Data

cleaning techniques seek to identify and correct errors in the data, and can be used to

restore an inconsistent database to a consistent state. Data cleaning, when applicable,

can be very successful. However, it is necessarily a semiautomatic process, which makes

it infeasible or unaffordable for some applications. Furthermore, committing to a single

cleaning strategy may not always be appropriate. A user may wish to experiment with

different cleaning strategies, or may desire to retain all data, even inconsistent data,

for tasks such as lineage tracing. Finally, data cleaning is only applicable to data that

contains errors. However, the violation of a constraint may also indicate that the data

contains exceptions, that is, clean data which simply does not satisfy a constraint.

In this thesis, we consider inconsistent databases that may violate a set of primary

key constraints. This type of constraint (together with foreign key constraints) is the

most commonly used in commercial database systems. Furthermore, databases that

violate primary key constraints are ubiquitous in enterprises. For example, in the domain

of Customer Relationship Management (CRM), data sources often contain conflicting

information about the same customer. Notably, commercial CRM tools provide limited

support for merging tuples corresponding to the same customer into one tuple in the

integrated database. Although they typically support some form of conflict resolution

rules (e.g., rules that take the average between two conflicting incomes of the same

customer), these rules may be difficult to design. In the absence of conflict resolution

rules, some CRM tools transfer all conflicting tuples to the integrated database. Thus,

even if the sources satisfy the key constraints, the integrated database may not.

1.2 Consistent Query Answering

While it is well known how to answer queries over consistent databases, we must give

a clear and precise semantics to the notion of a “meaningful” answer obtained from an

inconsistent database. In this thesis, we make use of a semantics based upon the notions

of possible worlds and certain answers, concepts that are widely used not only in the

context of database theory and data integration [Lip79, Lip81, AKG87, AD98], but also

in the field of knowledge representation [Lev81, Moo85]. These notions were first adapted

to the context of inconsistent databases by Arenas, Bertossi and Chomicki [ABC99], who

defined the semantics of consistent query answers.

The semantics of consistent query answers relies on the intuition that an inconsistent

database can be cleaned (or “repaired”) by adding or deleting tuples in such a way that

the resulting database satisfies some given integrity constraints. The semantics is agnostic

about which tuples should be added or removed. Therefore, each inconsistent database

may be associated with more than one clean, consistent database. A consistent answer is

then an answer that is obtained from every possible consistent database. Intuitively, this

means that the consistent answers are obtained no matter how the database is cleaned.

The semantics of consistent query answers provides a sound and elegant basis for the

study of the problem of query answering over inconsistent databases. However, despite

considerable work on its theoretical underpinnings [ABC99, CB00, ABC+03b, CLR03a,

CLR03b, BB03a, BB03b, CM05], to the best of our knowledge, little work has been

done on its practical applications. A key contribution of this thesis is to bridge the

gap between theory and practice by providing an efficient and scalable system to obtain

consistent query answers from inconsistent databases. In particular, we report the design

and evaluation of ConQuer, a system for managing inconsistent data.¹ In ConQuer, a

user may postulate a set of integrity constraints, possibly at query time, and the system

automatically retrieves all (and only) the query answers that are consistent with respect

to the constraints. ConQuer also helps users take advantage of the query results in order

to interactively clean the inconsistent database.

¹ ConQuer stands for Consistent Querying. ConQuer’s web page can be found at www.cs.toronto.edu/db/conquer.

The major challenge in consistent query answering is the potentially huge number

of consistent databases that can be associated with a given inconsistent database. In

the case of primary key constraints, which are the focus of this thesis, the number of

consistent databases is exponential in the size of the inconsistent database. This problem

is tackled in ConQuer by implementing a query rewriting approach. Given a query q,

ConQuer rewrites q into another query Q that has the following property: for every

inconsistent database, the rewritten query Q retrieves the consistent answers for the original

query q. The rewriting is done independently of the data, and works on every inconsistent

database. This approach has two fundamental advantages. First, it avoids constructing

the (potentially huge number of) consistent databases associated with the inconsistent

database. Second, the rewritten query is a SQL query that can be executed using any

commercial relational database management system (in ConQuer, we use IBM’s DB2).

In an extensive set of experiments, reported in Chapter 7, we show that the overhead

in the execution of the rewritten queries is reasonable, when compared to the original

(non-rewritten) ones.

     emplKey  salary
t1   John     1000
t2   John     2000
t3   Mary     1000

Figure 1.1: An inconsistent database

In the next example, we illustrate the semantics of consistent answers and the query

rewriting approach.

Example 1.1. Consider the database of Figure 1.1, which contains information about

employees and their salaries. In particular, the schema of the database has one relation

called employee, with two attributes: emplKey (the name of the employee) and salary.

Assume that a user specifies that the key of the relation should be the attribute

emplKey. Note that the database violates this key constraint, perhaps because its data

has been integrated from many operational sources. In particular, there are two tuples

for employee John, one stating that he makes a salary of 1000, and the other stating that

he makes a salary of 2000. Suppose that we do not know which one of these alternatives is

correct, but we still want to be able to draw meaningful answers from the database. Let

us consider the consistent databases (i.e., databases that satisfy the key constraint) that

can be built from the inconsistent database. We would like these databases to be not

only consistent, but also “as close as possible” to the inconsistent database. This leaves

us with two possible consistent databases (shown in Figure 1.2), obtained by deleting

exactly one tuple for John in each of them.

Consistent database 1        Consistent database 2

     emplKey  salary              emplKey  salary
t1   John     1000           t2   John     2000
t3   Mary     1000           t3   Mary     1000

Figure 1.2: Consistent databases for the inconsistent database of Figure 1.1

Consider a query q1 that retrieves the employees whose salary is less than

or equal to 1000.

q1: select distinct emplKey
    from employee
    where salary <= 1000

If we execute this query directly over the inconsistent database, we obtain {John, Mary}.

Intuitively, this is not a “consistent” answer because it may be the case that John has a

salary over 1000. In fact, if the consistent database turns out to be the database on the

right-hand side of Figure 1.2, then John would not appear in the answer.

One strategy to obtain the “consistent answer” would be to apply query q1 to each

of the consistent databases of Figure 1.2. While this may be feasible in this simple

example, it is clearly impractical when the number of tuples violating the constraint

grows. In particular, even for the schema and single constraint of this example, the

number of consistent databases is exponential in the size of the inconsistent database.

For this reason, in ConQuer, we never build the consistent databases explicitly. Instead,

we follow a query rewriting approach, where we rewrite the original query (q1 in this

case) into another query that can be executed directly on the inconsistent database and

is guaranteed to always return the consistent answers for the original query.
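The naive strategy described above can be sketched in a few lines of Python. This is only an illustration, not part of ConQuer: the function names `repairs` and `q1` and the in-memory representation of the relation are assumptions made for the example. It enumerates every consistent database by keeping exactly one tuple per key value, evaluates q1 on each, and intersects the results.

```python
from itertools import product

# Figure 1.1 as (emplKey, salary) tuples; emplKey is the postulated key.
employee = [("John", 1000), ("John", 2000), ("Mary", 1000)]

def repairs(rows):
    """Yield every consistent database: keep exactly one tuple per key value."""
    groups = {}
    for key, salary in rows:
        groups.setdefault(key, []).append((key, salary))
    for choice in product(*groups.values()):
        yield set(choice)

def q1(db):
    """q1 under set semantics: employees whose salary is at most 1000."""
    return {key for key, salary in db if salary <= 1000}

all_repairs = list(repairs(employee))
# A consistent answer must be returned on every consistent database.
consistent = set.intersection(*(q1(r) for r in all_repairs))
print(len(all_repairs), consistent)  # 2 {'Mary'}

# Ten key values with two conflicting tuples each already give 2**10 databases.
conflicting = [(f"k{i}", s) for i in range(10) for s in (1000, 2000)]
print(sum(1 for _ in repairs(conflicting)))  # 1024
```

The last two lines show why this enumeration does not scale: every additional key value with two conflicting tuples doubles the number of consistent databases.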

In this case, it is quite simple to obtain a rewriting of q1. Notice that John appears

associated with two different salaries in the inconsistent database: one satisfying the

query, the other not. This suggests that in the rewriting we should return the employees

that satisfy q1 (i.e., have a salary less than or equal to 1000) in every tuple of the

inconsistent database where they appear. This can be obtained using the following

query:

Q1: select distinct emplKey
    from employee e
    where salary <= 1000
      and not exists (select *
                      from employee e2
                      where e2.emplKey = e.emplKey
                        and e2.salary > 1000)

Notice the use of a nested subquery related by not exists. The purpose of this

subquery is to filter out those key values that satisfy q1 in some tuples, but violate it in

others. In our example, this subquery filters John out of the answer because he appears

in tuple t2 with a salary above 1000.
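To see the two queries side by side, the following sketch loads the database of Figure 1.1 into an in-memory SQLite database and runs both q1 and Q1. SQLite is used here only because it ships with Python; the thesis itself uses IBM DB2, and the table definition (with no declared key, so that the conflicting tuples can be stored) and the alias `e2` are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key is declared, so the conflicting tuples for John can coexist.
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany(
    "insert into employee values (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000)],
)

# Original query q1, evaluated directly on the inconsistent database.
q1 = "select distinct emplKey from employee where salary <= 1000"

# Rewritten query Q1: drop key values that violate the condition in some tuple.
Q1 = """
select distinct e.emplKey
from employee e
where e.salary <= 1000
  and not exists (select *
                  from employee e2
                  where e2.emplKey = e.emplKey
                    and e2.salary > 1000)
"""

direct = sorted(row[0] for row in conn.execute(q1))
rewritten = sorted(row[0] for row in conn.execute(Q1))
print(direct)     # ['John', 'Mary']
print(rewritten)  # ['Mary']
```

On this instance, the rewritten query returns the same answer that would be obtained by evaluating q1 on both consistent databases of Figure 1.2 and keeping only the common results.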

Despite the simplicity of the previous example, it has been shown in the literature

[CLR03a, CM05] that there are Select-Project-Join queries for which there is no rewriting

into SQL (under a very likely complexity-theoretic assumption). However, we observe

that the presence of these negative results does not necessarily preclude the existence of

classes of queries for which there is a SQL rewriting. In fact, in Chapter 3, we show a

large and practical class of Select-Project-Join queries for which there is a SQL rewriting.

In Chapter 5, we show that this is a maximal class of queries, in the sense that minimal

relaxations of its conditions lead to queries for which there is no SQL rewriting.

Most of the previous work on consistent query answering (except [ABC+03b]) focuses

on queries with set semantics and no aggregation. However, practical query languages

like SQL have bag semantics (duplicates are not eliminated unless explicitly requested),

and support aggregation functions and grouping of results. In Chapter 2, we present

a generalization of the semantics of consistent answers for queries with bag semantics,

grouping and aggregation. In Chapter 4, we provide query rewritings that work under

this semantics.

In the thesis, we are concerned not only with the correctness of the rewritings (i.e.,

ensuring that they retrieve all and only the consistent answers), but also with their

efficiency when executed using existing database technology. We address efficiency issues

and their empirical validation in Chapters 6 and 7.

1.3 Contributions

The main contributions of this thesis are the following:

• We identify a large and practical class of Select-Project-Join queries for which the

problem of computing consistent answers is tractable. The class consists of queries

that can have two kinds of joins. First, they can have joins between key attributes.

Second, they can have joins from non-key attributes of a relation (possibly a foreign

key) to the primary key of another relation. Arguably, these two types of joins are

the most commonly used in practice (and certainly the most common in industry

standard benchmarks like TPC-H). (Chapter 3)

• For the class of tractable queries that we identify, we provide a query rewriting

algorithm that produces a query in first-order logic that returns the consistent answers.

The algorithm runs in polynomial time in the size of the query. The rewritings

are sound and complete, in the sense that they return all (and only) the consistent

answers. Since first-order queries can be written in SQL, the rewritings in

first-order logic are a first step towards reusing existing commercial database technology.

This work was first published at the International Conference on Database Theory

(ICDT) [FM05], and an extended journal version has been invited to the Journal

of Computer and System Sciences (JCSS) [FM06]. (Chapter 3)

• We consider not only Select-Project-Join queries with set semantics, but also queries

with bag semantics, grouping and aggregation. These extensions are needed to

enable practical use in decision support applications. For this purpose, we extend

the semantics of consistent answers originally proposed by Arenas, Bertossi and

Chomicki [ABC99, ABC+03b]. We provide sound and complete algorithms

under this semantics for the most common SQL aggregation functions (count, min,

max, sum). This work has been published at the ACM International Conference

on the Management of Data (SIGMOD) [FFM05a]. (Chapters 2 and 4)

• We show a large class of Select-Project-Join queries for which the conditions of

applicability of our rewriting algorithm are not only sufficient but also necessary.

In particular, we show a class in which the problem of computing the consistent

answers is coNP-complete (and, assuming P ≠ NP, inexpressible in first-order logic)

for every query of the class that violates the conditions of the class of queries for

which we give a rewriting algorithm. This type of result is stronger than the

complexity results given in the consistent query answering literature [CLR03a, CM05],

which consist of showing intractability of a class by exhibiting at least one query for

which the problem is intractable. As a corollary of our result, we get a dichotomy

for this class of queries: given a query q in our class, either the problem of computing

the consistent answers for q is first-order rewritable (and thus it is in PTIME),

or it is a coNP-complete problem. (Chapter 5)

• We present the implementation of ConQuer, a system for querying inconsistent

databases. We also explain in detail the SQL rewritings produced by the system.

ConQuer has been demonstrated at the International Conference on Very Large

Databases (VLDB) [FFM05b]. (Chapter 6)

• We study the running time of ConQuer’s SQL rewritings on a commercial database

system, in particular IBM DB2. To this end, we present a detailed performance

study using the data and queries of the TPC-H decision support benchmark. The

study focuses on the overhead of the rewritings, using the original (non-rewritten)

queries as a baseline. We study the scalability of the approach (with databases of

up to 172 million tuples), and the effect of the degree of inconsistency (in terms

of the percentage of tuples that are inconsistent and the number of conflicting

tuples per key value). The experiments show that our approach can be applied to

large databases, several orders of magnitude larger than those considered in other

approaches for querying inconsistent databases. (Chapter 7)

1.4 Organization of the Document

The rest of this document is organized as follows. In Chapter 2, we present the formal

framework for querying inconsistent databases that will be used throughout the thesis.

In Chapters 3 and 4, we present query rewritings and focus on proving their correctness.

In Chapter 3, we consider a large and practical class of conjunctive queries (that is,

Select-Project-Join queries) and present rewritings in first-order logic. In Chapter 4, we

consider queries with bag semantics, grouping and aggregation, and present rewritings

in an extension of first-order logic with grouping and aggregation functions. In Chapter

5, we show the maximality of the class of queries that is the input to the rewriting

algorithms.

In Chapter 6, we present ConQuer, a system for efficiently querying inconsistent

databases. We present in detail the SQL query rewritings produced by ConQuer for

queries with and without aggregation. The efficiency of these rewritings is empirically

validated in Chapter 7 with an extensive set of experiments. We present related work in

separate sections at the end of each of the chapters. In Chapter 8, we finish the document

with conclusions and directions for future work.

Chapter 2

Formal Framework

In this chapter, we present the formal framework that will be used throughout the thesis.

In this framework, an inconsistent database is associated with a space of consistent

databases called repairs. In Section 2.1, we formally define the notion of repair. Then, in

Section 2.2, we introduce the semantics for query answering over inconsistent databases.

This semantics involves the exploration of all repairs of an inconsistent database. Since

the number of repairs can be very large, in this thesis we advocate a query rewriting

approach, where queries are rewritten in such a way that their consistent answer can be

obtained by posing another query directly on the inconsistent database, without explicitly

building any repair. In Section 2.3, we formally define the notion of a query rewriting.

Finally, in Section 2.4, we introduce the integrity constraints that are the focus of this

thesis.

2.1 Repairs

A schema R is a finite collection of relation symbols, each of which has an associated

arity. A database instance (or database) I over R is a function that associates each

relation symbol r of R to a relation I(r). A relation I(r) of arity k is a set of k-tuples

whose elements belong to some underlying fixed domain.¹ Whenever it is clear from

context, we will abuse notation and use the same symbol r to denote both a relation

symbol and a relation. Given a tuple t⃗ occurring in relation I(r), we denote by r(t⃗) the

association between t⃗ and r.

¹ Although we will consider both set and bag semantics for queries, we always assume the relations of a database instance (including inconsistent databases) to be sets.

A database instance I is consistent with respect to a set of integrity constraints Σ if

I satisfies Σ in the standard model-theoretic sense, that is I |= Σ. (As customary, an

integrity constraint may be any first-order formula [AHV95]). Throughout this thesis,

we will consider databases that may violate a given set of integrity constraints. That is,

given R and a set of integrity constraints Σ over R, a database I may be inconsistent with

respect to Σ, that is, I ⊭ Σ.

Intuitively, we will assume that an inconsistent database can be cleaned (or “re-

paired”) by adding or deleting tuples in such a way that the resulting database satisfies

the given integrity constraints. We will be agnostic about which tuples should be added

or removed. Therefore, each inconsistent database may be associated to more than one

possible clean, consistent database. Furthermore, no matter how the clean databases are

obtained, we would like them to be “as close as possible” to the original, inconsistent

database (that is, to minimize the number of tuples that are added or removed). We will

call each consistent database a repair.

The notion of repair was originally introduced by Arenas, Bertossi and Chomicki

[ABC99]. A repair is a database instance that satisfies the given integrity constraints,

and which has a minimal distance to the inconsistent database. The distance between

two database instances I and I ′ is defined as their symmetric difference, i.e., ∆(I, I ′) =

(I − I ′) ∪ (I ′ − I). The formal definition of repair is the following.

Definition 2.1 (Repair [ABC99]). Let I be a database instance, and Σ be a set of

integrity constraints. We say that an instance I′ is a repair of I with respect to Σ if:2

• I′ |= Σ, and

• there is no instance I″ such that I″ |= Σ and ∆(I, I″) ⊂ ∆(I, I′) (i.e., ∆(I, I′) is

minimal under set inclusion in the class of instances that satisfy Σ).

Example 2.1. Let R be a schema with one relation symbol employee. Assume that

employee has two attributes: emplKey (the name of the employee) and salary, and

that the only constraint in Σ is that attribute emplKey is the key of relation employee.

Let I = {employee(John, 1000), employee(John, 2000), employee(Mary, 1000)}. The

database I is inconsistent with respect to Σ because it violates the key constraint stating

that every employee has exactly one salary.

2 Whenever Σ is clear from the context, we will just say that I′ is a repair of I.


There are two repairs of I wrt Σ: I1 = {employee(John, 1000), employee(Mary, 1000)}

and I2 = {employee(John, 2000), employee(Mary, 1000)}. Notice that, according to

Definition 2.1, the databases {employee(John, 2000)} and {employee(Mary, 1000)} are

not repairs because their distance with respect to I is not minimal under set inclusion.

The minimality condition for repairs is crucial in the definition. Otherwise, the empty

set would trivially be a repair of every database that violates a set of key constraints.

Notice that repairs do not need to be unique. For example, if the given set of con-

straints consists of key dependencies, the number of repairs can be exponential in the

size of the inconsistent database.
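
As a concrete illustration of this blow-up, the following Python sketch enumerates the repairs of a single relation under one key constraint. It relies on the fact that, for key constraints, a repair keeps exactly one tuple from each group of tuples that share a key value. The function name `repairs`, the `key_len` parameter, and the encoding of tuples as Python tuples with the key attributes first are assumptions made for this illustration; they are not part of the thesis.

```python
from collections import defaultdict
from itertools import product


def repairs(relation, key_len=1):
    """Enumerate the repairs of one relation under a single key constraint.

    Assumption: the key consists of the first `key_len` attributes.
    Each repair keeps exactly one tuple per key value.
    """
    groups = defaultdict(list)
    for t in sorted(relation):
        groups[t[:key_len]].append(t)
    # One choice per key group; the Cartesian product yields every repair.
    return [frozenset(choice) for choice in product(*groups.values())]


# The instance I of Example 2.1.
employee = {("John", 1000), ("John", 2000), ("Mary", 1000)}
all_repairs = repairs(employee)

# Ten key values with two conflicting tuples each: 2**10 repairs.
wide = {(k, s) for k in range(10) for s in (1000, 2000)}
wide_repair_count = len(repairs(wide))
```

On the instance of Example 2.1 the sketch yields the two repairs I1 and I2. On `wide`, every additional conflicting key value doubles the number of repairs, which is the exponential behaviour described above.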

2.2 Query Answering Semantics

The notion of repair can be used to give a precise meaning to query answering over

inconsistent databases. Intuitively, each repair corresponds to one particular way of

cleaning the database. Since we are agnostic about how the database should be cleaned,

it makes sense to consider the answers that would be obtained from every repair. This

notion is formalized with the concept of consistent answers, which we define next.

Definition 2.2 (Consistent Answer [ABC99]). Let R be a schema. Let Σ be a set

of integrity constraints. Let I be an instance over R (possibly inconsistent with respect

to Σ). Let q be a query over R. We say that a tuple ~t is a consistent answer for q with

respect to Σ if ~t ∈ q(I′), for every repair I′ of I with respect to Σ. We denote this as

~t ∈ consistentΣ(q, I).

This definition was originally given by Arenas, Bertossi and Chomicki [ABC99]. It is

based on the semantics of certain answers [Lip79, Lip81, AKG87] that has been used in

database theory, and possible worlds, which is well-known in knowledge representation

[Lev81]. In the case of consistent answers, the space of possible worlds corresponds to

the repairs of the inconsistent database.

Example 2.1. (continued) Consider a query that retrieves all the employees from

the database, expressed as q1(e) = ∃s.employee(e, s). Recall that there are two re-

pairs of I wrt Σ: I1 = {employee(John, 1000), employee(Mary, 1000)} and I2 =

{employee(John, 2000), employee(Mary, 1000)}. The result of applying q1 on both I1


and I2 is {(John), (Mary)}. Thus, the consistent answers for q1 on I are the tuples

(John) and (Mary).

Now, consider a query that retrieves employees together with their salaries, expressed

as q2(e, s) = employee(e, s). Notice that q2 is the identity on the repairs. Thus, the con-

sistent answer can be obtained as the intersection of I1 and I2. In consequence, the only

consistent answer for q2 on I is (Mary, 1000). Notice that the tuples (John, 1000) and

(John, 2000) are not consistent answers. The reason is that neither of them is present

in both repairs. Intuitively, this reflects the fact that John’s salaries are inconsistent data,

and we do not want to retrieve possibly erroneous results.
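
The computation in this example can be reproduced with a brute-force sketch that follows Definition 2.2 literally: evaluate the query on every repair and intersect the results. This is only a reference implementation for tiny instances, since it enumerates all repairs, which is exactly what the rewriting approach of this thesis avoids. The helper names (`repairs`, `consistent_answers`) and the encoding of queries as Python functions are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def repairs(relation, key_len=1):
    """Repairs under one key constraint: one tuple kept per key value."""
    groups = defaultdict(list)
    for t in sorted(relation):
        groups[t[:key_len]].append(t)
    return [set(choice) for choice in product(*groups.values())]


def consistent_answers(query, relation, key_len=1):
    """Intersect the query results over all repairs (Definition 2.2)."""
    results = [set(query(r)) for r in repairs(relation, key_len)]
    return set.intersection(*results)


def q1(employee):
    """q1(e) = exists s. employee(e, s)"""
    return {(e,) for (e, _s) in employee}


def q2(employee):
    """q2(e, s) = employee(e, s)"""
    return set(employee)


I = {("John", 1000), ("John", 2000), ("Mary", 1000)}
answers_q1 = consistent_answers(q1, I)
answers_q2 = consistent_answers(q2, I)
```

Here `answers_q1` contains both employees, while `answers_q2` contains only Mary's tuple, matching the discussion above.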

For convenience, we will use the following notation for the consistent answers of

Boolean queries.

Definition 2.3. Let R be a schema. Let Σ be a set of integrity constraints. Let

I be a database instance over R. Let q be a Boolean query over R. We say that

consistentΣ(q, I) = true if for every repair I′ of I with respect to Σ, I′ |= q. We

say that consistentΣ(q, I) = false if there exists at least one repair I′ of I with respect

to Σ such that I′ ⊭ q.

Notice the asymmetry between the case for consistentΣ(q, I) = true and

consistentΣ(q, I) = false. While for the former, every repair must satisfy the query,

for the latter it suffices to have just one non-satisfying repair. This is not intrinsic to

Boolean queries: by Definition 2.2, it is also the case that ~t ∉ consistentΣ(q, I) if there

exists at least one repair I′ of I such that ~t ∉ q(I′).

The definition of consistent answers is independent of the language used to express

the input query q, and it makes perfect sense for queries that, for example, return tuples

from the active domain of the database. However, for queries that compute aggregates

over groups of tuples, it may be useful to relax this definition, as we motivate next.

Example 2.1. (continued) Let q3(s, v) be a SQL query that counts the number of

occurrences of each salary in the database:

select salary as s, count(*) as v

from employee

group by salary


Recall that there are two repairs of I with respect to Σ: I1 = {employee(John, 1000),

employee(Mary, 1000)} and I2 = {employee(John, 2000), employee(Mary, 1000)}. The

result of applying query q3 to the repairs is the following: q3(I1) = {(1000, 2)}, and

q3(I2) = {(1000, 1), (2000, 1)}. Since the intersection of these results is empty, according

to Definition 2.2, the set of consistent answers for q3 is empty. However, notice that the

salary 1000 appears in every query result (but together with a different number for the

count of occurrences). Intuitively, it would be desirable to report this salary in the result.

In the previous example, the value 1000 appears in every query result. However, it

appears a different number of times on each of them. How do we report the number of

times that it appears? In the semantics that we define next, we employ tight bounds

for this purpose. In this particular example, we will say that the minimum (greatest

lower bound) is one, since q3(I2) reports the salary 1000 with a count of one; and that the

maximum (lowest upper bound) is two, since q3(I1) reports the salary 1000 with a count of two.

In the following definition, we formalize this notion. The definition applies to any query

that computes an aggregate over a group (in our example, the aggregate is the count

of occurrences of each salary). We will denote with aggconsistentΣ(q, I) the modified

semantics for consistent answers for a query q on an instance I with respect to a set of

constraints Σ.

Definition 2.4 (Consistent Answer for Queries with Aggregation). Let R be

a schema. Let Σ be a set of integrity constraints. Let I be a database instance over

R. Let q be a query over R with free variables ~z and v, where v is a variable over a

numeric domain (possibly computed by an aggregate function). We say that (~t, glb, lub) ∈ aggconsistentΣ(q, I) if all the following conditions hold:

• for every repair I′ of I wrt Σ, there is some d such that (~t, d) ∈ q(I′) and glb ≤ d ≤ lub; and

• there is some repair I′ of I wrt Σ such that (~t, glb) ∈ q(I′); and

• there is some repair I′ of I wrt Σ such that (~t, lub) ∈ q(I′).

We also say that glb is the greatest lower bound of ~t in q, and that lub is the lowest

upper bound of ~t in q.

This definition is particularly well suited to the case of queries with bag semantics,

grouping and aggregation, which are prevalent in practice. For instance, consider the

query q3(s, v) of Example 2.1:


select salary as s, count(*) as v

from employee

group by salary

In this case, q3 has free variables s and v. The variable s corresponds to the attribute

salary, on which there is a grouping condition; the numerical argument v, for which we

give tight ranges, corresponds to the result of count(*). Essentially, for a query q(~z, v),

aggconsistentΣ(q, I) gives the consistent answers on I with respect to Σ for each value

of ~z (the salary in our example), together with a tight range for the possible associated

numerical values.

Example 2.1. (continued) Let us obtain the aggconsistentΣ answers for q3 on I. Re-

call that the result of applying q3 to the repairs of the inconsistent database is: q3(I1) =

{(1000, 2)}, and q3(I2) = {(1000, 1), (2000, 1)}. Then, we have that aggconsistentΣ(q3, I) =

{(1000, 1, 2)}. This means that the salary 1000 appears in every query result, and the

value of count(*) for 1000 has a greatest lower bound of one and a lowest upper bound

of two. Notice that the salary 2000 does not appear in aggconsistentΣ(q3, I). The intu-

itive reason is that 2000 is not a consistent answer, since it does not occur in repair I1.

According to the definition of aggconsistentΣ above, 2000 is not in the answer because

it fails to satisfy the first condition of Definition 2.4. This condition is violated because

I1 is a repair such that (2000, d) ∉ q3(I1), for every d.
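
The range semantics of Definition 2.4 can also be checked by brute force on this example. The sketch below evaluates q3 on every repair, keeps the groups that occur in all of them, and reports the smallest and largest count observed for each. It assumes a single relation with the key in the first position, and a query that returns exactly one value per group. The names `q3`, `agg_consistent` and `repairs` are illustrative.

```python
from collections import Counter, defaultdict
from itertools import product


def repairs(relation, key_len=1):
    """Repairs under one key constraint: one tuple kept per key value."""
    groups = defaultdict(list)
    for t in sorted(relation):
        groups[t[:key_len]].append(t)
    return [set(choice) for choice in product(*groups.values())]


def q3(employee):
    """select salary as s, count(*) as v from employee group by salary"""
    return dict(Counter(s for (_e, s) in employee))


def agg_consistent(agg_query, relation, key_len=1):
    """Brute-force reading of Definition 2.4 for a grouping query.

    `agg_query` maps an instance to a dict {group: value}.
    """
    results = [agg_query(r) for r in repairs(relation, key_len)]
    # Condition 1: the group must appear in the result on every repair.
    common = set.intersection(*(set(res) for res in results))
    # Conditions 2 and 3: min and max are attained on some repair.
    return {
        (g, min(res[g] for res in results), max(res[g] for res in results))
        for g in common
    }


I = {("John", 1000), ("John", 2000), ("Mary", 1000)}
ranges = agg_consistent(q3, I)
```

For the instance I, `ranges` is the single triple (1000, 1, 2). The salary 2000 is dropped because it is missing from the result on I1.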

To the best of our knowledge, the problem of computing consistent answers for queries

with aggregation has only been studied before by Arenas et al. [ABC+03b]. In particular,

they were the first to propose a generalization of the semantics of consistent answers,

where ranges rather than exact values are returned. In their work, they consider a class

of SQL queries with no grouping, no selection conditions (i.e., no conditions in the where

clause) and on exactly one relation. In Chapter 4, we will present results for a much

larger class of queries. For the class of queries considered by Arenas et al., our and their

semantics coincide. However, we need to extend their semantics in order to be able to

deal with grouping.

2.3 Query Rewritings

The definition of consistent answers introduced in the previous section involves the explo-

ration of a potentially huge number of repairs (in the case of keys, it can be exponential in


the size of the inconsistent database). In this thesis, we approach this problem by design-

ing algorithms that compute consistent answers directly from the inconsistent database,

without explicitly building the repairs. Given a query q, our algorithms will return an-

other query Q such that, for every instance I, the consistent answers for the original

query q can be obtained by just evaluating Q on I. We call Q a query rewriting for the

problem of computing the consistent answers of q.

In order to give a formal definition of query rewriting, we first define the computa-

tional problems associated to computing consistent answers using the consistentΣ and

aggconsistentΣ operators (the latter for the case in which the query computes numerical

values over a group of tuples).

Definition 2.5. Let R be a schema. Let q be a query over R. Let Σ be a set of integrity

constraints.

The problem CONSISTENT(q, Σ) is the following: given an instance I over R, and

tuple ~t, is it the case that ~t ∈ consistentΣ(q, I)?

The problem AGGCONSISTENT(q, Σ) is the following: given an instance I over R, tuple

~t and real numbers glb and lub, is it the case that (~t, glb, lub) ∈ aggconsistentΣ(q, I)?

We can now define the notion of query rewriting for the problems CONSISTENT(q, Σ)

and AGGCONSISTENT(q, Σ). The definition is given for a fixed (but undefined) query

language.

Definition 2.6 (L-query rewriting). Let R be a schema. Let Σ be a set of integrity

constraints. Let q be a query over R. Let Q be a query expressed in a query language L (possibly different from the language used to express q).

We say that Q is an L-rewriting of CONSISTENT(q, Σ) if for every instance I over R

and tuple ~t, ~t ∈ Q(I) iff ~t ∈ consistentΣ(q, I).

We say that Q is an L-rewriting of AGGCONSISTENT(q, Σ) if for every instance I

over R, tuple ~t and real numbers glb and lub, (~t, glb, lub) ∈ Q(I) iff (~t, glb, lub) ∈ aggconsistentΣ(q, I).

We also define the rewritability of a problem in a language L as follows. We say that

CONSISTENT(q, Σ) is L-rewritable if there exists a query Q expressed in language L such

that Q is a query rewriting for CONSISTENT(q, Σ). A similar definition can be given for

AGGCONSISTENT(q, Σ).
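
To make the notion of a rewriting concrete, the sketch below evaluates a hand-written SQL rewriting for the query q2(e, s) = employee(e, s) of Example 2.1 directly on the inconsistent instance. The rewriting keeps a tuple only if no other tuple has the same key and a different salary. It is written by hand for this one query and is not claimed to be the output of the algorithms presented later in the thesis. The use of Python's `sqlite3` module, the table layout, and the constant name `REWRITTEN_Q2` are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY is declared, so the key violation can be stored.
conn.execute("CREATE TABLE employee (emplKey TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000)],
)

# Candidate rewriting Q for q2(e, s) = employee(e, s): keep a tuple only if
# its key value is not associated with any other salary.
REWRITTEN_Q2 = """
SELECT e.emplKey, e.salary
FROM employee e
WHERE NOT EXISTS (
    SELECT 1
    FROM employee e2
    WHERE e2.emplKey = e.emplKey
      AND e2.salary <> e.salary
)
"""

rows = conn.execute(REWRITTEN_Q2).fetchall()
conn.close()
```

The query runs once on the inconsistent table and returns only Mary's tuple, the consistent answer obtained earlier by intersecting the repairs, without materializing any repair.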

In Chapter 3, we will consider classes of conjunctive queries, and present query rewrit-

ings in first-order logic. Notice that if CONSISTENT(q, Σ) is first-order rewritable, then


it is tractable. This is because the data complexity of first-order logic is in PTIME (in

fact, in AC0, which is a subset of PTIME). Thus, the query rewriting Q can be executed

on the inconsistent database in polynomial time. Besides this, an approach based on first-order

query rewriting is attractive because first-order queries can be written in SQL. In

Chapter 4, we will focus on classes of conjunctive queries with bag semantics, grouping,

and aggregation. We will give query rewritings for the problem AGGCONSISTENT(q, Σ) in

a language that extends first-order logic with operators for grouping and aggregation. In

Chapter 5, we will study the computational complexity of the problem CONSISTENT(q, Σ).

Finally, in Chapters 6 and 7, we will present SQL query rewritings and show experimen-

tally that they can be run efficiently and scalably on a commercial relational database

system.

2.4 Constraints

The most commonly used types of constraints in database systems are keys and foreign

keys. Of these, keys pose a particular challenge since databases that are inconsistent

with respect to a set of key dependencies admit an exponential number of repairs in the

worst case. This potentially large number of repairs leads to the question of whether it is

possible to compute consistent answers efficiently. The answer to this question is known

to be negative in general [CLR03a, CM05]. However, this does not necessarily preclude

the existence of classes of queries for which the problem is easier to compute. Hence, we

consider the following questions: for what queries is the problem of computing consistent

answers under key constraints in polynomial time (in data complexity)? And, can the query

rewritings for such queries be executed efficiently in practice? We address the first question in Chapters

3 and 4, and the second question in Chapter 6.

A key constraint is an integrity constraint of the form

∀~x, ~y, ~z.(r(~x, ~y) ∧ r(~x, ~z)) → ~y = ~z

In the above constraint, we say that ~x is a key of relation r. Notice that a key may

consist of many attributes. Throughout the thesis, we will assume that Σ is a set of key

constraints that includes one key constraint per relation of the schema. This corresponds

to the notion of primary keys in database systems.
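
The key constraint above can be checked on a given relation with a few lines of code. The following sketch returns the key values that are associated with more than one nonkey value. These are the groups of tuples that make the relation inconsistent. The function names and the convention that the key occupies the first `key_len` positions are assumptions made for illustration.

```python
from collections import defaultdict


def key_violations(relation, key_len=1):
    """Return the key values that occur with more than one nonkey value."""
    nonkeys = defaultdict(set)
    for t in relation:
        nonkeys[t[:key_len]].add(t[key_len:])
    return {k for k, values in nonkeys.items() if len(values) > 1}


def satisfies_key(relation, key_len=1):
    """True if the relation satisfies the key constraint."""
    return not key_violations(relation, key_len)


I = {("John", 1000), ("John", 2000), ("Mary", 1000)}
I1 = {("John", 1000), ("Mary", 1000)}
violations = key_violations(I)
```

On the instance I of Example 2.1, the only violating key value is John, and the repair I1 satisfies the constraint.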

To facilitate specifying the key constraints each time that we give a query, we will underline

the positions in each literal that correspond to key attributes. Furthermore,

by convention, the key attributes will be given first. For example, the query q =

∃x, y, z.r1(x̲, y) ∧ r2(y̲, z) indicates that the first and second literals correspond to binary

relations whose first attribute is the key. We will use vector notation (e.g., ~x, ~y) to

denote vectors of variables or constants from a query or tuple. In addition, when we give

a tuple, we will underline the values that appear at the position of key attributes. For

instance, for a tuple r(~c, ~d), we will say that ~c is a key value, and ~d is a nonkey value.

Using this notation, the key constraints of Σ that are relevant to the query are denoted

directly in the query expression.

2.5 Related Work

In this section, we survey work on related formal frameworks for managing inconsistent

data. For two excellent surveys of the area of consistent query answering, we refer the

reader to Bertossi and Chomicki [BC03] and Bertossi [Ber06].

Intuitively, a repair is a consistent database that is “as close as possible” to the given

inconsistent database. To formalize this intuition, it is necessary to define a notion of

distance between databases. The notion of distance that we employ in this thesis (and

which was initially proposed by Arenas, Bertossi, and Chomicki [ABC99]) is defined in

terms of the symmetric difference between sets. Other notions of distance have been

explored in the literature, which we review next.

Some proposals adopt a cardinality-based notion of distance between database in-

stances, instead of set-theoretic. For example, Lin and Mendelzon [LM96] propose a

semantics where conflicts are resolved according to a majority criterion. Their frame-

work is presented in the context of belief revision for first-order theories, and is therefore

broader in scope than consistent query answering. However, the complexity of query an-

swering under this semantics has not been studied. Other approaches [FPL+01, BBFL05,

FFP05, BMFR05] consider cost-based notions of distance, where each operation that can

be used to restore consistency is given a cost. Then, repairs are defined as the consistent

databases that can be obtained from the inconsistent database with a minimum cost.

These operations include not only insertion and deletion of tuples, but also modification

of values. While a cost-based notion of distance is attractive from a semantic point of

view, it can be computationally more expensive than the set-theoretic semantics. For

example, in the case of inconsistencies with respect to primary key dependencies, the

problem of obtaining a repair of an inconsistent database is NP-complete [BMFR05],


whereas it can be obtained in linear time under the set-theoretic semantics.

In some of the cost-based approaches mentioned above [FPL+01, BBFL05, FFP05],

tuples can be modified to contain values that are not in the active domain of the incon-

sistent database. Thus, the domain of the attributes that can be modified must have

an intrinsic distance metric. In particular, these approaches consider only numerical at-

tributes (it is not clear how their techniques could be extended to categorical values).

An approach based on tuple modification which allows arbitrary attribute domains is

given by Wijsen [Wij05]. In his work, the repaired databases may contain variables, and

the semantics is given in terms of homomorphisms to the inconsistent database. Instead

of answering queries directly on the inconsistent database (as we do in ConQuer), his

approach requires the offline processing of the inconsistent databases to construct con-

densed representations. The consistent answers to certain classes of queries can then be

obtained by directly executing the original query on the condensed representation.

In contrast to consistent answers, we could also consider possible answers, where

we retrieve answers that appear in at least one repair. This notion has received

less attention than consistent answers, perhaps because it is less challenging from a

computational point of view. In fact, for broad classes of queries and constraints for which

obtaining consistent answers is intractable, the problem of obtaining possible answers

is tractable (and it usually suffices to compute the original query on the inconsistent

database). Although they are easier to obtain, possible answers are as important as

consistent answers in the context of inconsistent databases. While consistent answers are

best suited for decision making, possible answers can be used to understand the reasons

why a database is inconsistent. For example, in ConQuer, we give the option of retrieving

not only the consistent answers but also the possible answers (see Chapter 6). If the user

decides that a possible answer should have been a consistent answer, he or she can request

an explanation from the system in terms of the underlying database. This explanation

often helps the user to detect incorrect data and to (interactively) correct it.

The notions of possible and consistent answers are two opposite ends of a spectrum:

the former being the most aggressive, and the latter the most cautious. In some scenarios,

it is desirable to give preference to (or rank) tuples in the answer according to the

number of repairs where they appear. Furthermore, some repairs may be more preferable

than others. To formalize this intuition, it is natural to appeal to a semantics based on

probabilities, where each repair is assigned a probability of being the consistent database

that the user has in mind. There has been considerable research on the topic of prob-


abilistic databases [CP87, BMP92, LLRS97, FR97, DS04]. Recently, Dalvi and Suciu

[DS04] presented a framework for query rewriting over probabilistic databases. Their

rewriting algorithms rely on the fundamental assumption that each tuple has an inde-

pendent probability of being in the (in our terms) consistent database. In the context

of databases that violate primary key constraints, which is the focus of this thesis, we

cannot assume that all tuples are independent. In fact, tuples that share the same key

value are mutually exclusive. In recent work (which is not covered in this thesis), we

and other authors [AFM06] presented query rewriting algorithms that work under the

probabilistic semantics for databases that may violate primary key constraints. In that

paper, we also considered the important problem of obtaining the probabilities. In par-

ticular, we explored the use of a clustering-based technique that works particularly well

on categorical values [ATMS04]. The non-probabilistic semantics that we consider in this

thesis is a special case of the probabilistic semantics. However, the class of rewritable

queries that we can handle under the probabilistic semantics [AFM06] is considerably

more restricted than the classes considered in Chapters 3 and 4 of this thesis for the

non-probabilistic case.

Databases that are inconsistent with respect to primary key constraints can be modelled

as disjunctive databases [vdM98]. In particular, if Σ is a set of key dependencies, the

set of all repairs of an inconsistent database can be represented as a disjunctive database

D in such a way that each repair corresponds to a minimal model of D. However, to

the best of our knowledge, there are no results in the literature for query rewritings over

disjunctive databases. A relevant special case of disjunctive databases are databases with

OR-objects [IvdMV95]. If an inconsistent relation has two attributes (a key and a nonkey

attribute), then it can be modelled with OR-objects. However, this is no longer the case

for relations whose arity is greater than two.

To the best of our knowledge, DeMichiel [DeM89] and Agarwal et al. [AKWS95] are

the first authors to recognize the need to manage inconsistent databases. They propose

semantics analogous to the one for OR-objects. DeMichiel proposes algorithms that are

sound but not necessarily complete with respect to the semantics. Agarwal et al. do not

discuss the implementation of the projection and join operations which, as we will see in

Chapter 3, are particularly challenging under the consistent query answering semantics,

and an important contribution of this thesis.

We conclude this section by pointing out that the problem of dealing with inconsis-

tency arises (and has been studied) in other fields of computer science. For example, our


approach to handling inconsistency is related to the approaches followed by the belief

revision community [GR95] in the field of artificial intelligence. The scenario typically

adopted in belief revision is more general in scope than ours, since (in our terms) they

allow the modification of not only the data but also the integrity constraints. As another

example, the problem of handling inconsistency has been studied in software engineer-

ing [Bal91, NER00]. The focus of this body of work is not centered on data or query

answering, but on the reconciliation of inconsistent views of software requirements and

specifications.

Chapter 3

Rewritings for Conjunctive Queries

The problem of computing consistent answers for conjunctive queries over databases that

might violate a set of key constraints is known to be coNP-complete in general [CLR03a,

CM05]. This is the case even for queries with no repeated relation symbols, which is

the focus of this chapter. However, this does not necessarily preclude the existence of

classes of queries for which the problem is easier to compute. In fact, in this section we

characterize a large and practical class of conjunctive queries for which the problem of

computing consistent answers under key constraints is indeed tractable. Even more so,

we show that all queries in this class are first-order rewritable, and we give a linear-time

algorithm that computes the first-order rewriting. We introduce the class of queries in

Section 3.1, and we present the query rewriting algorithm in Section 3.2. The proof of

correctness of the algorithm is given in Section 3.3.

3.1 A Broad Class of First-Order Rewritable Queries

3.1.1 Notation for Conjunctive Queries

The results in this chapter concern a class of conjunctive queries. Conjunctive queries

[CM77, AHV95] are first-order formulas that may only have conjunctions of positive

literals and existential quantification. That is, they are formulas of the following form:

q(~z) = ∃~w.R1(~x1, ~y1) ∧ · · · ∧Rn(~xn, ~yn)

where the variables of ~x1, ~y1, . . . , ~xn, ~yn appear in exactly one of ~z and ~w. We will

say that the variables in ~z are the free variables of q, and that the variables in ~w are the


existentially-quantified variables of q. Even though there are no equality symbols in our

notation for conjunctive queries, their effect can be achieved by having variables appear

more than once in the queries.

Notice that in the formula above, we denote the literals as Ri(~xi, ~yi). Throughout

the thesis, we will use the convention of using capital letters (usually R, S and T ) to

denote literals of a query. Notice that two distinct literals Ri and Rj may be on the same

relation symbol r (although most results in this thesis are for queries without repeated

relation symbols in which each literal corresponds to a distinct relation).

We will adopt the convention of using ~x to denote variables and constants of a literal

that appear at a position corresponding to key attributes of the relation symbol of the

literal, and ~y for variables and constants that appear at the position of nonkey attributes

of the relation symbol of the literal.

We will say that there is a join on a variable w if w appears in two literals Ri(~xi, ~yi)

and Rj(~xj, ~yj) such that i ≠ j. If w occurs in ~yi and ~yj, we say that there is a nonkey-to-nonkey

join on w; if w occurs in ~yi and ~xj, we say that there is a nonkey-to-key join;

and if w occurs in ~xi and ~xj, we say that there is a key-to-key join.

3.1.2 Join Graph

Before introducing the class of queries handled by our algorithm, let us get some insight

from queries that are not considered by our algorithm because (unless P=NP) there is

no first-order rewriting that computes the consistent answer (no matter what rewriting

algorithm is used). In particular, let us consider the following queries:

• q1 = ∃x, x′, y.R1(x̲, y) ∧ R2(x̲′, y)

• q2 = ∃x, y.R1(x̲, y) ∧ R2(y̲, x)

• q3 = ∃x, x′, w, w′, z, z′, m.R1(x̲, w) ∧ R2(m̲, w̲, z) ∧ R3(x̲′, w′) ∧ R4(m̲, w̲′, z′)

We will show in Chapter 5 that the problem of computing consistent answers for the

above queries is intractable. The first query consists of a join between nonkey attributes;

the second one involves a cycle of nonkey-to-key joins; and in the third, there are two

joins from nonkey variables to part, but not the entire key, of the corresponding relations.

In order to be more precise in specifying such conditions, we need the notion of the join

graph of a query, which has a node for each literal of a query. Notice that the conditions


that we just gave are concerned with joins where at least one nonkey variable is involved.

Therefore, the join graph will be a directed graph, where directionality is determined by

the nonkey variables involved in the join.

Definition 3.1 (join graph). Let q be a conjunctive query. The join graph G of q is a

directed graph such that:

• the vertices of G are the literals of q;

• there is an arc from Ri to Rj if i ≠ j, and there is some variable w such that w is

existentially-quantified in q, w occurs at the position of a nonkey attribute in Ri,

and w occurs in Rj.
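
Definition 3.1 translates almost directly into code. In the sketch below, each literal is encoded as a triple (name, key terms, nonkey terms), and every term is treated as a variable. This encoding, the function names `join_graph` and `has_cycle`, and the treatment of q1 and q2 as having no free variables are assumptions made for this illustration.

```python
def join_graph(literals, free_vars=()):
    """Return the arcs (Ri, Rj) of the join graph of Definition 3.1.

    Each literal is (name, key_terms, nonkey_terms). An arc Ri -> Rj exists
    if some existentially-quantified variable occurs at a nonkey position
    of Ri and anywhere in Rj, with i != j.
    """
    free = set(free_vars)
    arcs = set()
    for i, (name_i, _key_i, nonkey_i) in enumerate(literals):
        for j, (name_j, key_j, nonkey_j) in enumerate(literals):
            if i == j:
                continue
            if (set(nonkey_i) - free) & (set(key_j) | set(nonkey_j)):
                arcs.add((name_i, name_j))
    return arcs


def has_cycle(arcs):
    """Detect a directed cycle with a depth-first search."""
    succ = {}
    for a, b in arcs:
        succ.setdefault(a, set()).add(b)
    state = {}  # node -> "open" while on the DFS stack, "done" afterwards

    def visit(n):
        if state.get(n) == "open":
            return True
        if state.get(n) == "done":
            return False
        state[n] = "open"
        if any(visit(m) for m in succ.get(n, ())):
            return True
        state[n] = "done"
        return False

    return any(visit(n) for n in list(succ))


# q1: nonkey-to-nonkey join on y; q2: cycle of nonkey-to-key joins.
q1 = [("R1", ("x",), ("y",)), ("R2", ("x'",), ("y",))]
q2 = [("R1", ("x",), ("y",)), ("R2", ("y",), ("x",))]
# A key-to-key join on x only: no arcs are introduced.
key_to_key = [("R1", ("x",), ("y",)), ("R2", ("x",), ("z",))]
```

Under this encoding, both q1 and q2 produce the arcs R1 → R2 and R2 → R1, so their join graphs are cyclic, whereas the key-to-key query produces no arcs at all.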

Notice that key-to-key joins do not introduce any arcs to the join graph. Since the

class of first-order rewritable queries that we will present shortly is defined in terms of

the join graph, its queries can have arbitrary key-to-key joins. Further, the free variables

of a query do not introduce arcs to the join graph. As a special case, if all the variables

of a query are free, then its join graph has no arcs. Such queries correspond to the

class of quantifier-free queries, and have already been shown to be first-order rewritable

[ABC99]. If we think in terms of equivalent SQL queries, the fact that all variables are

free means that every attribute of every relation in the from clause must appear in the

select clause.1 This is a strong condition which restricts the practical applicability of

the class. As an empirical observation, none of the queries in the TPC-H specification

[TPC03], the industry standard for decision support systems, satisfy this restriction. For

this reason, we will focus on a class of conjunctive queries that may have existential

quantification (in relational algebra terms, arbitrary projections). Handling queries with

existentially-quantified variables is a major challenge, which we address in this chapter.

In Figure 3.1, we show the join graphs for q1 and q2 (we label the arcs with the variable

involved in the joins for illustration purposes). Observe in the figure that both join graphs

have a cycle. For our rewriting algorithm, we will focus on queries that have an acyclic

join graph. Additionally, when we consider how two literals Ri and Rj are joined, we will

require that if any of the key attributes of Ri are joined with a nonkey attribute of Rj,

then all of the key attributes of Ri join with nonkey attributes of Rj. We will then say

that the query has only full nonkey-to-key joins. For example, in the query q3 above, of

1 The only exceptions are the attributes that are equated in the where clause. In that case, only one of the equated attributes needs to appear in the select clause.

Chapter 3. Rewritings for Conjunctive Queries 25

the form ∃x, x′, w, w′, z, z′, m. R1(x, w) ∧ R2(m, w, z) ∧ R3(x′, w′) ∧ R4(m, w′, z′), the joins

between R1 and R2, and between R3 and R4, are not full since they do not involve the

entire key of R2 and R4, respectively.

Definition 3.2. Let q be a conjunctive query. Let Ri(~xi, ~yi) and Rj(~xj, ~yj) be a pair of

literals of q. We say that there is a full nonkey-to-key join from Ri to Rj if every variable

of ~xj appears in ~yi.

We observe that if G is an acyclic join graph for a query all of whose nonkey-to-key

joins are full, then G must be a forest. We show this with the following proposition.

Proposition 3.3. Let q be a query all of whose nonkey-to-key joins are full. Let G be

the join graph of q. If G is acyclic, then G is a forest.

Proof. Assume towards a contradiction that G is a directed acyclic graph that is not a forest. Then, there is a node v in G that receives arcs from two different nodes vi and vj of G. Let R(~x, ~y), Ri(~xi, ~yi), and Rj(~xj, ~yj) be the literals at the nodes v, vi, and vj,

respectively. Since there are arcs from vi and vj to v, there are variables wi and wj in

~yi and ~yj, respectively, that appear in R. Since G is acyclic, wi and wj must appear in

~x. Also, wj cannot appear in a nonkey position of Ri (or, otherwise, there would be a

cycle between the nodes vi and vj). Since there is a nonkey-to-key join from Ri to R on

variable wi, and variable wj does not occur at a nonkey position of Ri, the join is not

full; contradiction.

3.1.3 The Class Cforest of First-Order Rewritable Queries

We will now characterize a broad class of conjunctive queries for which the problem of

computing consistent answers under key constraints is tractable and first-order rewritable.

The characterization is given in terms of the join graph of the queries. In particular, we

will require three conditions. First, all the nonkey-to-key joins of the query must be full.

Second, the join graph must be a forest. As we showed in Proposition 3.3, this includes

all queries with full nonkey-to-key joins with acyclic join graph. Finally, the query should

have no repeated relation symbols. We call this class Cforest since we require the join

graph of its queries to be a forest, and we give the formal definition next.

Definition 3.4. Let q be a conjunctive query without repeated relation symbols and all of

whose nonkey-to-key joins are full. Let G be the join graph of q. We say that q ∈ Cforest

if G is a forest (i.e., every connected component of G is a tree).
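The three conditions of Definition 3.4 can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: literals are encoded as (relation name, key terms, nonkey terms) triples, key positions are assumed to hold variables, and the names join_graph, all_joins_full, is_forest, and in_cforest are hypothetical helpers, not part of ConQuer.

```python
def join_graph(literals, existential):
    """Arcs (i, j): an existential variable at a nonkey position of literal i occurs in literal j."""
    arcs = set()
    for i, (_, _, nonkey_i) in enumerate(literals):
        for j, (_, key_j, nonkey_j) in enumerate(literals):
            if i != j and any(w in existential and w in key_j + nonkey_j for w in nonkey_i):
                arcs.add((i, j))
    return arcs

def all_joins_full(literals, existential):
    """Definition 3.2: whenever a nonkey of Ri joins the key of Rj, the whole key of Rj must be in the nonkey of Ri."""
    for i, (_, _, nonkey_i) in enumerate(literals):
        for j, (_, key_j, _) in enumerate(literals):
            joined = i != j and any(w in existential and w in key_j for w in nonkey_i)
            if joined and not set(key_j) <= set(nonkey_i):
                return False
    return True

def is_forest(n, arcs):
    """Every node has at most one incoming arc and no directed cycle exists."""
    parent = {}
    for i, j in arcs:
        if j in parent:
            return False  # node j receives arcs from two different nodes
        parent[j] = i
    for start in range(n):
        node, steps = start, 0
        while node in parent and steps < n:
            node, steps = parent[node], steps + 1
            if node == start:
                return False  # walked back to the start: directed cycle
    return True

def in_cforest(literals, existential):
    names = [name for name, _, _ in literals]
    return (len(set(names)) == len(names)
            and all_joins_full(literals, existential)
            and is_forest(len(literals), join_graph(literals, existential)))

# q4 of Example 3.4, and the query with a nonkey-to-nonkey join from Example 3.6
Q4_LITERALS = [("employee", ("e",), ("d", "c")), ("city", ("c",), ("p",)),
               ("prov", ("p",), ("Canada",)), ("dept", ("d",), ("Peter",))]
QNK_LITERALS = [("r1", ("x",), ("y",)), ("r2", ("x2",), ("y",))]
# the query with non-full joins discussed before Definition 3.2
Q3_LITERALS = [("R1", ("x",), ("w",)), ("R2", ("m", "w"), ("z",)),
               ("R3", ("x2",), ("w2",)), ("R4", ("m", "w2"), ("z2",))]
```

Under this encoding, in_cforest accepts q4 and rejects the other two queries: the nonkey-to-nonkey join yields a two-node cycle, and the joins into R2 and R4 are not full.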


Figure 3.1: Cyclic join graphs of intractable queries

A fundamental observation about Cforest is that it is a very common, practical class

of queries. Arguably, the most commonly used joins go from a set of nonkey attributes of

one relation (which may be a foreign key)2 to the key of another relation (which may be

a primary key). Furthermore, such joins typically involve the entire primary key of the

relation (and, hence, they are full joins in our terms). Finally, cycles are rarely present

in the queries used in practice. Admittedly, the restriction not to have repeated relation

symbols does rule out some common queries (those in which the same relation appears

twice in the from clause of an SQL query). Still, many queries used in practice do not

have repeated relation symbols.

As an empirical observation, only one out of 22 queries in the TPC-H specification

[TPC03], the industry standard for decision support queries, has a nonkey-to-nonkey

join. All the queries in the standard are acyclic, and all the nonkey-to-key joins of the

queries are full.

3.2 Query Rewriting Algorithm

In this section, we present the query rewriting algorithm RewriteForest that works for

the class of conjunctive queries Cforest introduced in the previous section. We start the

presentation with a number of examples that highlight some of the intuition underlying

the algorithm.

In the next example, we illustrate the rewriting for a query consisting of only one

2 Notice that we are not dealing with the problem of inconsistency with respect to foreign keys, but only with respect to key dependencies.


literal. We also show that even for such a simple query, the query itself is not a rewriting

for the problem of computing its own consistent answers.

Example 3.1. As in Example 2.1, consider a schema R with one relation symbol

employee, which has two attributes: emplKey (the name of the employee) and salary.

Furthermore, consider a set Σ consisting of only one constraint stating that the attribute

emplKey is the key of relation employee.

Let q1 be a query that retrieves all the employees from the database that make

a salary of 1000, expressed as q1(e) = employee(e, 1000). First of all, notice that q1

itself is not a query rewriting of CONSISTENT(q1, Σ). Consider a database instance I1 =

{employee(John, 1000), employee(John, 2000)}. It is easy to see that (John) ∈ q1(I1).

However, (John) ∉ consistentΣ(q1, I1) because the repair ℐ = {employee(John, 2000)} is such that (John) ∉ q1(ℐ).

Now, consider a database instance I2 = {employee(John, 1000), employee(John, 2000),

employee(Mary, 1000)}. It is easy to see that (Mary) ∈ consistentΣ(q1, I2). This is because employee Mary appears with a salary of 1000 as its nonkey value, and does not appear with any other s′ such that s′ ≠ 1000. This can be checked with a formula Qconsist(e) = ∀s′.employee(e, s′) → s′ = 1000. In fact, we will show that a query rewriting Q1 for q1 can be obtained as the conjunction of q1 and Qconsist:

Q1(e) = employee(e, 1000) ∧ ∀s′.(employee(e, s′) → s′ = 1000)
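Example 3.1 can be checked by brute force on small instances. The sketch below is an illustration only: the tuple encoding (relation, key, value) and the helper names repairs, q1, Q1, and consistent are assumptions made for this example. It enumerates every repair explicitly, which is exactly what the rewriting avoids.

```python
from itertools import product

def repairs(instance):
    """All repairs when the first attribute of every relation is its key."""
    groups = {}
    for rel, key, val in instance:
        groups.setdefault((rel, key), []).append((rel, key, val))
    # a repair keeps exactly one tuple per (relation, key) group
    return [set(choice) for choice in product(*groups.values())]

def q1(db):
    """q1(e) = employee(e, 1000)"""
    return {e for (_, e, s) in db if s == 1000}

def Q1(db):
    """employee(e, 1000) and every employee(e, s') has s' = 1000."""
    return {e for (_, e, s) in db
            if s == 1000 and all(s2 == 1000 for (_, e2, s2) in db if e2 == e)}

def consistent(query, db):
    """Answers that hold in every repair (exponential; for illustration only)."""
    return set.intersection(*(query(r) for r in repairs(db)))

I1 = {("employee", "John", 1000), ("employee", "John", 2000)}
I2 = I1 | {("employee", "Mary", 1000)}
```

On both instances, Q1 evaluated directly on the inconsistent database agrees with the brute-force consistent answers, while q1 itself also returns John.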

In the next example, we illustrate the rewriting for a conjunctive query that has a

nonkey-to-key join.

Example 3.2. Let R be a schema with two relation symbols: employee and dept. Assume that employee has two attributes: emplKey (employee name), and deptFKey (department name); and dept has two attributes deptKey (department name) and mgrName

(manager name). Assume that there are two key constraints in Σ, stating that emplKey is

the key of the relation employee, and deptKey is the key of relation dept.

Consider the query q2 that retrieves the names of all employees whose department

appears in the dept relation:

q2(e) = ∃d,m.employee(e, d) ∧ dept(d,m)

As in the previous example, q2 itself is not a query rewriting of CONSISTENT(q2, Σ).

Consider the database instance I1 = {employee(John, Sales), employee(John, Engineering), dept(Sales, Peter)}. It is easy to see that (John) ∈ q2(I1). However, we have that (John) ∉ consistentΣ(q2, I1) because the repair ℐ = {employee(John, Engineering), dept(Sales, Peter)} is such that (John) ∉ q2(ℐ).

Now, consider the following database instance I2 = {employee(John, Sales), employee(John, Engineering), dept(Sales, Peter), dept(Engineering, Tom)}. It is easy to see that (John) ∈ consistentΣ(q2, I2). This is because every nonkey value (department name) that appears together with John in some tuple (in this case, Sales and Engineering) joins with a tuple of dept. This can be checked with a formula Qconsist(e) = ∀d.employee(e, d) → ∃m.dept(d, m). We will soon show that a query rewriting Q2 for q2 can be obtained as the conjunction of q2 and Qconsist, as follows:

Q2(e) = ∃d,m.employee(e, d) ∧ dept(d,m) ∧ ∀d.(employee(e, d) → ∃m.dept(d, m))
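The same brute-force comparison can be made for Example 3.2. As before, this is a sketch under assumptions: tuples are encoded as (relation, key, value), and repairs, q2, Q2, and consistent are hypothetical helper names.

```python
from itertools import product

def repairs(instance):
    """All repairs when the first attribute of every relation is its key."""
    groups = {}
    for rel, key, val in instance:
        groups.setdefault((rel, key), []).append((rel, key, val))
    return [set(choice) for choice in product(*groups.values())]

def q2(db):
    """q2(e) = exists d, m. employee(e, d) and dept(d, m)"""
    depts = {d for (rel, d, _) in db if rel == "dept"}
    return {e for (rel, e, d) in db if rel == "employee" and d in depts}

def Q2(db):
    """q2 holds for e, and every department listed for e joins with dept."""
    depts = {d for (rel, d, _) in db if rel == "dept"}
    emp = [(e, d) for (rel, e, d) in db if rel == "employee"]
    return {e for (e, d) in emp
            if d in depts and all(d2 in depts for (e2, d2) in emp if e2 == e)}

def consistent(query, db):
    """Answers that hold in every repair (exponential; for illustration only)."""
    return set.intersection(*(query(r) for r in repairs(db)))

I1 = {("employee", "John", "Sales"), ("employee", "John", "Engineering"),
      ("dept", "Sales", "Peter")}
I2 = I1 | {("dept", "Engineering", "Tom")}
```

John is returned by q2 on I1 but is not a consistent answer there; on I2, both Q2 and the brute-force computation return John.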

We now proceed to present RewriteForest, the query rewriting algorithm for queries

in Cforest (shown in Figures 3.2, 3.3, and 3.4). Given a query q such that q ∈ Cforest

and a set of key constraints Σ (containing one key per relation), RewriteForest(q, Σ)

returns a first-order rewriting Q for the problem of obtaining the consistent answers

for q with respect to Σ. The main procedure of the algorithm is shown in Figure 3.2.

The first-order rewriting Q that it returns is obtained as the conjunction of the input

query q, and a new query called Qconsist. The query Qconsist is used to ensure that q is

satisfied in every repair. It is important to notice that Qconsist will be applied directly to

the inconsistent database (i.e., we will never explicitly generate the repairs). The query

Qconsist is obtained by recursion on the tree structure of each of the components of the

join graph of q (recall that since q is in Cforest, the join graph is a forest). The recursive

procedure is called RewriteTree, and is shown in Figure 3.3.

The first part of RewriteTree produces a rewriting Qlocal for the literal R(~x, ~y) at the

root of the input tree. This rewriting is done independently of the rest of the query, and

it is produced by the procedure RewriteLocal (shown in Figure 3.4). The query Qlocal

deals with the constants that appear in ~y in the same way as we illustrated in Example

3.1. It also deals with the free variables that appear at nonkey positions of the query in

the way that we illustrate in the next example.

Example 3.3. Consider the query q3 that retrieves all employees and their salaries from

the database, expressed as q3(e, s) = employee(e, s). Notice that the only difference with

the query q1 of Example 3.1 is that the constant 1000 is replaced by the free variable


Algorithm RewriteForest(q, Σ)
Input: q(~z), a query of the form ∃~w.φ(~w, ~z)
       Σ, a set of key constraints, one per relation used in q
Output: Q, a first-order query that computes consistentΣ(q, I) for every database I

Let G be the join graph of q
Let T1, . . . , Tm be the connected components of G
for i := 1 to m do
    Let Ri(~xi, ~yi) be the literal at the root of Ti
    Let φi be the conjunction of literals of Ti
    Let ~wi = {w : w is a variable that occurs in φi and ~w, and w ∉ ~xi}
    Let ~zi = {z : z is a variable that occurs in φi and ~z, and z ∉ ~xi}
    Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi)
    Let Qi(~xi, ~zi) = RewriteTree(qi, Σ)
end for
Let Qconsist(~w, ~z) = ∧i=1...m Qi(~xi, ~zi)
Let Q(~z) = ∃~w.(φ(~w, ~z) ∧ Qconsist(~w, ~z))
return Q

Figure 3.2: Query rewriting algorithm for conjunctive queries in Cforest

s. The algorithm RewriteLocal creates a new, universally-quantified variable s′ for the

free variable s, and equates s′ to s. The resulting query rewriting for q3 is the following:

Q3(e, s) = employee(e, s) ∧ ∀s′.employee(e, s′) → s′ = s
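The equality-collection loop of RewriteLocal (Figure 3.4) can be sketched as follows. This is a minimal illustration under assumptions: nonkey terms are encoded as ("const", value) or ("var", name) pairs, the fresh variable for position p is named vp′ (our choice for σ), and local_equalities is a hypothetical name. It returns the set Eq as a list of strings rather than building the full formula Qlocal.

```python
def local_equalities(key_vars, nonkey_terms, free_vars):
    """Collect the equalities Eq for a literal R(x, y), following Figure 3.4."""
    def sigma(p):
        # hypothetical naming scheme for the fresh variable of position p
        return f"v{p}'"

    eq = []
    for p, (kind, term) in enumerate(nonkey_terms, start=1):
        z = sigma(p)
        if kind == "const":
            eq.append(f"{z} = {term}")      # constant selection at position p
            continue
        if term in key_vars or term in free_vars:
            eq.append(f"{z} = {term}")      # variable also in the key or free
        for p2, (kind2, term2) in enumerate(nonkey_terms, start=1):
            if p2 != p and kind2 == "var" and term2 == term:
                eq.append(f"{z} = {sigma(p2)}")  # same variable repeated in y
    return eq

# q1(e) = employee(e, 1000) from Example 3.1
EQ_Q1 = local_equalities(["e"], [("const", 1000)], ["e"])
# q3(e, s) = employee(e, s) from Example 3.3
EQ_Q3 = local_equalities(["e"], [("var", "s")], ["e", "s"])
# exists d. employee(e, d): nothing to equate, so Eq is empty
EQ_PROJ = local_equalities(["e"], [("var", "d")], ["e"])
```

For q3 the only equality equates the fresh variable with s, matching Q3 above; when Eq is empty, RewriteLocal returns the literal without a universally quantified part.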

The second part of RewriteTree recursively creates a query Qi for each subtree Ti

of T rooted at R. Let ~y0 be the variables at nonkey positions of R (excluding those

that also appear in ~x). Then, one of the conjuncts of the rewritten query returned by

RewriteTree is of the form ∀~y0.R(~x, ~y) → ∧i=1...m Qi(~xi, ~zi). Notice that the variables of

~y0 (i.e., the variables at nonkey positions of the root literal R) are universally quantified.

The intuition behind this is that, as we illustrated in Example 3.2, the query must

be satisfied by all the nonkey values of a given key (in that example, all the possible

departments for the given employee).


Algorithm RewriteTree(q, Σ)
Input: q(~x, ~z), a query in Cforest of the form ∃~w.φ(~x, ~w, ~z),
       whose join graph T is a tree with root literal R(~x, ~y)
       Σ, a set of key constraints, one per relation
Output: Q, a first-order query that computes consistentΣ(q, I) for every database I

Let T be the join graph of q
Let R(~x, ~y) be the literal at the root node of T
Let qlocal(~x, ~z) = ∃~w.R(~x, ~y)
Let Qlocal(~x, ~z) = RewriteLocal(qlocal, Σ)
if φ has exactly one literal then
    Q = Qlocal
else
    Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the children of R in T
    for i := 1 to m do
        Let Ti be the subtree of T rooted at Ri
        Let φi be the conjunction of literals of Ti
        Let ~wi = {w : w is a variable that occurs in φi and ~w, and w ∉ ~xi}
        Let ~zi = {z : z is a variable that occurs in φi and ~z, and z ∉ ~xi}
        Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi)
        Let Qi(~xi, ~zi) = RewriteTree(qi, Σ)
    end for
    Let ~y0 = {y : y is a variable that occurs in ~y and ~w, and y ∉ ~x}
    Let Q(~x, ~z) = Qlocal(~x, ~z) ∧ ∀~y0.R(~x, ~y) → ∧i=1...m Qi(~xi, ~zi)
end if
return Q

Figure 3.3: Recursive algorithm on the tree structure of the join graph

The next example illustrates an application of the algorithm.

Example 3.4. Let R be a schema with four relation symbols: employee, dept, city,

and prov. Assume that employee has three attributes: emplKey (employee name),

cityFKey (city name), and deptFKey (department name); dept has two attributes:

deptKey (department name) and mgrName (manager’s name); city has two attributes:

cityKey and provFKey; and prov has two attributes: provKey (province name) and

countryName (country name). Assume that there are four key constraints in Σ, stating

that emplKey is the key of the relation employee; cityKey is the key of relation city;

deptKey is the key of the relation dept; and provKey is the key of the relation prov.

Consider a query q4 that retrieves the names of all employees that are located in


Algorithm RewriteLocal(q, Σ)
Input: q(~x, ~z), a query of the form ∃~w.R(~x, ~y), where
       none of the variables of ~w appear in ~x
       Σ, a set of key constraints

Let σ be an injective function mapping natural numbers to variables not present in R
Initialize Eq as an empty set
for each position p of ~y do
    Let w be the variable that appears at position p of ~y
    Let z = σ(p)
    if there is a constant d at position p of ~y then
        Add the equality z = d to Eq
    end if
    if w appears in ~x or w appears in ~z then
        Add the equality z = w to Eq
    end if
    for every position p′ of ~y such that p ≠ p′ and w occurs in ~y at position p′ do
        Let z′ = σ(p′)
        Add the equality z = z′ to Eq
    end for
end for
if Eq ≠ ∅ then
    Let ~y∗ be a vector of variables of the same arity as ~y, and
        such that if z is at position p of ~y∗, then σ(p) = z
    Let Qeq be the conjunction of the equalities of Eq
    Let Qlocal(~x, ~z) = ∃~w.R(~x, ~y) ∧ ∀~y∗.R(~x, ~y∗) → Qeq
else
    Let Qlocal(~x, ~z) = ∃~w.R(~x, ~w)
end if
return Qlocal

Figure 3.4: Query rewriting for a given literal


Figure 3.5: Join graph of query q4.

Canada and whose manager is Peter:

q4(e) = ∃d, c, m, p. employee(e, d, c) ∧ city(c, p) ∧ prov(p, Canada) ∧ dept(d, Peter)

The join graph of q4 is given in Figure 3.5. Notice that the join graph of q4 is a tree.

Furthermore q4 has full nonkey-to-key joins and no repeated relation symbols. Thus, q4

is in Cforest.

Let q′′ be the query q′′(c) = ∃p.city(c, p) ∧ prov(p, Canada); let q′′′ be the query

q′′′(p) = prov(p, Canada); and let qIV(d) = dept(d, Peter). The first-order query rewriting Q4 of q4 is obtained by applying the algorithm RewriteForest(q4, Σ) as follows.

Q4(e) = ∃d, c, m, p. employee(e, d, c) ∧ dept(d, Peter) ∧ city(c, p) ∧ prov(p, Canada) ∧ Qconsist(e)

where:

Qconsist(e) = RewriteTree(q4, Σ) =

∃d, c.employee(e, d, c) ∧ ∀d, c.(employee(e, d, c) → (Q′′(c) ∧ QIV(d)))

Q′′(c) = RewriteTree(q′′, Σ) =

∃p.city(c, p) ∧ ∀p.(city(c, p) → Q′′′(p))

Q′′′(p) = RewriteTree(q′′′, Σ) =

prov(p, Canada) ∧ ∀w′.(prov(p, w′) → w′ = Canada)

QIV(d) = RewriteTree(qIV, Σ) =

dept(d, Peter) ∧ ∀u′.(dept(d, u′) → u′ = Peter)

Notice the reuse of variables in the rewritten queries. In particular, each existentially-

quantified variable of q4 that appears at a nonkey position in a literal of q4 is first

existentially quantified, and then universally quantified in the rewriting Q4.
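To see the nested structure of Q4 at work, the following sketch evaluates it directly on an inconsistent instance and compares the result with a brute-force computation over all repairs. The instance I4 and all helper names are hypothetical; tuples are encoded with the relation name first and the key second, and the attribute order of employee follows the literal employee(e, d, c).

```python
from itertools import product

def repairs(instance):
    """All repairs when the first attribute of every relation is its key."""
    groups = {}
    for t in instance:
        groups.setdefault((t[0], t[1]), []).append(t)
    return [set(choice) for choice in product(*groups.values())]

def rows(db, rel, key):
    return [t for t in db if t[0] == rel and t[1] == key]

def Q_prov(db, p):   # Q'''(p): some prov tuple for p, and all of them say Canada
    r = rows(db, "prov", p)
    return bool(r) and all(t[2] == "Canada" for t in r)

def Q_dept(db, d):   # QIV(d): some dept tuple for d, and all of them say Peter
    r = rows(db, "dept", d)
    return bool(r) and all(t[2] == "Peter" for t in r)

def Q_city(db, c):   # Q''(c): some city tuple for c, and every province passes Q'''
    r = rows(db, "city", c)
    return bool(r) and all(Q_prov(db, t[2]) for t in r)

def Q4(db):
    # Qconsist(e) already implies the body of q4, so it is enough to test it
    keys = {t[1] for t in db if t[0] == "employee"}
    return {e for e in keys
            if all(Q_city(db, t[3]) and Q_dept(db, t[2]) for t in rows(db, "employee", e))}

def q4(db):
    emp = [t for t in db if t[0] == "employee"]
    return {e for (_, e, d, c) in emp
            if ("dept", d, "Peter") in db
            and any(t[0] == "city" and t[1] == c and ("prov", t[2], "Canada") in db
                    for t in db)}

def consistent(query, db):
    return set.intersection(*(query(r) for r in repairs(db)))

# hypothetical instance: John and Mary each appear with two cities
I4 = {("employee", "John", "Sales", "Toronto"), ("employee", "John", "Sales", "Montreal"),
      ("employee", "Mary", "Sales", "Toronto"), ("employee", "Mary", "Sales", "Boston"),
      ("dept", "Sales", "Peter"),
      ("city", "Toronto", "Ontario"), ("city", "Montreal", "Quebec"),
      ("city", "Boston", "Massachusetts"),
      ("prov", "Ontario", "Canada"), ("prov", "Quebec", "Canada"),
      ("prov", "Massachusetts", "USA")}
```

In this instance John is located in Canada in every repair, while Mary is not; both computations agree on that.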


Recall that queries with repeated relation symbols are not allowed in the class Cforest.

We now give an example of a query with repeated relation symbols for which our algorithm fails to give the consistent answers. Although not addressed in this work, it

would be interesting to characterize the class of queries with repeated relation symbols

for which our algorithm is indeed correct.

Example 3.5. Let R be a schema with one relation symbol r, which has three attributes:

A,B, C. Assume that A is the key of the relation r. Let q be the Boolean query

q = ∃x, y, z.r(x, y, a) ∧ r(y, z, b), where a and b are constants. If we apply our query

rewriting algorithm, we obtain the following:

Q = ∃x, y, z.r(x, y, a) ∧ r(y, z, b) ∧ ∀y′, z′.(r(x, y′, z′) → z′ = a) ∧

∀y.(r(x, y, a) → ∃z.r(y, z, b) ∧ ∀z′, w′.(r(y, z′, w′) → w′ = b))

Let I be the database instance I = {r(c, d, a), r(d, e, b), r(d, f, a), r(f, g, b)}. In this case, there are two repairs of I with respect to Σ: I1 = {r(c, d, a), r(d, e, b), r(f, g, b)} and I2 = {r(c, d, a), r(d, f, a), r(f, g, b)}. Clearly, I1 |= q and I2 |= q, so consistentΣ(q, I) = true. However, I ⊭ Q.
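The counterexample of Example 3.5 can be verified by enumeration. In the sketch below (an illustration with hypothetical helper names), tuples of r are encoded as triples with the key first, and the constants a and b are the strings "a" and "b".

```python
from itertools import product

def repairs(instance):
    """All repairs of r when its first attribute is the key."""
    groups = {}
    for t in instance:
        groups.setdefault(t[0], []).append(t)
    return [set(choice) for choice in product(*groups.values())]

def q(db):
    """q = exists x, y, z. r(x, y, a) and r(y, z, b)"""
    return any(c1 == "a" and c2 == "b" and y == y2
               for (_, y, c1) in db for (y2, _, c2) in db)

def Q(db):
    """The rewriting produced for q, evaluated on the given instance."""
    def child_ok(y):
        # exists z. r(y, z, b), and every tuple with key y has b in its last position
        r = [t for t in db if t[0] == y]
        return any(t[2] == "b" for t in r) and all(t[2] == "b" for t in r)

    def root_ok(x):
        # every tuple with key x ends in a, and every y joined from x passes child_ok
        r = [t for t in db if t[0] == x]
        return (any(t[2] == "a" for t in r) and all(t[2] == "a" for t in r)
                and all(child_ok(t[1]) for t in r if t[2] == "a"))

    # root_ok(x) already implies r(x, y, a) and r(y, z, b) for some y, z
    return any(root_ok(x) for (x, _, _) in db)

I = {("c", "d", "a"), ("d", "e", "b"), ("d", "f", "a"), ("f", "g", "b")}
```

Every repair satisfies q, yet Q is false on I, so the rewriting is not correct for this query with a repeated relation symbol.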

We finish this section by pointing out that the complexity of the query rewriting

algorithm is linear in the number of literals of the input query. To see this, notice that

the algorithm visits each node of the join graph exactly once.

3.3 Correctness of the Algorithm

In this section, we show that the algorithm RewriteForest presented in the previous

section is correct for all queries in the class Cforest. In particular, we prove the following

theorem.

Theorem 3.5. Let R be a schema. Let Σ be a set of integrity constraints, consisting of

one key dependency per relation of R. Let q(~z) be a conjunctive query over R such that

q ∈ Cforest. Let Q(~z) be the first-order query returned by RewriteForest(q, Σ). Let I be

an instance over R.

Then, ~t ∈ Q(I) iff ~t ∈ consistentΣ(q, I).

Our proof relies on a few simple properties of repairs of inconsistent databases where

the set of integrity constraints contains a single key dependency per relation. We establish


these properties in Section 3.3.1. In Section 3.3.2, we show a structural property of the

queries in Cforest that is important in order to guarantee the correctness of the algorithms

RewriteTree and RewriteForest: the literals from distinct trees of the join graph may

only share variables that appear as key attributes at the root of their trees.

In Section 3.3.3, we introduce the notion of a “pessimistic” repair. The name comes

from the fact that, for a given query q and database I, if a tuple fails to satisfy the query

on some repair, then it also fails to satisfy the query on the pessimistic repair. More

precisely, for any inconsistent database I, there is a repair M such that if M |= q(~c),

then consistentΣ(q(~c), I) = true. This enables the algorithm to independently consider

each instantiation of the variables for the key of the root literal.

We then proceed to prove the correctness of the building blocks of the rewriting

algorithm. First, in Section 3.3.4, we prove the correctness of the module RewriteLocal,

for “atomic” queries, that is, queries with a single literal (and hence no joins). In Section

3.3.5, we prove the correctness of the recursive algorithm RewriteTree that works on

queries whose join graph is a tree. Finally, in Section 3.3.6, this is generalized to the case

of queries whose join graph is a forest, which gives the correctness proof for the rewriting

algorithm RewriteForest for conjunctive queries in class Cforest.

3.3.1 Properties of Repairs

We first show a few important properties of repairs when the set of integrity constraints

consists of one key dependency per relation. These properties will be used throughout

the proofs of this and the next chapter.

Proposition 3.6. Let I be a database instance. Let ℐ be a repair of I wrt Σ. Then ℐ ⊆ I.

Proof. Let I′ be an instance such that I′ |= Σ. Assume that there is a tuple ~t such that ~t ∈ I′ and ~t ∉ I. Let I′′ = I′ − {~t}. It is easy to see that by removing tuples from an instance, we do not introduce violations with respect to a set of key dependencies. Hence, I′′ |= Σ. Clearly, ∆(I, I′′) ⊂ ∆(I, I′). Therefore, I′ is not a repair of I wrt Σ.

Proposition 3.7. Let I be an instance. Let ℐ be a repair of I wrt Σ. Let R(~c, ~d) be a tuple of I. Then, there exists some ~d′ such that R(~c, ~d′) is a tuple of ℐ.

Proof. Let I′ be an instance such that I′ |= Σ and R(~c, ~d′) ∉ I′, for every ~d′. Let I′′ = I′ ∪ {R(~c, ~d)}. Since R(~c, ~d′) ∉ I′ for every ~d′, I′′ |= Σ. Clearly, ∆(I, I′′) = ∆(I, I′) − {R(~c, ~d)}. Since ∆(I, I′′) ⊂ ∆(I, I′), I′ is not a repair of I wrt Σ.

Proposition 3.8. Let I be an instance. Let R(~c, ~d) be a tuple of I. Then, there exists some repair ℐ of I such that R(~c, ~d) ∈ ℐ.

Proof. Let I∗ be a repair of I wrt Σ. By Proposition 3.7, there exists ~d′ such that R(~c, ~d′) ∈ I∗. Let I′ = (I∗ − {R(~c, ~d′)}) ∪ {R(~c, ~d)}. Since I∗ is a repair, I∗ |= Σ. Since I′ does not introduce any violation to the key dependencies of Σ, I′ |= Σ. Assume that I′ is not a repair of I. Then, there exists a repair I∗∗ of I such that ∆(I, I∗∗) ⊂ ∆(I, I′). By Proposition 3.6, I∗ ⊆ I, and thus I′ ⊂ I. Furthermore, by Proposition 3.6, I∗∗ ⊆ I. Thus, I − I∗∗ ⊂ I − I′. Therefore, I′ ⊂ I∗∗. Let I′′ = (I∗∗ − {R(~c, ~d)}) ∪ {R(~c, ~d′)}. Clearly, I∗ ⊂ I′′. Thus, I∗ is not a repair; contradiction.
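Propositions 3.7 and 3.8 can be observed on a small instance. The sketch below is only an illustration: it takes Proposition 3.6 for granted and searches among subsets of I, where minimizing ∆(I, ·) amounts to being a maximal subset that satisfies the key constraints. The helper names and the tuple encoding (relation, key, value) are assumptions.

```python
from itertools import combinations

def satisfies_keys(db):
    """True if no two tuples share a relation and a key value."""
    keys = [(rel, key) for (rel, key, _) in db]
    return len(keys) == len(set(keys))

def repairs_by_definition(instance):
    """Maximal key-consistent subsets of the instance (exponential; tiny inputs only)."""
    tuples = sorted(instance)
    subsets = [set(c) for n in range(len(tuples) + 1) for c in combinations(tuples, n)]
    consistent = [s for s in subsets if satisfies_keys(s)]
    return [s for s in consistent if not any(s < other for other in consistent)]

def keys_of(db):
    return {(rel, key) for (rel, key, _) in db}

# instance I2 of Example 3.2
I = {("employee", "John", "Sales"), ("employee", "John", "Engineering"),
     ("dept", "Sales", "Peter"), ("dept", "Engineering", "Tom")}
REPAIRS = repairs_by_definition(I)

# Proposition 3.7: every key value of I survives in every repair
PROP_3_7 = all(keys_of(r) == keys_of(I) for r in REPAIRS)
# Proposition 3.8: every tuple of I belongs to some repair
PROP_3_8 = set().union(*REPAIRS) == I
```

Together, the two properties say that a repair is obtained by keeping exactly one tuple per key value, which is the characterization the proofs in this chapter rely on.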

3.3.2 A Structural Property of Cforest

In the next lemma, we show a structural property of the queries in Cforest that is important

in order to guarantee the correctness of the algorithm. In particular, we show that distinct

trees of the join graph may only share free variables (which do not contribute arcs to the

join graph) or variables that appear as key attributes at the root of their trees.

Lemma 3.9. Let q(~z) be a query such that q ∈ Cforest. Let G be the join graph of q.

Let Ti and Tj be distinct connected components of G. Let Ri(~xi, ~yi) and Rj(~xj, ~yj) be the

literals at the roots of Ti and Tj, respectively. Let w be a variable that occurs in a literal

of both Ti and Tj. Then, either w is free (w ∈ ~z) or w is in the key of the roots of both

trees (w ∈ ~xi ∩ ~xj).

Proof. Let ~wi = {w : w is a variable that occurs in some literal of Ti, w ∉ ~xi and w ∉ ~z}. Let ~wj = {w : w is a variable that occurs in some literal of Tj, w ∉ ~xj and w ∉ ~z}. Assume that there is some variable w such that w appears in ~wi and ~wj. Let S1(~u1, ~v1) and S2(~u2, ~v2) be literals of Ti and Tj, respectively, such that w appears in S1 and S2.

We must now consider the next two cases. First, suppose that w occurs in ~v1. Then,

by definition of join graph, there is an arc from S1 to S2 in G. But S1 and S2 are in

distinct connected components of G; contradiction. Second, suppose that w occurs in

~u1. By definition of ~wi, S1 is not at the root of Ti (i.e., S1 ≠ Ri). Hence, there must

be a nonkey-to-key join from another literal, S3(~u3, ~v3), in Ti to S1. Since q is in Cforest,


all the nonkey-to-key joins of q are full. Thus, the variable w also appears in a nonkey

position in ~v3. Hence, there must be an arc in the join graph from S3 to S2. But S2 and

S3 are in distinct connected components of G; contradiction.

3.3.3 A “Pessimistic” Repair

In this subsection, we introduce the notion of a “pessimistic” repair. The name comes

from the fact that, for a given query q (in a class that we will define shortly) and database

I, if a tuple fails to satisfy the query on some repair, then it also fails to satisfy the query

on the pessimistic repair. More precisely, for every inconsistent database I, there is a

repair M such that if ~c ∈ q(M), then ~c ∈ consistentΣ(q, I). This is a fundamental

property for the following reason. Consider a Boolean query q = ∃~x, ~w.φ(~x, ~w) and a

query q′(~x) = ∃~w.φ(~x, ~w). That is, q and q′ have the same literals, but some of the

(existentially-quantified) variables of q are free in q′. Suppose that we would like to

check whether consistentΣ(q, I) = true. This holds if, for every repair ℐ of I, ℐ |= q. In particular, since M is a repair of I, M |= q. Thus, there is some ~c such that ~c ∈ q′(M).

By Lemma 3.10 below, it follows that ~c ∈ consistentΣ(q′, I). This property will be

exploited in the design of our algorithms in order to check the consistency of each tuple

of ~x independently. Notice that the property does not hold in general for conjunctive

queries, as we show in the next example. However, it does hold for the queries that

satisfy the conditions of Lemma 3.10.

Example 3.6. Consider a schema R with two binary relations r1 and r2. Consider a set Σ

that consists of a key dependency for r1 and a key dependency for r2 (the key dependencies

will be obvious from the queries). Let qnk be the Boolean query ∃x, x′, y.r1(x, y) ∧ r2(x′, y). Notice that qnk is not in Cforest because it contains a nonkey-to-nonkey join. Let I be an instance such that I = {r1(a1, b1), r1(a1, b2), r1(a2, b3), r1(a2, b4), r1(a3, b5), r1(a3, b3), r2(c1, b1), r2(c1, b3), r2(c2, b4), r2(c2, b5), r2(c3, b2), r2(c3, b3)}. It can be checked that for every repair ℐ of I, ℐ |= qnk.

Now, consider the query q′nk(x) = ∃x′, y.r1(x, y) ∧ r2(x′, y). That is, qnk and q′nk differ only in the fact that x is existentially-quantified in the former, and free in the latter. Let I1 be a repair of I such that I1 = {r1(a1, b1), r1(a2, b3), r1(a3, b5), r2(c1, b3), r2(c2, b4), r2(c3, b3)}. Let I2 be a repair of I such that I2 = {r1(a1, b1), r1(a2, b3), r1(a3, b5), r2(c1, b1), r2(c2, b4), r2(c3, b2)}. Notice that (a1) ∉ q′nk(I1), (a2) ∉ q′nk(I2), and (a3) ∉ q′nk(I1). Thus, even though consistentΣ(qnk, I) = true, we have that (a) ∉ consistentΣ(q′nk, I), for every a. Therefore, it is not possible to check whether consistentΣ(qnk, I) = true by independently checking each instantiation of the free variables of q′nk.
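The claims of Example 3.6 can be confirmed by enumerating the 64 repairs of I. The sketch below uses a hypothetical encoding of tuples as (relation, key, value) and hypothetical helper names.

```python
from itertools import product

def repairs(instance):
    """All repairs when the first attribute of every relation is its key."""
    groups = {}
    for rel, key, val in instance:
        groups.setdefault((rel, key), []).append((rel, key, val))
    return [set(choice) for choice in product(*groups.values())]

def q_nk(db):
    """Boolean query: exists x, x', y. r1(x, y) and r2(x', y)"""
    ys = {y for (rel, _, y) in db if rel == "r2"}
    return any(rel == "r1" and y in ys for (rel, _, y) in db)

def q_nk_free(db):
    """Same body with x free: the set of x such that r1(x, y) and r2(x', y)"""
    ys = {y for (rel, _, y) in db if rel == "r2"}
    return {x for (rel, x, y) in db if rel == "r1" and y in ys}

I = {("r1", "a1", "b1"), ("r1", "a1", "b2"), ("r1", "a2", "b3"), ("r1", "a2", "b4"),
     ("r1", "a3", "b5"), ("r1", "a3", "b3"), ("r2", "c1", "b1"), ("r2", "c1", "b3"),
     ("r2", "c2", "b4"), ("r2", "c2", "b5"), ("r2", "c3", "b2"), ("r2", "c3", "b3")}

ALL_REPAIRS = repairs(I)
BOOLEAN_IS_CONSISTENT = all(q_nk(r) for r in ALL_REPAIRS)
CONSISTENT_X = set.intersection(*(q_nk_free(r) for r in ALL_REPAIRS))
```

The Boolean query holds in every repair, but no single value of x is an answer in all of them, which is why the per-key check used by the rewriting is unavailable for nonkey-to-nonkey joins.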

The result that we give below assumes an input query q(~x) that is in Cforest, whose

join graph T is a tree, and whose free variables ~x are exactly the variables of the key of T's

root. In the algorithm RewriteForest, the input query will be broken into subqueries

that satisfy this condition.

Lemma 3.10. Let q(~x) be a query in Cforest, whose join graph T is a tree and where

R(~x, ~y) is the literal at the root of T. Let I be an instance. Then, there is a repair M such that for all ~c, if ~c ∈ q(M), then ~c ∈ consistentΣ(q, I).

Proof. Let M be the instance built by invoking the procedure BuildPessimisticRepair(q, I) given in Figure 3.6. Assume that q is of the form q(~x) = ∃~w.φ(~w, ~x). We will prove the claim by induction on the number of literals of φ.

Base case. Assume that φ consists of exactly one literal R(~x, ~y). Assume that M |= q(~x)[~x/~c]. Let ~t be the tuple selected by the algorithm in the iteration for literal R and the vector of values ~c; since ~t is the only tuple of M with key ~c, {~t} |= ∃~w.R(~x, ~y)[~x/~c]. Assume towards a contradiction that consistentΣ(∃~w.R(~x, ~y)[~x/~c], I) = false. Then, there is some repair ℐ of I such that ℐ ⊭ ∃~w.R(~x, ~y)[~x/~c]. Since ~t ∈ I and ℐ is a repair of I, by Proposition 3.7, there is some tuple ~t′ in ℐ and some ~d′ such that ~t′ = R(~c, ~d′). Since ℐ ⊭ ∃~w.R(~x, ~y)[~x/~c], we have that {~t′} ⊭ ∃~w.R(~x, ~y)[~x/~c].

Notice that ~t and ~t′ can be added to M only during the iteration for the vector of values ~c. Since {~t} |= ∃~w.R(~x, ~y)[~x/~c] and {~t′} ⊭ ∃~w.R(~x, ~y)[~x/~c], the algorithm never selects tuple ~t. But ~t ∈ M; contradiction.

Inductive step. Assume that φ has more than one literal. Let T1, . . . , Tm be the

subtrees of T such that the root of Tj is a child of the root of T , for 1 ≤ j ≤ m. For each

1 ≤ j ≤ m, let Sj(~xj, ~yj) be the literal at the root of Tj. Let φj be the conjunction of

the literals of Tj. Let ~wj = {w : w is a variable of φj, and w ∉ ~xj}. Let qj(~xj) = ∃~wj.φj(~xj, ~wj).

Let Mj = BuildPessimisticRepair(qj, I).

Assume that M |= q(~x)[~x/~c]. Let ~t be the tuple of I selected by the algorithm in

the iteration for literal R and the vector of values ~c. Then, ~t ∈ M, and there is some

~d such that ~t = R(~c, ~d). Since M |= q(~x)[~x/~c], we have that for every j such that

1 ≤ j ≤ m, there is some valuation ν for the variables of ~y, and some ~cj such that

ν(~y) = ~d, ν(~xj) = ~cj, and Mj |= qj(~xj)[~xj/~cj].


Algorithm BuildPessimisticRepair
Input: q(~x), a query in Cforest of the form ∃~w.φ(~w, ~x),
       whose join graph T is a tree with root literal R(~x, ~y)
       Σ, a set of key constraints, one per relation
       I, an instance
Output: M, a repair of I

Initialize M as an empty instance
if φ has exactly one literal then
    for each ~c such that there is some R(~c, ~d) in I do
        if there is some ~d such that R(~c, ~d) ∈ I,
           and {R(~c, ~d)} ⊭ ∃~w.R(~x, ~y)[~x/~c] then
            Let ~t = R(~c, ~d)
        else
            Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d
        end if
        Add ~t to M
    end for
else
    /* φ has more than one literal */
    Let S1, . . . , Sm be the children of R in T
    for j := 1 to m do
        Let Tj be the subtree of T whose root is Sj
        Let φj be the conjunction of literals of Tj
        Let ~wj = {w : w is a variable that occurs in φj and ~w, and w ∉ ~xj}
        Let qj(~xj) = ∃~wj.φj(~xj, ~wj)
        Let Mj = BuildPessimisticRepair(qj, I)
        Add Mj to M
    end for
    for each ~c such that there is some R(~c, ~d) in I do
        if there is some ~d, some j, some valuation ν for the variables of ~y,
           and some ~cj such that R(~c, ~d) ∈ I, ν(~y) = ~d, ν(~xj) = ~cj, and
           Mj ⊭ qj(~xj)[~xj/~cj] then
            Let ~t = R(~c, ~d)
        else
            Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d
        end if
        Add ~t to M
    end for
end if

Figure 3.6: Algorithm to construct a “pessimistic” repair


Assume towards a contradiction that consistentΣ(q(~x)[~x/~c], I) = false. Then, there is some repair ℐ of I such that ℐ ⊭ q(~x)[~x/~c]. Since ~t ∈ I and ℐ is a repair of I, by Proposition 3.7, there is some tuple ~t′ in ℐ and some ~d′ such that ~t′ = R(~c, ~d′). By Lemma 3.9, none of the variables of ~wi appear in ~wj, for every i and j such that i ≠ j, 1 ≤ i ≤ m, 1 ≤ j ≤ m. Thus, there is some j, some valuation ν for the variables of ~y, and some tuple of values ~c′j such that 1 ≤ j ≤ m, ℐ ⊭ qj(~xj)[~xj/~c′j], ν(~y) = ~d′, and ν(~xj) = ~c′j. Thus, consistentΣ(qj(~xj)[~xj/~c′j], I) = false. By inductive hypothesis, Mj ⊭ qj(~xj)[~xj/~c′j]. Since Mj |= qj(~xj)[~xj/~cj], the algorithm never selects ~t in the construction of M. But ~t ∈ M; contradiction.
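For a single-literal query, the construction of the pessimistic repair is easy to sketch. The code below is an illustration of the base case of BuildPessimisticRepair only; the function name, the satisfies callback, and the use of min to pick "any tuple" deterministically are assumptions of this sketch.

```python
def pessimistic_repair(instance, satisfies):
    """Base case of Figure 3.6: per key, prefer a tuple that falsifies the literal."""
    groups = {}
    for t in instance:
        groups.setdefault((t[0], t[1]), []).append(t)
    repair = set()
    for tuples in groups.values():
        failing = [t for t in tuples if not satisfies(t)]
        # "any tuple" in the figure; min() just makes the choice deterministic
        repair.add(min(failing) if failing else min(tuples))
    return repair

def q1(db):
    """q1(e) = employee(e, 1000) from Example 3.1"""
    return {e for (_, e, s) in db if s == 1000}

# instance I2 of Example 3.1
I2 = {("employee", "John", 1000), ("employee", "John", 2000),
      ("employee", "Mary", 1000)}
M = pessimistic_repair(I2, lambda t: t[2] == 1000)
```

Here M keeps John with salary 2000, so q1(M) contains only Mary, which is exactly the set of consistent answers of q1 on I2.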

3.3.4 Correctness of RewriteLocal

We now give a correctness proof of RewriteLocal, the module of the algorithm that

handles “atomic” queries, that is, queries with a single literal (and hence no joins). These

atomic queries may have arbitrary selections and projections on any subset of the nonkey

attributes (more precisely, any of the nonkey attributes may be projected out of the

query result). We consider here only equality selections, but it is quite easy to see how to

extend the algorithm and the proof to more general selection conditions (including not

only inequalities, but also arbitrary first-order expressions relating the variables of the

literal).

Lemma 3.11. Let q(~x, ~z) be a query of the form ∃~w.R(~x, ~y). Let I be a database instance.

Let Qlocal(~x, ~z) be the first-order query returned by RewriteLocal(q, Σ).

Then, (~c,~t) ∈ Qlocal(I) iff (~c,~t) ∈ consistentΣ(q, I).

Proof. (⇒) Assume that I |= Qlocal(~x, ~z)[~x/~c][~z/~t]. Then, there is a tuple R(~c, ~d) such

that {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Assume towards a contradiction that

consistentΣ(∃~w.R(~x, ~y)[~x/~c][~z/~t], I) = false. Then, there is some repair I such that

I 6|= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Proposition 3.7, there is a tuple R(~c, ~d′) in I.

Following the construction of Qlocal in RewriteLocal, let σ be an injective function

that maps natural numbers to variables not present in R. Let ~y∗ be a vector of variables

of the same arity as ~y and such that if z is at position p of ~y∗, then σ(p) = z. Let ν and

ν ′ be valuations for the variables of ~x and ~y∗ such that ν(~x) = ~c, ν(~y∗) = ~d, ν ′(~x) = ~c,

and ν ′(~y∗) = ~d′.

Since {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t] and {R(~c, ~d′)} 6|= ∃~w.R(~x, ~y)[~x/~c][~z/~t], there

is some variable z at some position p of ~y∗ such that

1. ν(z) ≠ ν′(z), and there is a constant at position p in ~y; or

2. ν(z) ≠ ν′(z), and there is some variable w such that w occurs at position p of ~y,

and w occurs in either ~x or ~z; or

3. there are variables w and z′, and a position p′ such that w occurs at position p of

~y, w occurs at position p′ of ~y, p ≠ p′, z′ = σ(p′), and ν′(z) ≠ ν′(z′).

Assume (1) that there is a constant d at position p in ~y. Since

{R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t], ν(z) = d. Since ν(z) 6= ν ′(z), there is a constant d′

such that d ≠ d′ and ν′(z) = d′. Notice in the algorithm RewriteLocal that since I |= Qlocal(~x, ~z)[~x/~c][~z/~t], we have that I |= ∀~y∗.R(~x, ~y∗) → z = d. Since I ⊆ I, R(~c, ~d′) ∈ I.

Thus, {R(~c, ~d′)} |= ∀~y∗.R(~x, ~y∗) → z = d. Therefore, ν ′(z) = d; contradiction.

Assume (2) that there is some variable w such that w occurs at position p of ~y,

and w occurs in either ~x or in ~z. Let c = ν(w). Since {R(~c, ~d)} |= ∃~w.R(~x, ~y∗)[~x/~c][~z/~t],

ν(z) = c. Since ν(z) 6= ν ′(z), ν ′(z) 6= c. Notice in the algorithm RewriteLocal that since

I |= Qlocal(~x, ~z)[~x/~c][~z/~t], we have that I |= ∀~y∗.R(~x, ~y∗) → z = w[w/c]. Since I ⊆ I,

R(~c, ~d′) ∈ I. Thus, {R(~c, ~d′)} |= ∀~y∗.R(~x, ~y∗) → z = w[w/c]. Therefore, ν ′(z) = c;

contradiction.

Assume (3) that there are variables w and z′, and a position p′ such that w occurs

at position p of ~y, w occurs at position p′ of ~y, p 6= p′, z′ = σ(p′), and ν ′(z) 6= ν ′(z′).

Notice in the algorithm RewriteLocal that since I |= Qlocal(~x, ~z)[~x/~c][~z/~t], we have that

I |= ∀~y∗.R(~x, ~y∗) → z = z′. Since I ⊆ I, R(~c, ~d′) ∈ I. Thus, {R(~c, ~d′)} |= ∀~y∗.R(~x, ~y∗) → z = z′. Therefore, ν′(z) = ν′(z′); contradiction.

(⇐) Assume that consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = true. Assume towards a contradiction

that I ⊭ Qlocal(~x, ~z)[~x/~c][~z/~t]. Then, at least one of the following conditions

holds:

1. I 6|= ∃~w.R(~x, ~y)[~x/~c][~z/~t]; or

2. there is a constant d at position p in ~y and a variable z such that z = σ(p) and

I 6|= ∀~y∗.R(~x, ~y∗) → z = d[~x/~c][~z/~t]; or

3. there is some variable w such that w occurs at position p of ~y, w occurs in either

~x or ~z, and I 6|= ∀~y∗.R(~x, ~y∗) → z = w[~x/~c][~z/~t]; or

4. there is some variable w that occurs at position p of ~y, and at a position p′ of ~y

such that p 6= p′, σ(p) = z, σ(p′) = z′ and I 6|= ∀~y∗.R(~x, ~y∗) → z = z′[~x/~c][~z/~t].

Assume that I 6|= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let I be an arbitrary repair of I. Since I ⊆ I,

I 6|= ∃~w.R(~x, ~y)[~x/~c][~z/~t]; contradiction.

Suppose that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Furthermore, assume that there is a constant

d at position p in ~y and a variable z such that z = σ(p) and I 6|= ∀~y∗.R(~x, ~y∗) → z =

d[~x/~c][~z/~t]. Then, there is a tuple R(~c, ~d) in I such that {R(~c, ~d)} ⊭ ∀~y∗.R(~x, ~y∗) → z = d[~x/~c][~z/~t]. This means that there is some constant e at position p of ~d such that

d ≠ e. Thus, {R(~c, ~d)} ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Proposition 3.8, there is a repair I of I such that R(~c, ~d) ∈ I. Assume that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let R(~c, ~d′) be a

tuple of I such that {R(~c, ~d′)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Since I is a repair of I, I satisfies

the key constraints of Σ. Thus, ~d = ~d′. Therefore, {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t];

contradiction.

Suppose that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Furthermore, assume that there is some

variable w such that w occurs at position p of ~y, w occurs in either ~x or ~z, and I ⊭ ∀~y∗.R(~x, ~y∗) → z = w[~x/~c][~z/~t]. Then, there is a tuple R(~c, ~d) in I such that {R(~c, ~d)} ⊭ ∀~y∗.R(~x, ~y∗) → z = w[~x/~c][~z/~t]. Let ν be a valuation for the variables of ~x and ~z such

that ν(~x) = ~c and ν(~z) = ~t. Let c = ν(w). Then, there is some constant e at position p of

~d such that c 6= e. Thus, {R(~c, ~d)} 6|= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Proposition 3.8, there is a

repair I of I such that R(~c, ~d) ∈ I. Assume that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let R(~c, ~d′) be

a tuple of I such that {R(~c, ~d′)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Since I is a repair of I, I satisfies

the key constraints of Σ. Thus, ~d = ~d′. Therefore, {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t];

contradiction.

Suppose that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Furthermore, assume that there is some

variable w that occurs at position p of ~y, and at a position p′ of ~y such that p 6= p′,

σ(p) = z, σ(p′) = z′ and I 6|= ∀~y∗.R(~x, ~y∗) → z = z′[~x/~c][~z/~t]. Then, there is a

tuple R(~c, ~d) in I such that {R(~c, ~d)} 6|= ∀~y∗.R(~x, ~y∗) → z = z′[~x/~c][~z/~t]. Let ν

be a valuation for the variables of ~y∗ such that ν(~y∗) = ~d. Then, there are con-

stants d and e at the respective positions p and p′ of ~d such that d 6= e. Thus,

{R(~c, ~d)} 6|= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Proposition 3.8, there is a repair I of I such that

R(~c, ~d) ∈ I. Assume that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let R(~c, ~d′) be a tuple of I such that

{R(~c, ~d′)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Since I is a repair of I, I satisfies the key constraints

of Σ. Thus, ~d = ~d′. Therefore, {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]; contradiction.
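The statement of Lemma 3.11 can be checked on a small instance. The following Python sketch is illustrative only: it assumes a binary relation R whose first attribute is the key, a query q(x) = R(x, d) with a constant d at the nonkey position, and a hand-written local condition (some tuple with key x exists, and every tuple with key x has nonkey value d) in the spirit of Qlocal. The helper names repairs, q_local, and consistent_answers are hypothetical and are not part of RewriteLocal.

```python
from itertools import product


def repairs(instance):
    """Enumerate all repairs of a binary relation whose first attribute is the key."""
    groups = {}
    for key, value in instance:
        groups.setdefault(key, []).append((key, value))
    # A repair keeps exactly one tuple for each key value.
    for choice in product(*groups.values()):
        yield set(choice)


def q_local(instance, d):
    """Keys x that occur in the instance and whose tuples all have nonkey value d."""
    keys = {k for k, _ in instance}
    return {k for k in keys if all(v == d for kk, v in instance if kk == k)}


def consistent_answers(instance, d):
    """Brute force: keys x such that R(x, d) holds in every repair."""
    answers = None
    for repair in repairs(instance):
        satisfied = {k for k, v in repair if v == d}
        answers = satisfied if answers is None else answers & satisfied
    return answers


# Key "a" is inconsistent (two nonkey values); "b" and "c" are consistent.
I = {("a", 1), ("a", 2), ("b", 1), ("c", 2)}
print(q_local(I, 1) == consistent_answers(I, 1))  # True
```

The brute-force function is exponential in the number of violated keys, whereas the local condition inspects the instance directly; on this instance both return the same set of keys.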

3.3.5 Correctness of RewriteTree

Consider a Boolean query q = ∃~x, ~w.φ(~x, ~w) and a query q′(~x) = ∃~w.φ(~x, ~w). That is, q

and q′ have the same literals, but some of the (existentially-quantified) variables of q are

free in q′. In Lemma 3.10 above, we showed that if q′ is in a certain class of conjunctive

queries, then there is a “pessimistic” repair M such that for all ~c, if ~c ∈ q′(M), then

~c ∈ consistentΣ(q′, I). We also argued that this fact implies that, in order to check

whether consistentΣ(q, I) = true, it suffices to find some instantiation ~c for the free

variables of q′ such that ~c ∈ consistentΣ(q′, I). The latter condition is fundamental in

the design of our algorithm since it can be checked with a first-order query directly on the

inconsistent database I. In the next lemma, we show that the algorithm RewriteTree,

the main building block of RewriteForest, produces a first-order query that checks

precisely this condition.

Lemma 3.12. Let q(~x, ~z) be a query in Cforest whose join graph T is a tree with root

literal R(~x, ~y). Let I be an instance. Let Q(~x, ~z) be the first-order query returned by

RewriteTree(q, Σ).

Then, (~c,~t) ∈ Q(I) iff (~c,~t) ∈ consistentΣ(q, I).

Proof. The proof is by induction on the number of literals of q.

Base case Assume that q has exactly one literal. Then, q(~x, ~z) = ∃~w.R(~x, ~y),

and Q = RewriteLocal(q, Σ). By Lemma 3.11, we have that I |= Q(~x, ~z)[~x/~c][~z/~t]

iff consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = true.

(⇒) Notice in the algorithm RewriteLocal that, since I |= Qlocal[ν],

I |= ∃y1, . . . , ym.R(~x, ~y)[ν]. Let ~c = ν(~x). Then, there exists some ~d such that R(~c, ~d) ∈ I.

Let I be a repair of I. By Proposition 3.7, there is some ~d′ such that R(~c, ~d′) ∈ I.

Assume that there are no constants in ~y. Since all the variables of ~y are existentially

quantified in qT , {R(~c, ~d′)} |= qT [ν], and we are done.

Assume that there is some constant in ~y. Since all the variables of ~y are existentially

quantified in qT , in order to show that {R(~c, ~d′)} |= qT [ν], it suffices to show that ~d′ and

~y coincide in their constants. By Proposition 3.6, I ⊆ I. Thus, R(~c, ~d′) ∈ I. Since

I |= Qlocal[ν] and R(~c, ~d′) ∈ I, we have that |= Qconst[~y∗/~d′]. Therefore, it holds that if

there is a constant e at position i of ~d′, then |= Ei[wi/e], where wi is the variable created

in RewriteLocal for the i-th position of ~y. By construction of Ei, this means that there

is a constant e at position i of ~y.

(⇐) Let I be a repair of I. Let ~c = ν(~x). Since I |= qT [ν], there exists ~d such that

R(~c, ~d) ∈ I. By Proposition 3.6, I ⊆ I. Therefore, there exists ~d′ such that R(~c, ~d′) ∈ I.

Thus, I |= ∃y1, . . . , ym.R(~x, ~y)[ν].

Assume that there is some constant in ~y. Let νy be a valuation for the variables

of ~y∗, where ~y∗ is the vector of variables created in RewriteLocal. Let ~d be such that

~d = νy(~y∗). If R(~c, ~d) 6∈ I, then I |= R(~x, ~y∗) → Qconst[ν][νy] because the left-hand side

of the implication is not satisfied. Assume R(~c, ~d) ∈ I. By Proposition 3.8, there exists

a repair I of I such that R(~c, ~d) ∈ I. Since I |= Σ, if R(~c, ~d′) ∈ I, then ~d′ = ~d. Since

I |= qT [ν], {R(~c, ~d)} |= qT [ν]. Therefore, if d is a constant that appears at position i in

~y, then d occurs at position i in ~d. Thus, I |= Qconst[ν][νy].

Inductive step Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the children of R in T . Assume

that q is of the form ∃~w.φ(~w, ~z), where φ is a conjunction of literals. For each 1 ≤ i ≤ m,

let Ti be the tree whose root is Ri. Let φi be the conjunction of the literals of Ti. Let

~wi = {w : w is a variable that occurs in φi and ~w, and w 6∈ ~xi}. Let ~zi = {z : z

is a variable that occurs in φi and ~z, and z 6∈ ~xi}. Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi).

Let Qi(~xi, ~zi) = RewriteTree(qi, Σ). Let qlocal(~x, ~z) = ∃~w.R(~x, ~y). Let Qlocal(~x, ~z) =

RewriteLocal(qlocal, Σ).

(⇒) Assume that I |= Q(~x, ~z)[~x/~c][~z/~t]. Then, there is a valuation ν for the variables

of φ such that:

1. ν(~x) = ~c, and

2. ν(~z) = ~t, and

3. I |= Qlocal(~x, ~z)[ν], and

4. for every i such that 1 ≤ i ≤ m, there are ~ci and ~ti such that ν(~xi) = ~ci, ν(~zi) = ~ti,

and I |= Qi(~xi, ~zi)[~xi/~ci][~zi/~ti]

Let I be a repair of I. Assume towards a contradiction that I 6|= ∃~w.R(~x, ~y)[~x/~c][~z/~t].

Then, consistentΣ(∃~w.R(~x, ~y)[~x/~c][~z/~t], I) = false. By Lemma 3.11, we have that

I 6|= Qlocal(~x, ~z)[ν]; contradiction.

Assume that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Lemma 3.9, none of the variables of ~wi

appear in ~wj, for every i and j such that i 6= j, 1 ≤ i ≤ m, 1 ≤ j ≤ m. Then, I 6|= qi(~ci,~ti)

for some i such that 1 ≤ i ≤ m. Thus, consistentΣ(qi(~xi, ~zi)[~xi/~ci][~zi/~ti], I) = false.

By inductive hypothesis, I 6|= Qi(~xi, ~zi)[~xi/~ci][~zi/~ti]; contradiction.

(⇐) Assume that consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = true. Assume towards a con-

tradiction that I 6|= Q(~x, ~z)[~x/~c][~z/~t]. Let ν be a valuation for the variables of φ such

that ν(~x) = ~c and ν(~z) = ~t. By Lemma 3.9, none of the variables of ~wi appear in ~wj, for

every i and j such that i 6= j, 1 ≤ i ≤ m, 1 ≤ j ≤ m. Then, either (1) I 6|= Qlocal(~x, ~z)[ν];

or (2) there is some i such that I 6|= Qi(~xi, ~zi)[ν].

Assume that I ⊭ Qlocal(~x, ~z)[ν]. By Lemma 3.11, consistentΣ(∃~w.R(~x, ~y)[~x/~c][~z/~t], I) =

false. Thus, it is the case that consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = false; contradic-

tion. Assume that there is some i such that I 6|= Qi(~xi, ~zi)[ν]. By inductive hypothesis,

consistentΣ(qi(~xi, ~zi)[~xi/~ci][~zi/~ti], I) = false. Thus, it is the case that

consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = false; contradiction.

3.3.6 Correctness of RewriteForest

We are now ready to give the correctness proof of our rewriting algorithm, for all queries

in class Cforest. The intuition of the proof is the following. Assume that we are given

a query q in Cforest. Then, each of the connected components of the join graph of q

is a tree. Recall that RewriteTree, the algorithm for which we proved correctness in

the above lemma, requires that the input query satisfies the following conditions. First,

the join graph of the query must be a tree. Second, the free variables of the query

must include all the variables at key positions of the literal at the root of this tree.

In order to be able to use RewriteTree, RewriteForest produces a subquery for each

tree of the join graph such that the variables at the key of the corresponding tree’s

root are free. In this way, a first-order rewriting can be produced for each subquery by

invoking the algorithm RewriteTree. For each i, let Qi(~xi, ~zi) be the rewriting obtained

by invoking RewriteTree(qi, Σ). The query returned by RewriteForest has the form

Q(~z) = ∃~w.(φ(~w, ~z) ∧ ⋀i=1...m Qi(~xi, ~zi)), where φ(~w, ~z) is the conjunction of literals of

the original query q, and the variables of each ~xi are in ~w. The correctness of this formula

relies on the structural property of Section 3.3.2 and the notion of a “pessimistic” repair of

Section 3.3.3. First, by Lemma 3.10, it suffices to find one instantiation for the variables

of each ~xi. Thus, the variables of ~xi can be free in Qi. Second, the subqueries do not

share existentially-quantified variables. This is ensured by the structural property proved

in Lemma 3.9.

Theorem 3.5. Let R be a schema. Let Σ be a set of integrity constraints, consisting

of one key dependency per relation of R. Let q(~z) be a conjunctive query over R such

that q ∈ Cforest. Let Q(~z) be the first-order query returned by RewriteForest(q, Σ). Let

I be an instance over R.

Then, ~t ∈ Q(I) iff ~t ∈ consistentΣ(q, I).

Proof. Let G be the join graph of q. Since q ∈ Cforest, G is a forest. Let T1, . . . , Tm be

the connected components (trees) of G. Assume that q is of the form ∃~w.φ(~w, ~z), where

φ is a conjunction of literals. For each 1 ≤ i ≤ m, let Ri(~xi, ~yi) be the literal at the root

of Ti. Let φi be the conjunction of the literals of Ti. Let ~wi = {w : w is a variable that

occurs in φi and ~w, and w 6∈ ~xi}. Let ~zi = {z : z is a variable that occurs in φi and ~z,

and z 6∈ ~xi}. Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi). Let Qi(~xi, ~zi) = RewriteTree(qi, Σ).

(⇒) Assume that I |= Q(~z)[~z/~t]. Then, there is a valuation ν for the variables of φ

such that:

1. ν(~z) = ~t, and

2. I |= φ(~w, ~z)[ν], and

3. for every i such that 1 ≤ i ≤ m, there are ~ci and ~ti such that ν(~xi) = ~ci, ν(~zi) = ~ti,

and I |= Qi(~xi, ~zi)[~xi/~ci][~zi/~ti]

Let I be a repair of I. Assume towards a contradiction that I 6|= q[~z/~t]. Thus,

I 6|= q[ν]. By Lemma 3.9, none of the variables of ~wi appear in ~wj, for every i and j

such that i 6= j, 1 ≤ i ≤ m, 1 ≤ j ≤ m. Then, I 6|= qi(~xi, ~zi)[~xi/~ci][~zi/~ti] for some i such

that 1 ≤ i ≤ m. Thus, consistentΣ(qi(~xi, ~zi)[~xi/~ci][~zi/~ti], I) = false. By Lemma 3.12,

I 6|= Qi(~xi, ~zi)[~xi/~ci][~zi/~ti]; contradiction.

(⇐) Assume that ~t ∈ consistentΣ(q, I). Assume towards a contradiction that I ⊭ Q(~z)[~z/~t]. Let ν be a valuation for the variables of φ such that ν(~z) = ~t. Then, either

(1) I 6|= q(~z)[ν]; or (2) there is some i such that I 6|= Qi(~xi, ~zi)[ν].

We will build a repair M of I as follows. For each i, let Ii be the projection of

I on the relation symbols of φi. By Lemma 3.10, there is a repair Mi such that if

Mi |= qi(~xi)[~xi/~ci], then consistentΣ(qi(~xi)[~xi/~ci], Ii) = true. We add all the tuples of

Mi to M.

We now show that M ⊭ q(~z)[ν]. Assume that I ⊭ q(~z)[ν]. Since M ⊆ I, M ⊭ q(~z)[ν]. Now, assume that there is some i such that 1 ≤ i ≤ m and I ⊭ Qi(~xi, ~zi)[ν]. By

Lemma 3.12, consistentΣ(qi(~xi, ~zi)[ν], I) = false. By Lemma 3.10, Mi 6|= qi(~xi, ~zi)[ν].

Thus, M 6|= q(~z)[ν].

So, for every valuation ν such that ν(~z) = ~t, we have that M 6|= q(~z)[ν]. Thus,

~t ∉ consistentΣ(q, I); contradiction.
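As a sanity check of Theorem 3.5 on a single query, the sketch below compares a hand-written first-order condition against brute-force enumeration of repairs. It assumes the Boolean query q = ∃x, y, z. R(x, y) ∧ S(y, z), with the first attribute of each relation as its key, and it assumes a rewriting of the shape ∃x, y. R(x, y) ∧ ∀y′.(R(x, y′) → ∃z′. S(y′, z′)). This shape is an assumption made for illustration, not the literal output of RewriteForest, and all function names are hypothetical.

```python
from itertools import product


def repairs(relation):
    """Enumerate the repairs of a binary relation keyed on its first attribute."""
    groups = {}
    for key, value in relation:
        groups.setdefault(key, []).append((key, value))
    for choice in product(*groups.values()):
        yield set(choice)


def holds(r, s):
    """Evaluate q = exists x, y, z. R(x, y) and S(y, z) on one instance."""
    s_keys = {k for k, _ in s}
    return any(y in s_keys for _, y in r)


def consistent(r, s):
    """Brute force: q must hold in every combination of repairs of R and S."""
    return all(holds(rr, sr) for rr in repairs(r) for sr in repairs(s))


def rewritten(r, s):
    """Assumed rewriting: some key x of R all of whose tuples join with S."""
    s_keys = {k for k, _ in s}
    r_keys = {k for k, _ in r}
    return any(all(y in s_keys for k, y in r if k == x) for x in r_keys)


R = {(1, "a"), (1, "b"), (2, "c")}
S_full = {("a", 10), ("b", 20), ("b", 30)}
S_partial = {("a", 10)}
print(consistent(R, S_full), rewritten(R, S_full))        # True True
print(consistent(R, S_partial), rewritten(R, S_partial))  # False False
```

With S_partial, a repair may keep R(1, b) and R(2, c), neither of which joins with S, so the query is not consistently true; the assumed rewriting detects this without enumerating repairs.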

3.4 Related Work

In their seminal paper on consistent query answering, Arenas, Bertossi and Chomicki

[ABC99] propose a first-order rewriting algorithm. The algorithm applies to a broad

class of constraints but a restricted class of queries, called quantifier-free conjunctive

queries. In these queries, all variables are free (i.e., there is no existential quantification).

If we think in terms of equivalent SQL queries, the fact that all variables are free means

that every attribute of every relation in the from clause must appear in the select

clause. This is a strong restriction that rules out many practical queries. As an empirical

observation, none of the queries in the TPC-H specification [TPC03], the industry stan-

dard for decision support systems, satisfy this restriction. Chomicki and Marcinkowski

[CM05] propose a rewriting for another restricted class, where no variables are shared

between literals (and therefore, there are no joins). In this chapter, we focused on a class

of conjunctive queries that may have existential quantification, and we argued that the

class captures many queries that arise in practice.

Except for the aforementioned work [ABC99, CM05], to the best of our knowledge,

none of the work in the consistent query answering literature has focused on first-order

rewritings. Instead, they typically produce rewritings into disjunctive logic programs

[ABC00, CB00, GZ00, FPL+01, GGZ01, LLR02, ABC03a, BB03a, BB03b, EFGL03,

CB05]. Their focus is on obtaining correct disjunctive logic programs for (usually large)

classes of queries and constraints. However, given the high complexity of disjunctive

logic programming, none of these approaches focus on tractability issues. Tractability

results have been given in the context of databases with OR-objects [IvdMV95]. As

we mentioned in Section 2.5, OR-objects can be used in some (though not all) cases

to represent databases inconsistent with respect to key constraints. To the best of our

knowledge, query rewriting has not been studied in the context of OR-objects.

Our work on first-order query rewriting has been subsequently extended by other

authors. Grieco, Lembo, Rosati and Ruzzi [GLRR05] show a query rewriting algorithm

for our class Cforest under exclusion constraints (that is, constraints which restrict values

to appear in exactly one of two relations). In a recent paper [LRR06], Lembo, Rosati,

and Ruzzi extend the class Cforest to consider queries that may have the union operation.

Chapter 4

Rewritings for Queries with

Grouping and Aggregation

In the previous chapter, we presented query rewritings for queries with set semantics and

no aggregation. However, practical query languages like SQL have bag semantics (dupli-

cates are not eliminated unless explicitly requested), and support aggregation functions

and grouping of results. For this reason, in this chapter we present rewritings for queries

with bag semantics, grouping, and aggregation.

4.1 Formal Language

Despite extensive research on queries with bag semantics and aggregation [CV93, IR95,

LW97, GM96, GRT99, CNS99, HLNW01, CNS03], there is no commonly agreed formal

language for this kind of queries, with different researchers proposing different (but of-

ten equivalent) languages. For this reason, in this section, we introduce languages for

first-order aggregate queries and conjunctive aggregate queries that are influenced by the

previous proposals. The former language will be used to express our query rewritings,

whereas the latter will be used for the input queries (i.e., the queries for which we compute

consistent answers). The language of first-order aggregate queries extends the language

of first-order logic with operators for grouping and aggregation. Aggregate conjunctive

queries are a subset of first-order aggregate queries.

Our language for first-order aggregate queries is based on the one given by Cohen,

Nutt and Sagiv [CNS03], except for the fact that we use a “SQL-like” syntax to specify

grouping and aggregation. The language can be shown to be a subset of the aggregate

logic Laggr introduced by Hella, Libkin, Nurmonen, and Wong [HLNW01]. We do not

explicitly provide the bag manipulation operators (such as additive union, maximum

union, etc.) that are given in bag algebras [GM96, LW97].

Bags and aggregation functions. A bag (or multiset) is a collection of elements,

each of which occurs one or more times in the collection. We will denote the multiplicity

(number of occurrences) of each element x of a bag B as |x|B. If S is a domain, we

denote by B(S) the set of finite bags over S. A k -ary aggregation function is a function

F : B(Ck) → R that maps bags of k-tuples of constants from some underlying domain C

to real numbers. In particular, we will consider the functions sum, min, and max, which

return the sum, minimum, and maximum of a bag of tuples, and the function count(*),

which returns the cardinality of a bag of tuples.

We will consider a bag-set query semantics [CV93], where relations (and their re-

pairs) are assumed to be sets, but the aggregate queries manipulate bags. For example,

consider a database I = {employee(John, 1000), employee(Mary, 1000)} and a query

q that retrieves the salaries (the second attribute of relation employee), expressed as

q(s) = ∃e. employee(e, s). Under bag-set semantics, the result of q(I) is {{1000, 1000}} (that is, 1000 has multiplicity two in the result).
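The difference between set and bag-set semantics in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the relation is modelled as a Python set of tuples, and collections.Counter stands in for a bag, recording the multiplicity of each element.

```python
from collections import Counter

# The database instance I is a set of employee(name, salary) tuples.
I = {("John", 1000), ("Mary", 1000)}

# Set semantics: projecting on salary eliminates duplicates.
set_result = {salary for _name, salary in I}

# Bag-set semantics: one occurrence of the salary per satisfying tuple of I.
bag_result = Counter(salary for _name, salary in I)

print(set_result)        # {1000}
print(bag_result[1000])  # 2
```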

Language syntax. A first-order aggregate query q may be either:

1. a first-order formula; or

2. a formula of the form

select ~z, F1(~v1), . . . , Fm(~vm)

from q∗(~w, ~z)

group by ~z

where q∗ is a first-order aggregate query, ~w and ~z do not share variables, ~v1, . . . , ~vm

are vectors of variables from ~w, and F1, . . . , Fm are aggregation functions with

arities |~v1|, . . . , |~vm|. We will say that ~z are the grouping variables of the query, and

~v1, . . . , ~vm are the aggregation variables.

Language semantics. We now define how to obtain a set of tuples by applying a

first-order aggregate query q to a database I. (Even though aggregate functions take

bags as input, the final result of a query is always a set because it has one tuple for each

“group”).

If the aggregate query is just a first-order formula (Case 1 above), its semantics

corresponds to the semantics of first-order queries. If the query is of the form of Case

2 above, the aggregate query q is evaluated as follows. First, we retrieve “groups” that

satisfy q∗ (i.e., all the satisfying assignments for the grouping variables ~z). Second, for

each group ~a (i.e., for each instantiation of the grouping variables ~z), we obtain the bag

of tuples σ~a that satisfy q∗ and whose projection on ~z is ~a (the tuples of σ~a are on both

the grouping variables ~z and the other free variables ~w of q∗). Third, for each group ~a

and aggregation function Fi, we create a bag Bi,~a by taking each tuple (~c,~a) of σ~a and

projecting on the aggregation variables ~vi. Finally, we apply every aggregation function

Fi to the corresponding bag Bi,~a.

More formally, for every database instance I, tuple ~a and real numbers b1, . . . , bm, we

say that (~a, b1, . . . , bm) ∈ q(I) if there is a set σ~a such that:

• I |= ∃~w.q∗(~w,~a), and

• σ~a = {(~c,~a) : (~c,~a) ∈ q∗(I)}, and

• for every i such that 1 ≤ i ≤ m, bi = Fi(Bi,~a), where Bi,~a is the bag obtained by

taking each tuple (~c,~a) of σ~a and projecting on the aggregation variables ~vi.
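The evaluation steps above can be sketched in Python. The sketch assumes that the satisfying assignments of q∗ are already available as a list of dictionaries from variable names to constants, and that every aggregation function is unary and given as a Python callable on lists. The name evaluate_aggregate, its parameters, and the department attribute d in the sample data are hypothetical.

```python
from collections import defaultdict


def evaluate_aggregate(assignments, grouping_vars, aggregates):
    """Evaluate: select z, F1(v1), ..., Fm(vm) from q* group by z.

    assignments   -- satisfying assignments of q*, one dict per tuple of q*(I)
    grouping_vars -- list of variable names (the vector z)
    aggregates    -- list of (function, variable) pairs, one per Fi(vi)
    """
    groups = defaultdict(list)
    for nu in assignments:
        a = tuple(nu[z] for z in grouping_vars)
        groups[a].append(nu)  # sigma_a: the tuples that belong to group a
    result = set()
    for a, sigma_a in groups.items():
        row = a
        for func, var in aggregates:
            bag = [nu[var] for nu in sigma_a]  # B_{i,a}: projection on v_i
            row += (func(bag),)
        result.add(row)  # one output tuple per group, so the result is a set
    return result


# Hypothetical assignments over variables e (name), d (department), s (salary).
q_star = [
    {"e": "John", "d": "A", "s": 1000},
    {"e": "Mary", "d": "A", "s": 1000},
    {"e": "Ali", "d": "B", "s": 2000},
]
# len plays the role of count(*): it returns the cardinality of the bag.
answer = evaluate_aggregate(q_star, ["d"], [(sum, "s"), (len, "s")])
print(sorted(answer))  # [('A', 2000, 2), ('B', 2000, 1)]
```

Keeping each bag as a list preserves duplicates, which is what makes sum return 2000 rather than 1000 for group A.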

We now define the language of conjunctive aggregate queries as a subset of first-order

aggregate queries. A conjunctive aggregate query is a formula of the form

select ~z, F1(~v1), . . . , Fm(~vm)

from q∗(~w, ~z)

group by ~z

where q∗(~w, ~z) is a conjunctive query, ~v1, . . . , ~vm are vectors of variables from ~w, and

F1, . . . , Fm are aggregation functions of the arities of ~v1, . . . , ~vm. We will say that ~z are

the grouping variables, and ~v1, . . . , ~vm are the aggregation variables. The semantics is the

same as for first-order aggregate queries.

As with first-order aggregate queries, the language of conjunctive aggregate queries is

influenced by previous proposals. In particular, it corresponds closely to the language pre-

sented by Cohen, Nutt and Serebrenik [CNS99], except that we use a “SQL-like” syntax

instead of a Datalog syntax. It is also related to the language of real conjunctive queries

(conjunctive queries with bag semantics) introduced by Chaudhuri and Vardi [CV93],

and the class of conjunctive queries with label systems representing multisets presented

by Ioannidis and Ramakrishnan [IR95]. In the latter two cases, tuples are returned to-

gether with their multiplicity. This can be obtained in our conjunctive aggregate queries

by using the aggregation function count(∗).

4.2 Algorithms

In this section, we present query rewriting algorithms under the aggconsistentΣ se-

mantics for a class of queries that extends the class Cforest of the previous chapter with

operators for grouping and aggregation. In Section 4.2.1, we present the rewriting algo-

rithm for queries with bag semantics (i.e., the count(*) operator), and in Section 4.2.2

we present the algorithm for queries with the unary aggregation functions sum, min, and

max.

4.2.1 Queries with Bag Semantics

In this subsection, we give a query rewriting algorithm for conjunctive queries with bag

semantics (i.e., the count(*) operator). We start with an example, and then give the

general algorithm. The example illustrates how we can build upon the results for query

rewriting conjunctive queries under set-theoretic semantics of the previous chapter.

Example 4.1. Let R be a schema with one relation symbol employee. Assume that employee

has two attributes: emplKey (the name of the employee) and salary. Let Σ be a set that

consists of only one constraint stating that emplKey is the key of relation employee.

Consider the following query q1, which counts the number of occurrences of each

salary (it corresponds to query q3 of Example 2.1).

q1(s, v): select s, count(*) as v

from employee(e, s)

group by s

Let I be a database instance such that I = {employee(John, 1000), employee(John, 2000),

employee(Mary, 1000), employee(Ali, 1000)}. There are two repairs of I with respect to

Σ: I1 = {employee(John, 1000), employee(Mary, 1000), employee(Ali, 1000)} and I2 =

{employee(John, 2000), employee(Mary, 1000), employee(Ali, 1000)}. Furthermore, q1(I1) =

{(1000, 3)} and q1(I2) = {(1000, 2), (2000, 1)}. By Definition 2.4, aggconsistentΣ(q1, I) =

{(1000, 2, 3)}. That is, the salary 1000 is an answer that appears at least twice and at

most three times in the result of applying q1 on the repairs.
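The range computed in this example can be reproduced by brute force. The following Python sketch enumerates the repairs of I, evaluates q1 on each of them, and keeps the salaries that appear in every repair together with their minimum and maximum counts. It follows the informal reading of aggconsistentΣ used in this example, and the helper names are hypothetical. Enumerating repairs is exponential in general, so the sketch is meant only for tiny instances.

```python
from collections import Counter
from itertools import product


def repairs(instance):
    """Enumerate the repairs of employee(emplKey, salary) with key emplKey."""
    groups = {}
    for key, salary in instance:
        groups.setdefault(key, []).append((key, salary))
    for choice in product(*groups.values()):
        yield set(choice)


def q1(instance):
    """select s, count(*) from employee(e, s) group by s, as a Counter."""
    return Counter(s for _e, s in instance)


def agg_consistent(instance):
    """Salaries present in every repair, with min and max count over repairs."""
    results = [q1(r) for r in repairs(instance)]
    common = set(results[0])
    for res in results[1:]:
        common &= set(res)
    return {
        (s, min(r[s] for r in results), max(r[s] for r in results))
        for s in common
    }


I = {("John", 1000), ("John", 2000), ("Mary", 1000), ("Ali", 1000)}
print(agg_consistent(I))  # {(1000, 2, 3)}
```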

Let us focus on obtaining the greatest lower bound for q1. From the previous chapter,

we know how to obtain consistent answers for conjunctive queries without aggregation

under set-theoretic semantics. We would like to reuse such results here. An obvious

strategy (shown to be incorrect shortly) is to first remove grouping and aggregation

from q1, obtain the consistent answers under set-theoretic semantics, and finally apply

grouping and aggregation to the intermediate result. That is, first compute the consistent

answers for the following query q′1(s):

select s

from employee(e, s)

We can express q′1 in conjunctive query notation as follows: q′1(s) = ∃e. employee(e, s).

Let QConsistent′(s) be the first-order query obtained by applying RewriteForest(q′1, Σ),

the algorithm introduced in the previous chapter. Suppose that we now apply the operator

count(*) to the result of QConsistent′(s) as follows:

select s, count(*)

from QConsistent′(s)

group by s

It is easy to see that this strategy leads to a wrong result. Since the result of the

consistent answers to q′1 (consistentΣ(q′1, I)) is {(1000)}, we would incorrectly conclude

that the greatest lower bound for 1000 is one, when in fact it is two. Clearly, the cause

for the incorrect result is that cardinalities are lost in the set-theoretic consistent answers

that we computed as an intermediate step. But, is there any way of obtaining the correct

bounds for the aggregate query, and yet be able to reuse the notion of set-theoretic

consistent answers as an intermediate step? The answer is positive: we can use a “root

key value at a time” principle. In this case, this corresponds to making the variable e

(for employee name) free because it is at the key position of employee(e, s), the literal

at the root (and only node) of q′1. We will obtain the consistent answer one employee

at a time in the intermediate result, and then project out the employees (since they

are not retrieved by q1). The intermediate result will be guaranteed to have the correct

cardinalities despite the fact that it is obtained using set semantics. The intuitive reason

is that repairs are sets of tuples that satisfy the key constraints, and hence every employee

name appears exactly once in each repair.

Following the previous discussion, let q′′1 be the query q′1, where the variable e is made

free. That is, let q′′1(e, s) = employee(e, s). The set-theoretic consistent answers for q′′1 are

consistentΣ(q′′1 , I) = {(Mary, 1000), (Ali, 1000)}. We can now project out the employee

names and count the number of occurrences of salary 1000, arriving at the correct lower

bound for count(*) in q1.
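A minimal Python sketch of this “root key value at a time” computation follows. It assumes the single-literal query q′′1(e, s) = employee(e, s), for which a pair (e, s) is a consistent answer exactly when every tuple with key e has salary s, so that no repair can drop it. The function names are hypothetical, and the direct check stands in for the first-order rewriting rather than reproducing it.

```python
from collections import Counter

I = {("John", 1000), ("John", 2000), ("Mary", 1000), ("Ali", 1000)}


def consistent_pairs(instance):
    """Pairs (e, s) such that s is the only salary recorded for employee e."""
    salaries = {}
    for e, s in instance:
        salaries.setdefault(e, set()).add(s)
    return {(e, next(iter(vals))) for e, vals in salaries.items() if len(vals) == 1}


def glb_counts(instance):
    """Project out the employee and count occurrences of each salary."""
    return Counter(s for _e, s in consistent_pairs(instance))


print(sorted(consistent_pairs(I)))  # [('Ali', 1000), ('Mary', 1000)]
print(glb_counts(I)[1000])          # 2
```

Because each employee name contributes at most one pair, counting over the set of consistent pairs gives the same number as counting over a bag.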

Let us now turn our attention to the computation of the lowest upper bound of q1.

Since aggconsistentΣ(q1, I) = {(1000, 2, 3)}, the salary 1000 is an answer that appears

at most three times in the results of applying q1 to the repairs. We can use q′′1(e, s) =

employee(e, s) to obtain the lowest upper bound of salary 1000 as follows:

select s, count(*) as lub

from q′′1(e, s)

group by s

However, this query also retrieves the tuple (2000, 1) which should not be in the result

of aggconsistentΣ(q1, I) because the salary 2000 does not appear in q1(I1). This means

that we must make sure that the values for the grouping variables are in the consistent

answers for q′′1 . We can do this by employing the first-order rewriting QConsistent(e, s)

of query q′′1 , which can be obtained by invoking the algorithm RewriteForest. Now, we

can rule out 2000 from the final result because there is no tuple for salary 2000 in the

result of QConsistent(e, s). This can be achieved with the following query:

select s, count(*) as lub

from employee(e, s) ∧ ∃e′.QConsistent(e′, s)

group by s
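The upper-bound query of this example can be mimicked directly in Python. The sketch assumes the instance I of Example 4.1 and replaces QConsistent by a direct check (a salary is kept when some employee has it as the only recorded salary); consistent_salaries and lub_counts are hypothetical names.

```python
from collections import Counter

I = {("John", 1000), ("John", 2000), ("Mary", 1000), ("Ali", 1000)}


def consistent_salaries(instance):
    """Salaries s for which some employee e' has s as the only recorded salary."""
    by_key = {}
    for e, s in instance:
        by_key.setdefault(e, set()).add(s)
    return {next(iter(vals)) for vals in by_key.values() if len(vals) == 1}


def lub_counts(instance):
    """Count all tuples employee(e, s) whose salary is a consistent answer."""
    keep = consistent_salaries(instance)
    return Counter(s for _e, s in instance if s in keep)


print(dict(lub_counts(I)))  # {1000: 3}
```

The salary 2000 is filtered out because John, the only employee recorded with it, also has a conflicting tuple with salary 1000.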

Query Rewriting Algorithm

In Figure 4.1, we give the rewriting algorithm for aggregate conjunctive queries with

the count(∗) aggregation function. The algorithm works for queries q of the form

select ~z, count(*)

from q∗(~z)

group by ~z

where q∗ is a conjunctive query in Cforest. The reason for requiring q∗ to be in Cforest is

that, as we motivated in the previous example, we would like to build upon the results for

first-order rewriting of conjunctive queries under set-theoretic semantics. In the previous

chapter, we showed how to obtain such rewritings for the conjunctive queries in class

Cforest.

By definition, the join graph of all queries in Cforest is a forest. We can then instantiate

the values for the key attributes at each root literal of the join graph of q∗, using the

“root key value at a time” strategy that we illustrated in the previous example. More

precisely, let G be the join graph of q∗. We will construct a conjunctive query q′ that

has the same literals as q∗, but all the variables that are at the key of some root of G are

free in q′.

Following the algorithm, let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots

of all trees in G. Let ~x = ⋃i=1...m ~xi, and let ~z′ = ~z − ~x. Let φ(~w, ~z) be the conjunction

of literals of q∗, and let ~w′ = ~w − ~x. We define q′ as q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′). The

advantage of query q′ is that since the variables at the key of all root literal are free,

each tuple appears exactly once in the answer to q′ in the repairs (we will show this

formally in Lemma 4.4). Thus, set and bag-set semantics coincide in the answer to q′.

We can exploit this fact by computing the set-theoretic consistent answers for q′ as an

intermediate result towards producing the consistent answers to the aggregate query q.

The first-order query rewriting QConsistent for q′ is obtained by invoking the algorithm

RewriteForest given in Figure 3.2 of Chapter 3.
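
The variable bookkeeping behind q′ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the order/customer query, the representation of a literal as (name, key variables, nonkey variables), and the helper name free_root_keys are all hypothetical, and the root literals are supplied by hand rather than derived from the join graph.

```python
def free_root_keys(root_literals, z, w):
    """Return (x, z_prime, w_prime) for the head q'(x, z') = ∃w'.φ(x, w', z').

    root_literals: list of (name, key_vars, nonkey_vars), the roots of the join graph
    z: free (grouping) variables of q*
    w: existentially quantified variables of q*
    """
    x = []
    for _name, key_vars, _nonkey_vars in root_literals:
        for v in key_vars:
            if v not in x:                   # x is the union of the root key variables
                x.append(v)
    z_prime = [v for v in z if v not in x]   # z' = z - x
    w_prime = [v for v in w if v not in x]   # w' = w - x
    return x, z_prime, w_prime


# Hypothetical query q*(n) = ∃o, c. order(o, c) ∧ customer(c, n), with keys o and c.
# Its only root literal is order(o, c).
roots = [("order", ["o"], ["c"])]
x, z_prime, w_prime = free_root_keys(roots, z=["n"], w=["o", "c"])
print(x, z_prime, w_prime)  # ['o'] ['n'] ['c']
```

Under these assumptions the resulting query is q′(o, n) = ∃c. order(o, c) ∧ customer(c, n): the root key o has become free, and only c remains quantified.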

The greatest lower bound is computed with the following query, which counts the

number of occurrences of tuples for ~z (the grouping variables) in the consistent answer

to q′.

QGlb(~z, low) = select ~z, count(*)

from QConsistent(~x, ~z′)

group by ~z

Notice that the free variables of QConsistent, ~x and ~z′, contain the variables of ~z, but

may have additional variables. In the final result, we are projecting out these additional

variables, since they are not in the select clause of the query q.

The lowest upper bound is obtained by counting the number of tuples that satisfy

q′(~x, ~z′) and checking that some instantiation of the grouping variables of ~z appears in the

RewriteCount(q, Σ)

Input: A query q of the form

select ~z, count(*)

from q∗(~z)

group by ~z

where q∗ is a conjunctive query in Cforest

Σ, a set of key constraints (one per relation)

Output: Q, an aggregate first-order query that computes aggconsistentΣ(q, I)

for every database I

Let G be the join graph of q

Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots of all trees of G

Let ~x = ⋃_{i=1..m} ~xi

Let ~z′ = ~z − ~x

Let φ(~w, ~z) be the conjunction of literals of q∗

Let ~w′ = ~w − ~x

Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′)

Let QConsistent(~x, ~z′) be the query obtained by invoking RewriteForest(q′, Σ)

Let QGlb(~z, low) = select ~z, count(*)

from QConsistent(~x, ~z′)

group by ~z

Let ~x′ = ~x− ~z

Let QLub(~z, up) = select ~z, count(*)

from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~z

Let Q(~z, low, up) = QGlb(~z, low) ∧ QLub(~z, up)

return Q

Figure 4.1: Query rewriting algorithm for queries with count(*).

consistent answers of q′. This is obtained with the query ∃~x′.QConsistent(~x, ~z′), where

~x′ are the variables of ~x that are not free variables of q.

QLub(~z, up) = select ~z, count(*)

from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~z
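
As a sanity check, the two count(*) bounds can be compared with a brute-force enumeration of repairs on a toy instance. The sketch below is illustrative only: the employee instance is made up, and QConsistent is written by hand for the one-literal query "select s, count(*) from employee(e, s) group by s" (a pair (e, s) is kept when key e has no other salary) rather than produced by RewriteForest.

```python
from collections import Counter, defaultdict
from itertools import product

# Hypothetical inconsistent instance of employee(e, s); e is the key.
employee = {("e1", 1000), ("e1", 2000), ("e2", 1000),
            ("e3", 1000), ("e3", 3000), ("e4", 3000)}


def repairs(rel):
    """Enumerate all repairs: keep exactly one tuple per key value."""
    by_key = defaultdict(list)
    for t in sorted(rel):
        by_key[t[0]].append(t)
    for choice in product(*by_key.values()):
        yield set(choice)


def brute_force_bounds(rel):
    """Range of count(*) per group over all repairs, for groups present in every repair."""
    counts = [Counter(s for _, s in rep) for rep in repairs(rel)]
    groups = set.intersection(*(set(c) for c in counts))
    return {s: (min(c[s] for c in counts), max(c[s] for c in counts)) for s in groups}


def rewritten_bounds(rel):
    """QGlb and QLub for the one-literal query, evaluated directly on the inconsistent instance."""
    salaries = defaultdict(set)
    for e, s in rel:
        salaries[e].add(s)
    q_consistent = {(e, s) for e, s in rel if len(salaries[e]) == 1}
    glb = Counter(s for _, s in q_consistent)     # count over QConsistent(e, s)
    lub = Counter(s for _, s in rel if s in glb)  # count over q'(e, s) ∧ ∃e'.QConsistent(e', s)
    return {s: (glb[s], lub[s]) for s in glb}


print(rewritten_bounds(employee) == brute_force_bounds(employee))  # True
```

On this instance both computations give the range [1, 3] for salary 1000 and [1, 2] for salary 3000, while salary 2000 is excluded because it is absent from some repair.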

4.2.2 Queries with the sum, min, and max Functions

In Figure 4.2, we present the query rewriting algorithm for queries with the sum, min,

and max aggregation functions. The main difference with the rewritings produced by

RewriteCount is that aggregation is performed here in two levels. At the inner level of

the rewriting, we aggregate the values for u (the value that is aggregated in the original

query), and we group by the key-root attributes (vector ~x in the figure). We then project

out the key-root attributes that are not in the select clause of the input query, and

apply the aggregation function of the input query.

For example, the greatest lower bound of the max function is computed as follows:

QGlb(~z, low) =

select ~z, max(bottom)

from

select ~x, ~z′, min(u) as bottom

from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)

group by ~x, ~z′

group by ~z

Notice that, as in RewriteCount, the lower bound is obtained by selecting tuples from

QConsistent(~x, ~z′). In addition, we now have a conjunct q′′(~x, ~z′, u), which retrieves the

values for the aggregate attribute u. The inner level of aggregation consists in this case

of the computation of the bottom attribute, as the minimum for the values retrieved for

u. The outer level applies the max function (i.e., the function of the original query) to

the values of the bottom attribute.
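
The two levels of aggregation can be illustrated on a toy instance. The following sketch makes several assumptions: the relation r(k, g, u) with key k is invented, the query is "select g, max(u) from r(k, g, u) group by g", and QConsistent(k, g) is written by hand (every tuple with key k has group g). The result is compared with a brute-force enumeration of repairs.

```python
from collections import defaultdict
from itertools import product

# Hypothetical inconsistent instance of r(k, g, u) with key k.
r = {("k1", "a", 5), ("k1", "a", 9), ("k2", "a", 7), ("k2", "b", 20), ("k3", "b", 1)}


def max_bounds_rewritten(rel):
    groups_of = defaultdict(set)    # key -> groups it is paired with
    values_of = defaultdict(list)   # (key, group) -> values of u
    for k, g, u in rel:
        groups_of[k].add(g)
        values_of[(k, g)].append(u)
    # QConsistent(k, g), written by hand for this query.
    consistent = {(k, g) for (k, g) in values_of if len(groups_of[k]) == 1}
    consistent_groups = {g for _, g in consistent}
    low, up = {}, {}
    for (k, g), us in values_of.items():
        if (k, g) in consistent:                 # inner level: bottom = min(u)
            low[g] = max(low.get(g, min(us)), min(us))   # outer level: max(bottom)
        if g in consistent_groups:               # inner level: top = max(u)
            up[g] = max(up.get(g, max(us)), max(us))     # outer level: max(top)
    return {g: (low[g], up[g]) for g in consistent_groups}


def max_bounds_brute_force(rel):
    by_key = defaultdict(list)
    for t in sorted(rel):
        by_key[t[0]].append(t)
    results = []
    for repair in product(*by_key.values()):     # one tuple per key value
        best = {}
        for _, g, u in repair:
            best[g] = max(best.get(g, u), u)
        results.append(best)
    groups = set.intersection(*(set(b) for b in results))
    return {g: (min(b[g] for b in results), max(b[g] for b in results)) for g in groups}


print(max_bounds_rewritten(r) == max_bounds_brute_force(r))  # True
```

Here key k1 is consistent for group a with values {5, 9}, so its bottom value 5 is the greatest lower bound of max(u) for a, while the upper bound 9 comes from taking the top value over all keys that may contribute to a.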

RewriteAgg(q, Σ)

Input: A query q of the form

select ~z, [max(u)|min(u)|sum(u)]

from q∗(~z, u)

group by ~z

where q∗ is a conjunctive query in Cforest

Σ, a set of key constraints (one per relation)

Output: Q, an aggregate first-order query that computes aggconsistentΣ(q, I)

for every database I

Let G be the join graph of q

Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots of all trees of G

Let ~x = ⋃_{i=1..m} ~xi

Let ~z′ = ~z − ~x

Let φ(~w, ~z, u) be the conjunction of literals of q∗

Let ~w′ = ~w − ~x

Let q′(~x, ~z′) = ∃ ~w′, u.φ(~x, ~w′, ~z′, u)

Let QConsistent(~x, ~z′) be the query obtained by invoking RewriteForest(q′,Σ)

Let q′′(~x, ~z′, u) = ∃ ~w′.φ(~x, ~w′, ~z′, u)

Let ~x′ = ~x− ~z − u

if the aggregate function is max then

QGlb(~z, low) =

select ~z, max(bottom)

from

select ~x, ~z′, min(u) as bottom

from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)

group by ~x, ~z′

group by ~z

QLub(~z, up) =

select ~z, max(top)

from

select ~x, ~z′, max(u) as top

from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~x, ~z′

group by ~z

endif

if the aggregate function is sum then

QGlb(~z, low) =

select ~z, sum(bottom)

from

select ~x, ~z′, min(u) as bottom

from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)

group by ~x, ~z′

having bottom ≥ 0

∨ select ~x, ~z′, min(u) as bottom

from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~x, ~z′

having bottom < 0

group by ~z

QLub(~z, up) =

select ~z, sum(top)

from

select ~x, ~z′, max(u) as top

from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~x, ~z′

having top > 0

∨ select ~x, ~z′, max(u) as top

from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)

group by ~x, ~z′

having top ≤ 0

group by ~z

endif

if the aggregate function is min then

QGlb(~z, low) =

select ~z, min(bottom)

from

select ~x, ~z′, min(u) as bottom

from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~x, ~z′

group by ~z

QLub(~z, up) =

select ~z, min(top)

from

select ~x, ~z′, max(u) as top

from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)

group by ~x, ~z′

group by ~z

endif

Let Q(~z, low, up) = QGlb(~z, low) ∧ QLub(~z, up)

return Q

Figure 4.2: Query rewriting algorithm for queries with aggregation

4.3 Correctness of the Algorithms

In this section, we prove the correctness of the query rewriting algorithms of this chapter.

We consider the following class of queries, which we call Caggforest.

Definition 4.1. Let q be an aggregate conjunctive query. We say that q is in class

Caggforest if q is of the form

select ~z, [count(*)| F(u)]

from q∗(~z, u)

group by ~z

where q∗ is a conjunctive query in Cforest, and F is one of the aggregation functions

min, max or sum.

The main result of this section is the following theorem:

Theorem 4.2. Let R be a schema. Let Σ be a set of integrity constraints, consisting of

one key dependency per relation of R. Let q(~z, v) be a query in Caggforest. Let Q(~z, l, u)

be the first-order aggregate query returned by RewriteCount(q, Σ) or RewriteAgg(q, Σ)

(depending on the aggregate function of the query).

Let I be an instance over R. If q has the aggregate function sum, assume that the

aggregated attribute ranges over positive numbers on I.

Then, for every tuple ~t, and pair of real numbers low and up, we have that (~t, low, up) ∈ aggconsistentΣ(q, I) iff (~t, low, up) ∈ Q(I).

Notice that for the sum operator we have an additional requirement: the aggregated

variable must take only positive numbers. The rewriting for sum, however, does produce

sound bounds for arbitrary numbers (positive or negative), as we prove in Section 4.3.3.
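
The sign-dependent branches of the sum rewriting can be exercised on a toy instance with both positive and negative values. The sketch below is a hand instantiation under stated assumptions: the relation r(k, g, u) with key k is invented, the query is "select g, sum(u) from r(k, g, u) group by g", and QConsistent(k, g) is written by hand. It only checks that the brute-force range over all repairs falls inside the computed bounds.

```python
from collections import defaultdict
from itertools import product

# Hypothetical inconsistent instance of r(k, g, u) with key k and mixed-sign values.
r = {("k1", "a", 5), ("k1", "a", 9), ("k2", "a", -3),
     ("k2", "b", 4), ("k3", "a", -2), ("k3", "a", -6)}


def sum_bounds_rewritten(rel):
    groups_of = defaultdict(set)
    values_of = defaultdict(list)
    for k, g, u in rel:
        groups_of[k].add(g)
        values_of[(k, g)].append(u)
    consistent = {(k, g) for (k, g) in values_of if len(groups_of[k]) == 1}
    consistent_groups = {g for _, g in consistent}
    low = {g: 0 for g in consistent_groups}
    up = {g: 0 for g in consistent_groups}
    for (k, g), us in values_of.items():
        if g not in consistent_groups:
            continue
        bottom, top = min(us), max(us)
        is_consistent = (k, g) in consistent
        # QGlb: consistent keys contribute bottom >= 0; any key contributes bottom < 0.
        if (is_consistent and bottom >= 0) or bottom < 0:
            low[g] += bottom
        # QLub: any key contributes top > 0; consistent keys contribute top <= 0.
        if top > 0 or (is_consistent and top <= 0):
            up[g] += top
    return {g: (low[g], up[g]) for g in consistent_groups}


def sum_ranges_brute_force(rel):
    by_key = defaultdict(list)
    for t in sorted(rel):
        by_key[t[0]].append(t)
    results = []
    for repair in product(*by_key.values()):     # one tuple per key value
        total = defaultdict(int)
        for _, g, u in repair:
            total[g] += u
        results.append(dict(total))
    groups = set.intersection(*(set(t) for t in results))
    return {g: (min(t[g] for t in results), max(t[g] for t in results)) for g in groups}


bounds = sum_bounds_rewritten(r)
ranges = sum_ranges_brute_force(r)
print(all(bounds[g][0] <= ranges[g][0] and ranges[g][1] <= bounds[g][1] for g in ranges))  # True
```

On this instance the computed bounds for group a are [-4, 7], and every repair yields a sum inside that interval. Group b is not reported, since it is missing from the repairs that keep the tuple (k2, a, -3).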

The algorithms use the first-order query rewritings of the previous chapter as a build-

ing block. The semantics of those rewritings is set-theoretic, whereas the aggregate

functions we consider in this chapter take bags as input. In Section 4.3.1, we show that

for a subclass of the conjunctive queries in Cforest, the cardinality of the query results on

every repair is exactly one. Thus, for this subclass, it is not necessary to keep track of

tuple multiplicities in the intermediate results. Recall that in Chapter 3, we showed that

for every query q in a subclass of Cforest, there is a “pessimistic” repair M such that q(M)

retrieves all the consistent answers to q. We will use the notion of pessimistic repair to

prove that the bounds produced by the rewritings are tight. We will also need the dual

notion of an “optimistic” repair, which we introduce in Section 4.3.2. In Section 4.3.3, we

show that the ranges produced by the query rewritings are sound, in the sense that the

value of the aggregation function falls within the range on every repair. In Section 4.3.4,

we show that the ranges produced by the query rewritings are tight, in the sense that

they are satisfied in at least one repair. Finally, in Section 4.3.5 we put it all together,

and give the proof of correctness of the rewritings.

4.3.1 Building Upon First-Order Rewritings

The semantics of first-order rewritings is set-theoretic, whereas aggregate functions take

bags as input. In this subsection, we show that for a class of conjunctive queries that

is relevant in the query rewriting algorithms, the cardinality of the tuples in the result

of applying a query to the repairs is always one. As a consequence, for such queries, it

suffices to obtain a set-theoretic first-order rewriting. The result of applying the first-

order rewriting to the inconsistent database can be used as an intermediate step towards

obtaining the consistent answers for conjunctive queries with aggregation.

The queries with the aforementioned property are the conjunctive queries in Cforest,

where all the variables at key positions of some root of the join graph are free. The

proof is given in Lemma 4.4. The lemma makes use of an auxiliary result, that we give

next, which focuses on queries in Cforest that satisfy the additional condition that the

join graph must be a tree (instead of a forest). Intuitively, we show that in each repair

I, each tuple ~t in the query result is obtained “due to” the same set of tuples in I. More

formally, we show that if S and S ′ are sets that contain exactly one tuple per relation of

I and such that ~t ∈ q(S) and ~t ∈ q(S ′), then S ′ = S.

Lemma 4.3. Let q(~z) be a query in Cforest. Assume that the join graph T of q is a

tree, and that all the variables at key positions of the literal at the root of T are free in q

(that is, there is a literal R(~x, ~y) at the root of T such that ~x ⊆ ~z). Let I be a database

instance over the schema of q, and Σ be a set consisting of at most one key dependency

per relation of q. Let I be a repair of I wrt Σ. Let S and S ′ be sets that contain exactly

one tuple per relation of I and such that ~t ∈ q(S), and ~t ∈ q(S ′). Then, S ′ = S.

Proof. The proof is by induction on the number of literals of q.

Base case. Assume that q has exactly one literal. Assume towards a contradiction

that S ≠ S′. Then, there are distinct tuples ~t0 and ~t′0 in I such that ~t ∈ q({~t0}) and

~t ∈ q({~t′0}). Let R(~x, ~y) be the only literal of q. Since all the variables at key positions of

the root literal of T are free, and ~z are the free variables of q, we have that ~x ⊆ ~z. Thus,

there are vectors of values ~c, ~d and ~d′ such that ~d ≠ ~d′, ~t0 = R(~c, ~d), and ~t′0 = R(~c, ~d′).

Thus, I ⊭ Σ. But I is a repair of I wrt Σ; contradiction.

Inductive step. Assume that q has more than one literal. Let R be a literal of q

that appears at a leaf of T (recall that T is a tree). Let ~t0 and ~t′0 be tuples of S and S ′,

respectively, such that ~t0 = R(~c, ~d) and ~t′0 = R(~c′, ~d′).

Let M be a set that consists of all the tuples of S, except the one for literal R.

Let M ′ be a set that consists of all the tuples of S ′, except the one for literal R. By

inductive hypothesis, M = M ′. Notice that M and M ′ are the only subsets of S and S ′,

respectively, that satisfy these conditions since S and S ′ contain exactly one tuple per

relation of I.

Let R′(~x′, ~y′) be the parent of R in T . Then, there is a tuple ~t1 in R′ and valuations

ν and ν′ such that ~t1 ∈ S, ~t1 ∈ S′, {~t0,~t1} |= R′(~x′, ~y′) ∧ R(~x, ~y)[~z/~t][ν], and {~t′0,~t1} |= R′(~x′, ~y′) ∧ R(~x, ~y)[~z/~t][ν′]. Notice that ν(~y′) = ν′(~y′). Since q ∈ Cforest, there is a full

nonkey-to-key join from R′ to R. Thus, all the variables of ~y′ appear in ~x. Therefore,

ν(~x) = ν′(~x); and ~c = ~c′. Assume towards a contradiction that ~t0 ≠ ~t′0. Then, there are

tuples R(~c, ~d) and R(~c′, ~d′) in I such that ~c = ~c′ and ~d ≠ ~d′. This means that I ⊭ Σ.

But I is a repair of I wrt Σ; contradiction.

In the next lemma, we show that for queries in Cforest such that the variables at key

positions of all root literals are free, the cardinality of each tuple in the query result is

exactly one.

Lemma 4.4. Let R be a schema. Let Σ be a set of integrity constraints, consisting of

one key dependency per relation of R. Let q(~z) be a conjunctive query over R such that

q ∈ Cforest. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals

at the root of each connected component (tree) of G. Assume that ~x1, . . . , ~xm are free

variables in q (i.e., they occur in ~z).

Let I be an instance over R. Let I be a repair of I wrt Σ. Let B be a bag such that

B = q(I) under bag semantics. Let ~t be such that ~t ∈ q(I). Then, |~t|B = 1.

Proof. Assume towards a contradiction that |~t|B > 1. Then, there are distinct sets S and

S ′ that contain exactly one tuple per literal of q and such that ~t ∈ q(S), and ~t ∈ q(S ′).

Since q ∈ Cforest, G is a forest. For each 1 ≤ i ≤ m, let Ti be the tree whose root is Ri.

Let φi(~w, ~z) be the conjunction of the literals of Ti. Let qi(~z) = ∃~w.φi(~w, ~z). Recall that

~xi (the variables at the key of the root literal of Ti) are free, and therefore occur in ~z.

Thus, qi satisfies the conditions of Lemma 4.3.

Since S ≠ S′, ~t ∈ q(S), and ~t ∈ q(S′), there must be some i and some sets M and M′

such that M ≠ M′, M ⊆ S, M′ ⊆ S′, M and M′ have one tuple for each relation symbol

in φi, ~t ∈ qi(M), and ~t ∈ qi(M′). But this contradicts Lemma 4.3 above.
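
The effect described by Lemma 4.4 can be observed on a small example. The sketch below assumes a hypothetical repair (so all keys hold) of two relations r(x, y) with key x and s(y, z) with key y, and evaluates the join under bag semantics twice: once keeping the root key x free, and once projecting it out.

```python
from collections import Counter

# A hypothetical repair: r(x, y) with key x, and s(y, z) with key y.
r = [(1, "a"), (2, "a"), (3, "b")]
s = [("a", 10), ("b", 10)]


def bag_join(project):
    """Evaluate r(x, y) ∧ s(y, z) under bag semantics, projecting each match."""
    return Counter(project(x, y, z) for x, y in r for y2, z in s if y == y2)


with_root_key = bag_join(lambda x, y, z: (x, z))    # q'(x, z): root key x is free
without_root_key = bag_join(lambda x, y, z: (z,))   # q*(z): root key x projected out

print(set(with_root_key.values()))   # {1}
print(without_root_key[(10,)])       # 3
```

With x free, every answer has multiplicity one, so set and bag semantics agree. Once x is projected out, the answer (10) appears three times, which is exactly the multiplicity that count(*) has to recover.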

4.3.2 An “Optimistic” Repair

Recall that in Chapter 3 we showed that for every query q in a subclass of Cforest, there is

a “pessimistic” repair M such that q(M) retrieves all the consistent answers to q. In Section

4.3.4, we will use M to prove the tightness of the query rewritings. For example, if

we apply an aggregate query on M, the value that we get for the count(*) aggregate

function corresponds to the greatest lower bound computed by the rewriting produced

by RewriteCount(q, Σ).

For the lowest upper bound, we will need the notion of an “optimistic” repair N . The

name “optimistic” comes from the fact that in this repair, if a tuple ~t can be obtained

from some repair of the inconsistent database, then the tuple is also in q(N ). In Lemma

4.6, we show the existence of such a repair.

Before proving the existence of the optimistic repair, we formally define the notion

of possible answers. This notion can be considered as dual to the notion of consistent

answers. While a consistent answer is one that holds in the query results obtained from

all the repairs, a possible answer is one that holds in the query result from at least one

repair.

Definition 4.5 (Possible Answers). Let R be a schema. Let Σ be a set of integrity

constraints. Let I be an instance over R (possibly inconsistent with respect to Σ). Let

q be a query over R. We say that a tuple ~t is a possible answer for q with respect to Σ

if there exists a repair I of I with respect to Σ such that ~t ∈ q(I). We denote this as

~t ∈ possibleΣ(q, I).

For a Boolean query q over R, we say that possibleΣ(q, I) = true if there exists a

repair I of I with respect to Σ such that I |= q. We say that possibleΣ(q, I) = false if

for every repair I of I with respect to Σ, I ⊭ q.
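
The duality between possible and consistent answers can be made concrete by brute force. The sketch below assumes a made-up employee(e, s) instance with key e and the query q(s) = ∃e. employee(e, s); it enumerates every repair, then takes the union of the query results for the possible answers and the intersection for the consistent answers.

```python
from collections import defaultdict
from itertools import product

# Hypothetical inconsistent instance of employee(e, s); e is the key.
employee = {("e1", 1000), ("e1", 2000), ("e2", 1000), ("e3", 3000), ("e3", 4000)}


def repairs(rel):
    """All repairs: keep exactly one tuple per key value."""
    by_key = defaultdict(list)
    for t in sorted(rel):
        by_key[t[0]].append(t)
    return [set(choice) for choice in product(*by_key.values())]


def q(instance):
    """q(s) = ∃e. employee(e, s)."""
    return {s for _, s in instance}


answers = [q(rep) for rep in repairs(employee)]
possible = set.union(*answers)          # holds in at least one repair
consistent = set.intersection(*answers)  # holds in every repair

print(sorted(possible))    # [1000, 2000, 3000, 4000]
print(sorted(consistent))  # [1000]
```

Enumerating repairs is exponential in the number of conflicting keys, so this is only suitable for checking small examples.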

Lemma 4.6. Let q(~x) be a query in Cforest, whose join graph T is a tree and where

R(~x, ~y) is the literal at the root of T . Let I be an instance. Then, there is a repair N such that, for all ~c, if ~c ∈ possibleΣ(q, I), then ~c ∈ q(N ).

Proof. Let N be the instance built by BuildOptimisticRepair(q, I) (the algorithm

given in Figure 4.3). We will prove the claim by induction on the number of

literals of q.

Base case. Assume that q consists of exactly one literal R(~x, ~y). Let ~t be the

tuple selected by the algorithm in the iteration for literal R and the vector of values ~c.

Assume towards a contradiction that N ⊭ ∃~w.R(~c, ~y). Then, {~t} ⊭ ∃~w.R(~c, ~y). Since

possibleΣ(∃~w.R(~c, ~y), I) = true, there is some repair I of I such that I |= ∃~w.R(~c, ~y).

Thus, there is a tuple ~t′ such that {~t′} |= ∃~w.R(~c, ~y). Notice that ~t and ~t′ can be added

to N only during the iteration for the vector of values ~c. Since {~t} ⊭ ∃~w.R(~c, ~y) and

{~t′} |= ∃~w.R(~c, ~y), the algorithm never selects tuple ~t. But ~t ∈ N ; contradiction.

Inductive step. Assume that q has more than one literal. Let φ(~w, ~x) be the

conjunction of literals of q. Let T1, . . . , Tm be the subtrees of T such that the root of

Tj is a child of the root of T , for 1 ≤ j ≤ m. For each 1 ≤ j ≤ m, let Sj(~xj, ~yj)

be the literal at the root of Tj. Let φj be the conjunction of the literals of Tj. Let

~wj = {w : w is a variable of φj, and w ∉ ~xj}. Let qj(~xj) = ∃~wj.φj(~xj, ~wj). Let

Nj = BuildOptimisticRepair(qj, I).

Assume towards a contradiction that ~c ∉ q(N ). Let ~t be the tuple of I selected by the

algorithm in the iteration for literal R and the vector of values ~c. Then, ~t ∈ N , and there

is some ~d such that ~t = R(~c, ~d). Since ~c ∉ q(N ), there must be some j, some valuation

ν for the variables of ~y, and some ~cj such that 1 ≤ j ≤ m, ν(~y) = ~d, ν(~xj) = ~cj, and

~cj ∉ qj(Nj).

Since possibleΣ(q(~c), I) = true, there is some repair I of I such that ~c ∈ q(I).

Thus, there is some tuple ~t′ in I, some ~d′, and some valuation ν for the variables of ~y

such that ~t′ = R(~c, ~d′), ν(~y) = ~d′, and the following condition holds: for every j and

tuple of values ~c′j such that 1 ≤ j ≤ m and ν(~xj) = ~c′j, we have that ~c′j ∈ qj(I). Thus,

possibleΣ(qj(~cj), I) = true. By inductive hypothesis ~c′j ∈ qj(Nj). Thus, the algorithm

selects ~t′ in the construction of N , rather than ~t. But ~t ∈ N ; contradiction.

Algorithm BuildOptimisticRepair

Input: q(~x), a query in Cforest of the form ∃~w.φ(~w, ~x),

       whose join graph T is a tree with root literal R(~x, ~y)

       Σ, a set of key constraints, one per relation

       I, a database instance

Initialize N as an empty instance

if φ has exactly one literal then

    for each ~c such that there is some R(~c, ~d) in I do

        if there is some ~d such that R(~c, ~d) ∈ I,

           and {R(~c, ~d)} |= ∃~w.R(~c, ~y) then

            Let ~t = R(~c, ~d)

        else

            Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d

        end if

        Add ~t to N

    end for

else

    /* φ has more than one literal */

    Let S1, . . . , Sm be the children of R in T

    for j := 1 to m do

        Let Tj be the subtree of T whose root is Sj

        Let φj be the conjunction of literals of Tj

        Let ~wj = {w : w is a variable that occurs in φj and ~w, and w ∉ ~xj}

        Let qj(~xj) = ∃~wj.φj(~xj, ~wj)

        Let Nj = BuildOptimisticRepair(qj, I)

        Add Nj to N

    end for

    for each ~c such that there is some R(~c, ~d) in I do

        if there is some ~d and some valuation ν for the variables of ~y such that R(~c, ~d) ∈ I,

           ν(~y) = ~d, and there is no j and ~cj such that ν(~xj) = ~cj and ~cj ∉ qj(Nj) then

            Let ~t = R(~c, ~d)

        else

            Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d

        end if

        Add ~t to N

    end for

end if

Figure 4.3: Algorithm to build the “optimistic” repair
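
The bottom-up construction can be sketched in Python for one fixed query shape. This is a simplified illustration, not the general algorithm: it assumes the chain query q(x) = ∃y, z. r(x, y) ∧ s(y, z) with keys x and y, so the tree has root r and a single child s, and the instance and function names are hypothetical. The result is compared with the possible answers obtained by enumerating all repairs.

```python
from collections import defaultdict
from itertools import product


def key_groups(rel):
    """Group tuples by their key (the first attribute)."""
    groups = defaultdict(list)
    for t in sorted(rel):
        groups[t[0]].append(t)
    return groups


def build_optimistic_repair(r, s):
    # Child literal s(y, z): any tuple per key value satisfies ∃z. s(y, z).
    n_s = {tuples[0] for tuples in key_groups(s).values()}
    s_keys = {y for y, _ in n_s}
    # Root literal r(x, y): prefer a tuple whose y joins with the child repair.
    n_r = set()
    for tuples in key_groups(r).values():
        joining = [t for t in tuples if t[1] in s_keys]
        n_r.add(joining[0] if joining else tuples[0])
    return n_r, n_s


def q(r, s):
    """q(x) = ∃y, z. r(x, y) ∧ s(y, z)."""
    s_keys = {y for y, _ in s}
    return {x for x, y in r if y in s_keys}


def all_repairs(rel):
    return [set(choice) for choice in product(*key_groups(rel).values())]


def possible_answers(r, s):
    return set().union(*(q(rr, rs) for rr in all_repairs(r) for rs in all_repairs(s)))


# Hypothetical inconsistent instance.
r = {(1, "a"), (1, "b"), (2, "c"), (3, "b"), (3, "d")}
s = {("b", 10), ("b", 20), ("d", 30)}

n_r, n_s = build_optimistic_repair(r, s)
print(possible_answers(r, s) <= q(n_r, n_s))  # True
```

For key 1 the construction picks (1, b) over (1, a) because b joins with s, which is what makes the answer 1 appear in the optimistic repair; key 2 never joins, so its tuple is chosen arbitrarily.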

4.3.3 Sound Ranges

In this subsection, we show that the ranges produced by the query rewritings are sound,

in the sense that the value of the aggregation function falls within the returned range on

every repair.

The next lemma shows that the rewritings produced by RewriteCount compute sound

ranges.

Lemma 4.7. Let R be a schema. Let Σ be a set of integrity constraints, consisting of

one key dependency per relation of R. Let q(~z, v) be a query of the following form:

select ~z, count(*)

from q∗(~z)

group by ~z

where q∗(~z) is a conjunctive query in Cforest.

Let Q be the first-order aggregate query returned by RewriteCount(q, Σ). Let I be a

database instance over R. Let I be a repair of I wrt Σ. Let ~t be a tuple, and low and up

be a pair of real numbers such that (~t, low, up) ∈ Q(I) and ~t ∈ consistentΣ(q∗, I). Let d

be such that (~t, d) ∈ q(I). Then, low ≤ d ≤ up.

Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the

roots of all trees of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1..m} ~xi,

let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′). Let ~x′ = ~x − ~z. Let

QConsistent(~x, ~z′) be the query obtained by invoking RewriteForest(q′, Σ).

Lower Bound. Since (~t, low, up) ∈ Q(I), the lower bound low of ~t is computed with

the following query:

QGlb(~z, glb) = select ~z, count(*)

from QConsistent(~x, ~z′)

group by ~z

Assume towards a contradiction that d < low. Then, there is a tuple (~c, ~t′) such

that (~c, ~t′) ∈ QConsistent(I) and (~c, ~t′) ∉ q′(I). Then, (~c, ~t′) ∉ consistentΣ(q′, I). By

Theorem 3.5, we conclude that (~c, ~t′) ∉ QConsistent(I); contradiction.

Upper Bound. Since (~t, low, up) ∈ Q(I), the upper bound up of ~t is computed with

the following query:

QLub(~z, lub) = select ~z, count(*)

from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~z

Assume towards a contradiction that d > up. Then, there is a valuation ν and a tuple

(~c, ~t′) such that ν(~x) = ~c, ν(~z′) = ~t′, ν(~z) = ~t, (~c, ~t′) ∈ q′(I), and either (1) (~c, ~t′) ∉ q′(I);

or (2) I ⊭ (∃~x′.QConsistent(~x, ~z′)).

Assume that (1) (~c, ~t′) ∉ q′(I). Since I is a repair of I, by Proposition 3.6, I ⊆ I.

Thus, (~c, ~t′) ∉ q′(I); contradiction. Assume that (2) I ⊭ (∃~x′.QConsistent(~x, ~z′)).

Recall that ~x′ = ~x − ~z. By Theorem 3.5, (~c′, ~t′) ∉ consistentΣ(q′, I), for every ~c′. In

particular, (~c, ~t′) ∉ consistentΣ(q′, I). Recall that there is a valuation ν for the variables

of ~x and ~z′ such that ν(~x) = ~c, ν(~z′) = ~t′ and ν(~z) = ~t. Thus, ~t ∉ consistentΣ(q∗, I);

contradiction.

The next lemma shows that the rewritings for queries with the sum operator compute

sound ranges.

Lemma 4.8. Let R be a schema. Let Σ be a set of integrity constraints, consisting of

one key dependency per relation of R. Let q(~z, v) be a query of the following form:

select ~z, sum(u)

from q∗(~z, u)

group by ~z

where q∗(~z, u) is a conjunctive query in Cforest.

Let Q be the first-order aggregate query returned by RewriteAgg(q, Σ). Let I be a

database instance over R. Let I be a repair of I wrt Σ. Let ~t be a tuple, and low and up

be a pair of real numbers such that (~t, low, up) ∈ Q(I) and ~t ∈ consistentΣ(q∗, I). Let d

be such that (~t, d) ∈ q(I). Then, low ≤ d ≤ up.

Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at

the roots of all trees of G. Let φ(~w, ~z, u) be the conjunction of literals of q∗. Let

~x = ⋃_{i=1..m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let ~x′ = ~x − ~z − u. Let

q′(~x, ~z′) = ∃ ~w′, u.φ(~x, ~w′, ~z′, u). Let QConsistent(~x, ~z′) be the query obtained by in-

voking RewriteForest(q′, Σ). Let q′′ be the query q′′(~x, ~z′, u) = ∃ ~w′.φ(~x, ~w′, ~z′, u).

Lower Bound. Since (~t, low, up) ∈ Q(I), the lower bound low of ~t is computed with

the following query:

QGlb(~z, glb) = select ~z, sum(v)

from QContribConsistent(~x, ~z′, v) ∨ QContribNonConsistent(~x, ~z′, v)

group by ~z

where QContribConsistent is the following query:

QContribConsistent(~x, ~z′, bottom) =

select ~x, ~z′, min(u) as bottom

from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)

group by ~x, ~z′

having bottom ≥ 0

and QContribNonConsistent is the following query:

QContribNonConsistent(~x, ~z′, bottom) =

select ~x, ~z′, min(u) as bottom

from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~x, ~z′

having bottom < 0

Assume towards a contradiction that d < low. Since (~t, d) ∈ q(I), we must consider

the following cases.

First, assume that there is a valuation ν for the variables in ~z, ~x such that ν(~z) = ~t,

ν(~x) = ~c , ν(~z′) = ~t′, and

• (~c, ~t′) ∉ q′(I); and

• there is some e such that e > 0; and

• (~c, ~t′, e) ∈ (QContribConsistent ∨ QContribNonConsistent)(I).

Since e > 0, (~c, ~t′, e) ∈ QContribConsistent(I). Since (~c, ~t′) ∉ q′(I), (~c, ~t′) ∉ consistentΣ(q′, I). By Theorem 3.5, we conclude that (~c, ~t′) ∉ QConsistent(I). Therefore,

(~c, ~t′, e) ∉ QContribConsistent(I); contradiction.

Second, assume that there is a valuation ν for the variables in ~z, ~x such that ν(~z) = ~t,

ν(~x) = ~c , ν(~z′) = ~t′, and

• there is some e′ such that e′ < 0; and

• (~c, ~t′, e′) ∈ q′′(I); and

• for every e such that e < 0, we have that (~c, ~t′, e) ∉ (QContribConsistent ∨ QContribNonConsistent)(I).

Since I ⊆ I and (~c, ~t′, e′) ∈ q′′(I), we have that (~c, ~t′, e′) ∈ q′′(I). Since, by hypothesis,

~t ∈ consistentΣ(q∗, I), (~c′, ~t′) ∈ consistentΣ(q′, I) for some ~c′. By Theorem

3.5, (~c′, ~t′) ∈ QConsistent(I). Thus, I |= ∃~x′.QConsistent(~x, ~z′)[~z/~t]. Since e′ < 0,

(~c, ~t′, e′) ∈ q′′(I) and I |= ∃~x′.QConsistent(~x, ~z′)[~z/~t], we conclude that (~c, ~t′, e′) ∈ QContribNonConsistent(I); contradiction.

Third, assume that there is a valuation ν for the variables in ~z, ~x such that ν(~z) = ~t,

ν(~x) = ~c , ν(~z′) = ~t′, and

• there is some e such that (~c, ~t′, e) ∈ (QContribConsistent ∨ QContribNonConsistent)(I);

and

• there is some e′ such that e′ < e; and

• (~c, ~t′, e′) ∈ q′′(I).

Assume that (~c, ~t′, e) ∈ QContribConsistent(I). Then, (~c, ~t′) ∈ QConsistent(I),

and (~c, ~t′, e) ∈ q′′(I). Since I ⊆ I, and (~c, ~t′, e′) ∈ q′′(I), we have that (~c, ~t′, e′) ∈ q′′(I). Notice that e and e′ correspond to the attribute bottom of QContribConsistent.

This attribute is computed as min(u), that is the minimum of the values of u for the

tuples of (~c, ~t′). Since (~c, ~t′, e) and (~c, ~t′, e′) satisfy the conditions of the from clause of

QContribConsistent, e < e′; contradiction.

Now, assume that (~c, ~t′, e) ∈ QContribNonConsistent(I). Since I ⊆ I, (~c, ~t′, e′) ∈ q′′(I). Since e corresponds to the attribute bottom of QContribNonConsistent, e < e′;

contradiction.

Upper Bound. The proof for the lowest upper bound is analogous to the proof for

the greatest lower bound.

The next lemma shows that the rewritings for queries with the min and max aggrega-

tion functions compute sound ranges.

Lemma 4.9. Let R be a schema. Let Σ be a set of integrity constraints, consisting of

one key dependency per relation of R. Let q(~z, v) be a query of the following form:

select ~z, [min(u)| max(u)]

from q∗(~z, u)

group by ~z

where q∗(~z, u) is a conjunctive query in Cforest.

Let Q be the first-order aggregate query returned by RewriteAgg(q, Σ). Let I be a

database instance over R. Let I be a repair of I wrt Σ. Let ~t be a tuple, and low and up

be a pair of real numbers such that (~t, low, up) ∈ Q(I) and ~t ∈ consistentΣ(q∗, I). Let d

be such that (~t, d) ∈ q(I). Then, low ≤ d ≤ up.

Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at

the roots of all trees of G. Let φ(~w, ~z, u) be the conjunction of literals of q∗. Let

~x = ⋃_{i=1..m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let ~x′ = ~x − ~z − u. Let

q′(~x, ~z′) = ∃ ~w′, u.φ(~x, ~w′, ~z′, u). Let QConsistent(~x, ~z′) be the query obtained by in-

voking RewriteForest(q′, Σ). Let q′′ be the query q′′(~x, ~z′, u) = ∃ ~w′.φ(~x, ~w′, ~z′, u).

Lower Bound. Suppose that the aggregate function of q is max. Since (~t, low, up) ∈ Q(I), the lower bound low of ~t is computed with the following query:

QGlb(~z, glb) = select ~z, max(u)

from QContribConsistent(~x, ~z′, u)

group by ~z

where QContribConsistent is the following query:

QContribConsistent(~x, ~z′, bottom) =

select ~x, ~z′, min(u) as bottom

from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)

group by ~x, ~z′

Assume towards a contradiction that d < low. Then, there is a valuation ν for the

variables in ~z, ~x such that ν(~z) = ~t, ν(~x) = ~c , ν(~z′) = ~t′, and

• there is some e such that (~c, ~t′, e) ∈ QContribConsistent(I); and

• there is some e′ such that e′ < e; and

• (~c, ~t′, e′) ∈ q′′(I).

We arrive at a contradiction by the same arguments used in Lemma 4.8 for queries

with the sum operator.

Now, suppose that the aggregate function of q is min. Since (~t, low, up) ∈ Q(I), the

lower bound low of ~t is computed with the following query:

QGlb(~x, ~z, bottom) =

select ~z, min(bottom)

from QContribNonConsistent(~x, ~z′, u)

group by ~z

where QContribNonConsistent is the following query:

select ~x, ~z′, min(u) as bottom

from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~x, ~z′

Assume towards a contradiction that d < low. Then, there is a valuation ν for the

variables in ~z, ~x such that ν(~z) = ~t, ν(~x) = ~c , ν(~z′) = ~t′, and

• there is some e such that (~c, ~t′, e) ∈ QContribNonConsistent(I); and

• there is some e′ such that e′ < e; and

• (~c, ~t′, e′) ∈ q′′(I).

We arrive at a contradiction by the same arguments used in Lemma 4.8 for queries

with the sum operator.

Upper Bound. For the max operator, we can give an argument analogous to the

argument given for the lower bound of the min operator. For the min operator, we

can give an argument analogous to the argument given for the lower bound of the max

operator.

4.3.4 Tight Ranges

In this section, we show that the ranges produced by the query rewritings are tight. For

this, we must exhibit two repairs, where the result of the aggregation function corresponds

to the greatest lower bound in one repair, and to the lowest upper bound in the other. For

example, if the query has the count(*) operator, the repair that we need for the greatest

lower bound turns out to be the “pessimistic” repair M used in the correctness proof of

the first-order rewritings of Section 3.3.3. For the lowest upper bound, the needed repair

is the “optimistic” repair N that we introduced in Section 4.3.2.
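
For a one-literal count(*) query, the two witnessing repairs are easy to build by hand for a fixed group. The sketch below is a simplified stand-in for the pessimistic and optimistic repair constructions, under stated assumptions: the employee(e, s) instance with key e is made up, the query is "select s, count(*) from employee(e, s) group by s", and the helper repair_for_group is hypothetical. The pessimistic choice avoids salary s whenever a key offers an alternative, and the optimistic choice picks s whenever it is available.

```python
from collections import defaultdict

# Hypothetical inconsistent instance of employee(e, s); e is the key.
employee = {("e1", 1000), ("e1", 2000), ("e2", 1000),
            ("e3", 1000), ("e3", 3000), ("e4", 3000)}


def repair_for_group(rel, s, optimistic):
    """Build one repair that maximizes (optimistic) or minimizes the count for group s."""
    by_key = defaultdict(list)
    for t in sorted(rel):
        by_key[t[0]].append(t)
    repair = set()
    for tuples in by_key.values():
        # Optimistic: prefer tuples with salary s. Pessimistic: prefer any other salary.
        preferred = [t for t in tuples if (t[1] == s) == optimistic]
        repair.add(preferred[0] if preferred else tuples[0])
    return repair


def count_in(instance, s):
    return sum(1 for _, salary in instance if salary == s)


pessimistic = repair_for_group(employee, 1000, optimistic=False)
optimistic = repair_for_group(employee, 1000, optimistic=True)
print(count_in(pessimistic, 1000), count_in(optimistic, 1000))  # 1 3
```

For salary 1000 the two repairs give counts 1 and 3, so the range [1, 3] is attained at both ends. Note that the witnessing repairs are chosen per group: on this instance no single repair attains the lower bound for salaries 1000 and 3000 at the same time, because key e3 must take one of the two.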

We start by showing that the rewritings produced by RewriteCount give tight bounds.

In the next lemma, we show that the greatest lower bound of count(*) can be obtained

by executing the query on the pessimistic repair M. We also show that the query

rewriting that we obtain correctly returns this bound.

Lemma 4.10. Let R be a schema. Let Σ be a set of integrity constraints, consisting of

one key dependency per relation of R. Let q(~z) be a query of the following form:

select ~z, count(*)

from q∗(~z)

group by ~z

where q∗(~z) is a query in Cforest.

Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots of each tree of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′). Let Q(~z, l, u) be the

first-order aggregate query returned by RewriteCount(q, Σ). Let I be an instance over

R. Let ~t be a tuple and low and up be a pair of real numbers.

Then, there is a repair M of I wrt Σ and a bag B such that B = q(M), and the

following conditions hold:

1. for every valuation ν such that ν(~x) = ~c, ν(~z′) = ~t′, and ν(~z) = ~t, if (~c, ~t′) ∈ q′(M),

then ~c ∈ consistentΣ(q′[~z/~t], I), and

2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = low, and

3. if (~t, low, up) ∈ Q(I), then |~t|B = low.


Proof. Let M be the pessimistic repair obtained by invoking the algorithm BuildPessimisticRepair(q, Σ, I). Condition (1) holds by Lemma 3.10. We must now prove Conditions (2) and (3).

In order to prove Condition 2, let ~t be a tuple, and let low and up be a pair of real

numbers such that (~t, low, up) ∈ aggconsistentΣ(q, I). Then, there is a repair I of I

wrt Σ and a bag B′ such that B′ = q(I) and |~t|B′ = low. Furthermore, by Lemma

4.7, since M is a repair of I wrt Σ, |~t|B ≥ low. Assume towards a contradiction that

|~t|B > low. Then, there is a valuation ν for the variables of ~x and ~z such that ν(~x) = ~c,

ν(~z) = ~t and ν(~z′) = ~t′, and one of the following conditions holds:

• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or

• (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I).

Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now, assume that (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I). Then, (~c, ~t′) ∉ consistentΣ(q′[~z/~t], I). By Condition 1, we have that (~c, ~t′) ∉ q′[~z/~t](M); contradiction.

In order to prove Condition 3, let ~t, low, and up be such that (~t, low, up) ∈ Q(I).

Since M is a repair of I, by Lemma 4.7, |~t|B ≥ low. Let QConsistent(~x, ~z′) be the query

obtained by invoking RewriteForest(q′, Σ). Then, the lower bound low of ~t is computed

with the following query:

QGlb(~z, low) = select ~z, count(*)

from QConsistent(~x, ~z′)

group by ~z

Assume towards a contradiction that |~t|B > low. Then, there is a valuation ν for the

variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and one of the following

conditions holds:

• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or

• (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I).

Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now, assume that (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I). Since (~c, ~t′) ∉ QConsistent(I), by Theorem 3.5, (~c, ~t′) ∉ consistentΣ(q′, I). Then, by Condition 1, we have that (~c, ~t′) ∉ q′(M); contradiction.


In the next lemma, we show that the lowest upper bound of count(*) can be obtained

by executing q on the optimistic repair N . We also show that the query rewriting of q

correctly returns this bound.

Lemma 4.11. Let R be a schema. Let Σ be a set of integrity constraints, consisting

of one key dependency per relation of R. Let q(~z) be a query in Cforest of the following

form:

select ~z, count(*)

from q∗(~z)

group by ~z

where q∗(~z) is a query in Cforest.

Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots of each tree of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′). Let Q(~z, l, u) be the

first-order aggregate query returned by RewriteCount(q, Σ). Let I be an instance over

R. Let ~t be a tuple and low and up be a pair of real numbers.

Then, there is a repair N of I wrt Σ and a bag B such that B = q(N ), and the

following conditions hold:

1. for every valuation ν such that ν(~x) = ~c and ν(~z) = ~t, if ~c ∈ possibleΣ(q′[~z/~t], I),

then ~c ∈ q′[~z/~t](N ), and

2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = up, and

3. if (~t, low, up) ∈ Q(I), then |~t|B = up.

Proof. Let N be the optimistic repair obtained by invoking the algorithm BuildOptimisticRepair(q, Σ, I). Condition (1) holds by Lemma 4.6. We must now prove Conditions (2) and (3).

In order to prove Condition 2, let ~t be a tuple, and low and up be real numbers such

that (~t, low, up) ∈ aggconsistentΣ(q, I). Then, there is a repair I of I wrt Σ and a bag

B′ such that B′ = q(I) and |~t|B′ = up. Furthermore, since N is a repair of I wrt Σ, by

Lemma 4.7, |~t|B ≤ up. Assume towards a contradiction that |~t|B < up. Then, there is a

valuation ν for the variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and

one of the following conditions holds:


• (~c, ~t′) ∈ q′(I) and |(~c, ~t′)|B′ > 1; or

• (~c, ~t′) ∉ q′(N ) and (~c, ~t′) ∈ q′(I).

Assume that (~c, ~t′) ∈ q′(I) and |(~c, ~t′)|B′ > 1. This contradicts Lemma 4.4. Now, assume that (~c, ~t′) ∉ q′(N ) and (~c, ~t′) ∈ q′(I). Then, ~c ∈ possibleΣ(q′[~z/~t], I). By Condition 1, we have that ~c ∈ q′[~z/~t](N ); contradiction.

In order to prove Condition 3, let ~t, low, and up be such that (~t, low, up) ∈ Q(I).

Since N is a repair of I, by Lemma 4.7, |~t|B ≤ up. Let ~x′ = ~x − ~z. Let QConsistent(~x, ~z′)

be the query obtained by invoking RewriteForest(q′, Σ). Since (~t, low, up) ∈ Q(I), the

upper bound up of ~t is computed with the following query:

QLub(~z, up) = select ~z, count(*)

from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~z

Assume towards a contradiction that |~t|B < up. Then, there is a valuation ν for the

variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and either:

• (~c, ~t′) is accounted for more than once in the from clause of QLub; or

• (~c, ~t′) ∉ q′(N ), (~c, ~t′) ∈ q′(I), and I |= (∃~x′.QConsistent[~z/~t]).

Assume that (~c, ~t′) is accounted for more than once in the from clause of QLub. This is a contradiction since, by definition, the from clause of a first-order aggregate query is computed using set semantics. Now, assume that (~c, ~t′) ∉ q′(N ), (~c, ~t′) ∈ q′(I), and I |= (∃~x′.QConsistent[~z/~t]). Since (~c, ~t′) ∈ q′(I), we have that ~c ∈ possibleΣ(q′[~z/~t], I). Thus, by Condition 1, ~c ∈ q′[~z/~t](N ); contradiction.

For the unary operators, the proof of tightness proceeds in an analogous way, except

that the optimistic and pessimistic repairs have to be modified to ensure every tuple has

the minimum (or maximum, depending on the case) for attribute u. We next show how

to obtain a pessimistic repair for queries with the sum operator.

Algorithm BuildPessimisticRepairForSum(q, I, M∗)

Input: A query q of the form

  select ~z, sum(u)
  from q∗(~z)
  group by ~z

  where q∗ is a conjunctive query in Cforest

  I, an instance

  M∗, a pessimistic repair

Output: M, a pessimistic repair

Initialize M as M∗
Let R(~x, ~y) be the literal of q where u appears
for each tuple R(~c, ~d) of M do
  Let ν be a valuation for the variables of R such that ν(~x) = ~c and ν(~y) = ~d
  for every valuation ν′ for the variables of R such that ν′(~x) = ~c′, ν′(~y) = ~d′, R(~c′, ~d′) ∈ I, and ν(z) = ν′(z) for every z such that z ≠ u do
    if ν′(u) < ν(u) then
      Replace R(~c, ~d) with R(~c′, ~d′) in M
    end if
  end for
end for

Notice in the algorithm that a tuple R(~c, ~d) is replaced only if there is another tuple

with the same values, except for the attribute u, and the other tuple has a smaller value

on u (condition ν ′(u) < ν(u) in the algorithm). In the rewriting for the lower bound of

the sum operator, this corresponds to the fact that for positive values we aggregate over

the minimum value of u for all tuples in the intermediate result. In contrast, for the upper

bound, we aggregate over the maximum value of u. Thus, for the upper bound, a similar

algorithm can be used, where we replace tuples for which the condition ν ′(u) > ν(u)

is satisfied. Since we choose the conditions that correspond to positive numbers in the

rewriting given in RewriteAgg, the tightness results for the sum operator need to restrict

the domain of the aggregated value to range over positive numbers (for min and max we

do not have this restriction). In Figure 4.4, we summarize the repairs that must be

modified in order to obtain the tight bounds of each aggregation function, and which

condition must be checked.
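The replacement step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: tuples of the relation R are plain Python tuples, `u_pos` (a name introduced here) is the position of the aggregated attribute u, and the sketch keeps a running minimum, which is how we read the intent of the replacement loop. Setting the hypothetical flag `keep_smaller=False` gives the variant for the upper bound, which checks ν′(u) > ν(u) instead.

```python
def build_pessimistic_repair_for_sum(instance, m_star, u_pos, keep_smaller=True):
    """Swap each tuple of m_star for the tuple of `instance` that agrees with
    it on every attribute except position u_pos and has the smallest value
    there (or the largest, when keep_smaller is False)."""

    def rest(tup):
        # All attributes except the aggregated one.
        return tup[:u_pos] + tup[u_pos + 1:]

    repaired = []
    for current in m_star:
        for candidate in instance:
            if rest(candidate) != rest(current):
                continue  # differs on some attribute other than u
            if keep_smaller:
                better = candidate[u_pos] < current[u_pos]
            else:
                better = candidate[u_pos] > current[u_pos]
            if better:
                current = candidate  # "Replace R(c, d) with R(c', d') in M"
        repaired.append(current)
    return repaired


# Hypothetical relation R(x, z, u) with key x and aggregated attribute u.
I = [(1, "a", 5), (1, "a", 3), (1, "b", 1), (2, "a", 7)]
M_star = [(1, "a", 5), (2, "a", 7)]
print(build_pessimistic_repair_for_sum(I, M_star, u_pos=2))
```

Note that the tuple (1, "b", 1) is never chosen, even though its u value is smaller: it differs from (1, "a", 5) on an attribute other than u, so it fails the agreement condition of the inner loop.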

The following lemma shows that the greatest lower bound computed for the sum

operator can be obtained from the pessimistic repair computed with the procedure given

above. We also show that our query rewriting correctly returns this bound.


Function   Bound   Repair        Condition
max        glb     pessimistic   ν′(u) < ν(u)
max        lub     optimistic    ν′(u) > ν(u)
sum        glb     pessimistic   ν′(u) < ν(u)
sum        lub     optimistic    ν′(u) > ν(u)
min        glb     optimistic    ν′(u) < ν(u)
min        lub     pessimistic   ν′(u) > ν(u)

Figure 4.4: Repairs that must be used to obtain the tight bounds of unary operators

Lemma 4.12. Let R be a schema. Let Σ be a set of integrity constraints, consisting of

one key dependency per relation of R. Let q(~z) be a query of the following form:

select ~z, sum(u)

from q∗(~z, u)

group by ~z

where q∗(~z, u) is a conjunctive query in Cforest and u ranges over the positive numbers.

Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots of each tree of G. Let φ(~w, ~z, u) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃ ~w′, u.φ(~x, ~w′, ~z′, u). Let Q(~z, l, u)

be the first-order aggregate query returned by RewriteAgg(q, Σ). Let I be an instance

over R. Let ~t be a tuple and low and up be a pair of real numbers. Let q′′(~x, ~z′, u) =

∃ ~w′.φ(~x, ~w′, ~z′, u).

Then, there is a repair M of I wrt Σ and some value d such that (~t, d) ∈ q(M), and

the following conditions hold:

1. for every valuation ν such that ν(~x) = ~c, ν(~z′) = ~t′, and ν(~z) = ~t, if (~c, ~t′) ∈ q′(M),

then ~c ∈ consistentΣ(q′[~z/~t], I), and

2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then d = low, and

3. if (~t, low, up) ∈ Q(I), then d = low.

Proof. Let M∗ be the repair obtained by invoking the algorithm BuildPessimisticRepair(q, Σ, I). Let M be the repair obtained by invoking the algorithm BuildPessimisticRepairForSum(q, I, M∗). Condition (1) holds by Lemma 3.10. We must now prove Conditions (2) and (3).


In order to prove Condition 2, let ~t be a tuple, and let low and up be a pair of real

numbers such that (~t, low, up) ∈ aggconsistentΣ(q, I). Then, there is a repair I of I

wrt Σ such that (~t, low) ∈ q(I). Furthermore, by Lemma 4.8, since M is a repair of I wrt

Σ, d ≥ low. Assume towards a contradiction that d > low. Let B = q′(M). Then, there

is a valuation ν for the variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′,

and one of the following conditions holds:

• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or

• there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and (~c, ~t′, e′) ∈ q′(I); or

• (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I).

Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now, assume that there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and (~c, ~t′, e′) ∈ q′(I). Let ν′ and ν′′ be valuations such that for every w ≠ u, ν(w) = ν′(w) and ν(w) = ν′′(w); ν′(w) = e; and ν′′(w) = e′. Since M is constructed using the algorithm BuildPessimisticRepairForSum and I ⊆ I, ν′(w) < ν′′(w). Thus, e < e′; contradiction. Finally, assume that (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I). Then, (~c, ~t′) ∉ consistentΣ(q′[~z/~t], I). By Condition 1, we have that (~c, ~t′) ∉ q′[~z/~t](M); contradiction.

In order to prove Condition 3, let ~t, low, and up be such that (~t, low, up) ∈ Q(I).

Since M is a repair of I, by Lemma 4.8, d ≥ low. Let QConsistent(~x, ~z′) be the query

obtained by invoking RewriteForest(q′, Σ). Since u ranges only over positive numbers,

the lower bound low of ~t is computed with the following query:

QGlb(~z, glb) = select ~z, sum(v)

from QContribConsistent(~x, ~z′, v)

group by ~z

where QContribConsistent is the following query:

QContribConsistent(~x, ~z′, bottom) =

select ~x, ~z′, min(u) as bottom

from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)

group by ~x, ~z′


having bottom ≥ 0

Assume towards a contradiction that d > low. Then, there is a valuation ν for the

variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and one of the following

conditions holds:

• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or

• there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and

(~c, ~t′, e′) ∈ QContribConsistent(I); or

• (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I).

Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now, assume that there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and (~c, ~t′, e′) ∈ QContribConsistent(I). Since e′ is computed as min(u) in QContribConsistent, and M ⊆ I, e′ < e; contradiction. Finally, assume that (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I). Since (~c, ~t′) ∉ QConsistent(I), by Theorem 3.5, we have that (~c, ~t′) ∉ consistentΣ(q′, I). Then, by Condition 1, we have that (~c, ~t′) ∉ q′(M); contradiction.

Notice that the proof above is similar to the one for Lemma 4.10, except that we need

to account for the fact that each tuple may contribute a value greater than one. A proof

similar to Lemma 4.11 can be given for the lowest upper bound.

4.3.5 Putting It All Together

The next lemma states the correctness of the algorithm RewriteCount. The correctness

for the unary operators can be obtained analogously by employing the optimistic and

pessimistic repairs as shown in Figure 4.4.

Lemma 4.13. Let R be a schema. Let Σ be a set of integrity constraints, consisting

of one key dependency per relation of R. Let q(~z) be a query in Cforest of the following

form:

select ~z, count(*)

from q∗(~w, ~z)

group by ~z


Let Q(~z, l, u) be the first-order aggregate query returned by RewriteCount(q, Σ). Let

I be an instance over R. Then, for every tuple ~t, and pair of real numbers low and up,

we have that (~t, low, up) ∈ aggconsistentΣ(q, I) iff (~t, low, up) ∈ Q(I).

Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at

the roots of all trees of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Following the

algorithm RewriteCount, let ~x = ⋃_{i=1...m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let ~x′ = ~x − ~z. Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′). Let QConsistent(~x, ~z′) be the query obtained

by invoking RewriteForest(q′, Σ).

(⇒) Let ~t be a tuple and low and up be real numbers such that (~t, low, up) ∈ aggconsistentΣ(q, I). By Lemma 4.10, there is a “pessimistic” repair M of I wrt Σ

and a bag B such that B = q(M), and the following conditions hold:

1. for every valuation ν such that ν(~x) = ~c, ν(~z′) = ~t′, and ν(~z) = ~t, if (~c, ~t′) ∈ q′(M),

then ~c ∈ consistentΣ(q′[~z/~t], I), and

2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = low.

Since (~t, low, up) ∈ aggconsistentΣ(q, I), by item (2) above, |~t|B = low. Assume towards a contradiction that (~t, low, up) ∉ Q(I). Let low′ be a value computed as follows:

QGlb(~z, low′) = select ~z, count(*)

from QConsistent(~x, ~z′)

group by ~z

Assume that low′ < low. Then, there is a valuation ν for the variables of ~x and ~z

such that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, and one of the following conditions holds:

• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or

• (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I).

Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now, assume that (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I). By Theorem 3.5, (~c, ~t′) ∉ consistentΣ(q′, I). By Condition 1 above, (~c, ~t′) ∉ q′(M); contradiction.

Assume towards a contradiction that low′ > low. Then, there is a valuation ν for the variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, (~c, ~t′) ∉ q′(M) and (~c, ~t′) ∈ QConsistent(I). Since (~c, ~t′) ∈ QConsistent(I), by Theorem 3.5, (~c, ~t′) ∈ consistentΣ(q′, I). Then, since M is a repair of I wrt Σ, we have that (~c, ~t′) ∈ q′(M); contradiction.

By Lemma 4.11, there is an “optimistic” repair N of I wrt Σ and a bag B such that

B = q(N ), and the following conditions hold:

1. for every valuation ν such that ν(~x) = ~c and ν(~z) = ~t, if ~c ∈ possibleΣ(q′[~z/~t], I),

then ~c ∈ q′[~z/~t](N ), and

2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = up.

Since (~t, low, up) ∈ aggconsistentΣ(q, I), by item (2) above, |~t|B = up. Assume towards a contradiction that (~t, low, up) ∉ Q(I). Let up′ be a value computed as follows:

QLub(~z, up′) = select ~z, count(*)

from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))

group by ~z

Assume that up′ > up. Then, there is a valuation ν for the variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, (~c, ~t′) ∉ q′(N ), (~c, ~t′) ∈ q′(I), and I |= ∃~x′.QConsistent(~x, ~t′). Since (~c, ~t′) ∈ q′(I), (~c, ~t′) ∈ possibleΣ(q′, I). Thus, by Lemma 4.6, (~c, ~t′) ∈ q′(N ); contradiction.

Assume that up′ < up. Then, there is a valuation ν for the variables of ~x and ~z such

that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, and one of the following two cases holds. First,

(~c, ~t′) ∈ q′(N ) and |(~c, ~t′)|B > 1. But this contradicts Lemma 4.4. Second, (~c, ~t′) ∈ q′(N ) and either (1) (~c, ~t′) ∉ q′(I), or (2) I ⊭ ∃~x′.QConsistent(~x, ~t′). Assume that (1) (~c, ~t′) ∉ q′(I). Since N is a repair of I wrt Σ, N ⊆ I. Thus, (~c, ~t′) ∉ q′(N ); contradiction. Assume that (2) I ⊭ ∃~x′.QConsistent(~x, ~t′). Recall that ~x′ = ~x − ~z. By Theorem 3.5, (~c′, ~t′) ∉ consistentΣ(q′, I), for every ~c′. In particular, (~c, ~t′) ∉ consistentΣ(q′, I). Thus, (~c, ~t′) ∉ q′(N ); contradiction.

(⇐) Let ~t be a tuple and low and up be real numbers such that (~t, low, up) ∈ Q(I). In

order to prove that (~t, low, up) ∈ aggconsistentΣ(q, I), we must show that:

1. For every repair I of I wrt Σ, if B = q(I), then low ≤ |~t|B ≤ up.

2. There is a repair I of I wrt Σ, and a bag B such that B = q(I) and |~t|B = low.


3. There is a repair I of I wrt Σ, and a bag B such that B = q(I) and |~t|B = up.

Claim 1 follows by Lemma 4.7. Claim 2 follows by Lemma 4.10. Claim 3 follows by

Lemma 4.11.
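As an illustration of what the lemma asserts, the following Python sketch computes the count(*) range in the style of QGlb and QLub for the simplest possible case: a single relation R(x, z) with key x, grouped by z. The specialisation to one relation, the function name, and the sample data are assumptions made for this example; it is not the output of RewriteCount. In this special case, QConsistent(x, z) holds exactly when every tuple with key x carries the same value z.

```python
from collections import defaultdict


def count_ranges(instance):
    """Return {z: (low, up)} for `select z, count(*) ... group by z`
    over R(x, z) with key x, following the shape of QGlb and QLub."""
    values_by_key = defaultdict(set)
    for x, z in instance:
        values_by_key[x].add(z)

    # QConsistent(x, z): key x is associated with a single value z.
    consistent = {
        (x, next(iter(zs))) for x, zs in values_by_key.items() if len(zs) == 1
    }

    # QGlb: count the consistent pairs in each group.
    low = defaultdict(int)
    for _, z in consistent:
        low[z] += 1

    # QLub: count the pairs of q'(x, z) under set semantics, but only for
    # groups that contain at least one consistent key.
    up = defaultdict(int)
    for _, z in set(instance):
        if z in low:
            up[z] += 1

    return {z: (low[z], up[z]) for z in low}


# Hypothetical instance: keys 2 and 3 violate the key constraint.
R = [(1, "a"), (2, "a"), (2, "b"), (3, "a"), (3, "c")]
print(count_ranges(R))
```

Groups "b" and "c" are absent from the result because no key is consistently associated with them, so some repair contains no tuple for these groups.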

4.4 Related Work

Our work on aggregation is inspired by Arenas et al. [ABC+03b], who were the first to

propose the use of ranges in a semantics for consistent query answering. The work of

Arenas et al. is restricted to queries of the following form:

select F (A)

from r

where F is an aggregation function, r is a single relation, and A is an attribute from

r. Notice that such queries have no grouping and no selection or join conditions (i.e., no

where clause). In this chapter, we consider a much richer class of queries. For the class

of queries considered by Arenas et al., the semantics proposed in their paper and our

semantics for aggregate queries coincide. However, we need to extend their semantics in

order to be able to deal with queries that perform grouping.

In their paper, Arenas et al. [ABC+03b] consider functional dependencies. If there

is exactly one functional dependency on the (only) relation of the query, they show that

the problem of obtaining the lowest upper and greatest lower bounds is tractable for the

count(*), min, max, sum, and avg functions. Except for avg, we considered all these

functions in our class Caggforest. Arenas et al. also show the intractability of queries with

the count(distinct) operator and exactly one functional dependency. If the relation

of the query has more than one functional dependency, they show that the problem

of obtaining tight bounds is intractable for all the aggregate functions they consider

(count(*), min, max, sum, avg, and count(distinct)). This gives further evidence of

the maximality of the class considered in this chapter: going from one to two functional

dependencies may lead to intractability even for queries on just one relation and with no

grouping.

Chapter 5

Complexity-Theoretic Analysis

In the previous chapters, we presented query rewriting algorithms that work on a broad

class of queries. In this chapter, we show the maximality of this class based on complexity-

theoretic arguments. In Section 5.1, we show that minimal relaxations of the conditions of

the class lead to intractability. Then, in Section 5.2, we embark on a more ambitious goal:

for a large class of conjunctive queries, we show that the conditions of the class Cforest

presented in Chapter 3 are not only sufficient, but they are also necessary conditions for

a query to be first-order rewritable.

5.1 Minimal Relaxations of Cforest

In this section, we show that minimal relaxations of the conditions of Cforest lead to

intractability. In particular, we show the intractability of the problem of computing

consistent answers for: (1) a conjunctive query whose join graph is a cycle of length

two; and (2) a conjunctive query whose join graph is a forest, but the query has some

nonkey-to-key joins that are not full.

Chomicki and Marcinkowski [CM05] proved that the problem of computing consistent

answers for a query with a single nonkey-to-nonkey join is coNP-complete. Their result

used a query with repeated relation symbols (specifically, a query with only two literals

both for a single relation R). We can use their insight to show that the problem of

computing consistent answers for the following query without repeated relation symbols,

but with a single nonkey-to-nonkey join is also coNP-complete.

qnk = ∃x, x′, y.S1(x, y) ∧ S2(x′, y)


Notice that qnk has a cycle of length two (actually, a nonkey-to-nonkey join), and

no nonkey-to-key joins. Our proof of hardness is a simple modification to the results of Chomicki and Marcinkowski [CM05] and uses a reduction from the problem

MONOTONE-3SAT, which is well known to be NP-complete. The only difference between

the MONOTONE-3SAT and 3SAT problems is that the former assumes that the input 3CNF

propositional formula is monotone. That is, each clause Φi contains either positive or

negative atoms, but not both. We shall say that a clause that contains only positive

(negative) atoms is a positive (negative) clause.
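The monotonicity condition is easy to state as a check. In the Python sketch below, the encoding of a clause as a list of (atom, is_positive) pairs is an assumption made for this illustration.

```python
def is_monotone(clauses):
    """Return True if no clause mixes positive and negative atoms.

    Each clause is a list of (atom, is_positive) literals.
    """
    return all(
        len({is_positive for _, is_positive in clause}) <= 1
        for clause in clauses
    )


# (p or q or r) and (not p or not s or not t): monotone
print(is_monotone([
    [("p", True), ("q", True), ("r", True)],
    [("p", False), ("s", False), ("t", False)],
]))
# (p or not q or r): not monotone
print(is_monotone([[("p", True), ("q", False), ("r", True)]]))
```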

Lemma 5.1. Let q be the query ∃x, x′, y.S1(x, y) ∧ S2(x′, y). Then, CONSISTENT(q, Σ) is

coNP-hard.

Proof. We will prove hardness by reduction from MONOTONE-3SAT. Let Φ = Φ1 ∧ · · · ∧ Φm

be a 3CNF formula such that each clause Φi contains either positive or negative atoms,

but not both. We shall build an instance I as follows:

• For each positive clause Φi and each atom z that occurs in Φi, we add a tuple

S1(i, z) to I.

• For each negative clause Φi and each atom z that occurs in Φi, we add a tuple

S2(i, z) to I.

We now show that consistentΣ(q, I) = false iff Φ is satisfiable.

(⇒) Since consistentΣ(q, I) = false, there exists a repair I of I such that I ⊭ q.

We now build a valuation v for the variables of Φ as follows. For each variable z, we let

v(z) = true if there is some i such that S1(i, z) ∈ I; and we let v(z) = false if there is

some i such that S2(i, z) ∈ I. It is easy to see that v is a truth valuation that satisfies

Φ.

(⇐) Assume that Φ is satisfiable. Let v be a truth assignment for the variables of Φ that satisfies Φ.

We shall build a repair I as follows. For each positive clause Φi, select a variable z that

appears in Φi and such that v(z) = true. Let S1(i, z) ∈ I. For each negative clause Φi,

select a variable z that appears in Φi and such that v(z) = false. Let S2(i, z) ∈ I. It is

easy to see that I ⊭ q.
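The reduction can be checked by brute force on small formulas. The Python sketch below builds the instance from a monotone formula, enumerates all repairs (the first attribute of S1 and S2 is taken as the key), and compares the consistent answer with satisfiability. The encoding of a clause as a (sign, atoms) pair and all function names are assumptions made for this illustration, and clauses of any width are accepted for brevity.

```python
from itertools import product


def build_instance(clauses):
    """clauses: list of (sign, atoms), sign '+' for positive, '-' for negative.
    Returns (S1, S2) as lists of (clause_index, atom) tuples."""
    s1 = [(i, z) for i, (sign, atoms) in enumerate(clauses) if sign == "+" for z in atoms]
    s2 = [(i, z) for i, (sign, atoms) in enumerate(clauses) if sign == "-" for z in atoms]
    return s1, s2


def repairs(relation):
    """All repairs of a relation whose key is the first attribute."""
    by_key = {}
    for tup in relation:
        by_key.setdefault(tup[0], []).append(tup)
    return [list(choice) for choice in product(*by_key.values())]


def consistent_answer(s1, s2):
    """True iff every repair satisfies q = exists x, x', y. S1(x, y) and S2(x', y)."""
    for r1 in repairs(s1):
        for r2 in repairs(s2):
            if not ({y for _, y in r1} & {y for _, y in r2}):
                return False  # this repair falsifies q
    return True


def satisfiable(clauses):
    """Brute-force satisfiability test for a monotone formula."""
    atoms = sorted({z for _, zs in clauses for z in zs})
    for bits in product([True, False], repeat=len(atoms)):
        v = dict(zip(atoms, bits))
        if all(any(v[z] == (sign == "+") for z in zs) for sign, zs in clauses):
            return True
    return False


phi_sat = [("+", ["p", "q", "r"]), ("-", ["p", "q", "s"])]
phi_unsat = [("+", ["p"]), ("-", ["p"])]
for phi in (phi_sat, phi_unsat):
    print(satisfiable(phi), consistent_answer(*build_instance(phi)))
```

On both formulas the two printed values are opposite, as the lemma's proof requires: the consistent answer is false exactly when the formula is satisfiable.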

Now, we show the intractability of the problem for a conjunctive query whose join

graph is a forest, but the query has nonkey-to-key joins that are not full. In particular,

we focus on the following query:


∃x, x′, w, w′, z, z′, m.R1(x, w) ∧ R2(m, w, z) ∧ R3(x′, w′) ∧ R4(m, w′, z′)

We prove hardness by showing a reduction from the problem of computing the consistent answers for the query qnk shown to be coNP-hard in Lemma 5.1.

Lemma 5.2. Let q be the query ∃x, x′, w, w′, z, z′, m.R1(x, w) ∧ R2(m, w, z) ∧ R3(x′, w′) ∧ R4(m, w′, z′). Let q′ be the query ∃x, x′, y.S1(x, y) ∧ S2(x′, y). Then, there is a polynomial time reduction from the problem CONSISTENT(q′, Σ′) to the problem CONSISTENT(q, Σ).

Proof. Let I ′ be an instance over the schema of q′. We shall build an instance I over the

schema of q as follows:

Initialize I as the empty instance
for each tuple S1(c1, d1) ∈ I′ do
  Add R1(c1, d1) to I
end for
for each tuple S2(c2, d2) ∈ I′ do
  Add R3(c2, d2) to I
end for
Let cz, cz′ be some constants
for each valuation νq′ such that I′ |= S1(x, y) ∧ S2(x′, y)[νq′] do
  Let νq(x) = νq′(x)
  Let νq(x′) = νq′(x′)
  Let νq(w) = νq′(y)
  Let νq(w′) = νq′(y)
  Let cm be a newly-created constant
  Let νq(m) = cm
  Let νq(z) = cz
  Let νq(z′) = cz′
  Add tuple R2(m, w, z)[νq] to I
  Add tuple R4(m, w′, z′)[νq] to I
end for
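The construction of I can be sketched in Python as follows. Relations are represented as lists of tuples without duplicates; the fresh constants are generated as strings "m0", "m1", and so on, and the default values of the two fixed constants are placeholders. These representation choices and the function name are assumptions made for this illustration.

```python
from itertools import count


def reduce_instance(s1, s2, cz="cz", cz_prime="cz'"):
    """Build (R1, R2, R3, R4) from an instance (S1, S2) of the schema of q'."""
    r1 = list(s1)  # copy S1 into R1
    r3 = list(s2)  # copy S2 into R3
    r2, r4 = [], []
    fresh = count()
    for x, y in s1:
        for x_prime, y_prime in s2:
            if y != y_prime:
                continue  # not a valuation satisfying S1(x, y) and S2(x', y)
            cm = f"m{next(fresh)}"  # newly created constant for this valuation
            r2.append((cm, y, cz))
            r4.append((cm, y, cz_prime))
    return r1, r2, r3, r4


# Hypothetical input instance.
S1 = [(1, "a"), (1, "b")]
S2 = [(2, "a"), (3, "a")]
R1, R2, R3, R4 = reduce_instance(S1, S2)
print(R2)
print(R4)
```

Because a fresh constant is created for every valuation, no two tuples of R2 (or of R4) share their first attribute, which is the property the proof relies on when it observes that every tuple of R2 and R4 is given a distinct key value.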

We claim that consistentΣ(q′, I ′) = true iff consistentΣ(q, I) = true.

(⇒) Let I be a repair of I. We shall build an instance I ′ as follows:


for each tuple R1(c1, d1) of I do
  Add a tuple S1(c1, d1) to I′
end for
for each tuple R3(c2, d2) of I do
  Add a tuple S2(c2, d2) to I′
end for

Notice that R1 and S1 (and, similarly, R3 and S2) have the same extensions in I and I ′,

respectively. Thus, since I is a repair of I, I ′ is a repair of I ′. Since consistentΣ(q′, I ′) =

true, I ′ |= q′. Thus, there is a valuation νq′ such that I ′ |= S1(x, y) ∧ S2(x′, y)[νq′ ]. Let

c1 = νq′(x), c2 = νq′(x′), d = νq′(y). Let cz and cz′ be the constants used in the algorithm

that constructs I. Let cm be the constant created in the algorithm for the iteration

corresponding to νq′ . Let νq be a valuation for the variables of q such that:

• νq(x) = c1

• νq(x′) = c2

• νq(w) = d

• νq(w′) = d

• νq(m) = cm

• νq(z) = cz

• νq(z′) = cz′

Since S1(c1, d) ∈ I ′, R1(c1, d) ∈ I. Since S2(c2, d) ∈ I ′, R3(c2, d) ∈ I. By Proposition

3.6, I ′ ⊆ I ′. Thus, S1(c1, d) ∈ I ′ and S2(c2, d) ∈ I ′. Since cm is the constant chosen in the

iteration for νq′ in the algorithm that constructs I, R2(cm, d, cz) ∈ I and R4(cm, d, cz′) ∈ I.

By Proposition 3.7, R2(cm, d, e) ∈ I and R4(cm, d, e′) ∈ I, for some e, e′. Thus, I |= q[νq].

(⇐) Let I ′ be a repair of I ′. We shall build an instance I as follows.

for each tuple S1(c1, d1) ∈ I′ do
  Add R1(c1, d1) to I
end for
for each tuple S2(c2, d2) ∈ I′ do
  Add R3(c2, d2) to I
end for
for each tuple R2(c1, c2, d) ∈ I do
  Add R2(c1, c2, d) to I
end for
for each tuple R4(c1, c2, d) ∈ I do
  Add R4(c1, c2, d) to I
end for

We now show that I is a repair of I. First, notice that R1 and S1 (and, similarly, R3

and S2) have the same extensions in I and I ′, respectively. Second, in the construction

of I, every tuple of R2 and R4 is given a distinct key value. Then, by Propositions 3.6

and 3.7, every tuple in the extension of R2 in I is in the extension of R2 in I; and every

tuple in the extension of R4 in I is in the extension of R4 in I.

Since consistentΣ(q, I) = true, I |= q. Thus, there exists some valuation νq such

that I |= R1(x,w) ∧ R2(m,w, z) ∧ R3(x′, w′) ∧ R4(m,w′, z′)[νq]. By construction of I, if

R2 and R4 join on m, then νq(w) = νq(w′). Let νq′ be such that:

• νq′(x) = νq(x)

• νq′(x′) = νq(x′)

• νq′(y) = νq(w) = νq(w′)

It is easy to see that I ′ |= S1(x, y) ∧ S2(x′, y)[νq′ ]. Thus, I ′ |= q′.

5.2 A Dichotomy Result

5.2.1 The Class C∗

In Chapter 3, we presented a query rewriting algorithm which works on a class of queries

that we call Cforest. Clearly, Cforest gives sufficient conditions for a query to be first-

order rewritable. In this section, we address the following question: for which class of

queries does Cforest also give necessary conditions? That is, we show a class of queries

such that the problem of computing the consistent answers is coNP-complete for every

query of the class which does not satisfy the conditions of Cforest. Notice that this


establishes a dichotomy between first-order rewritability and coNP-completeness, and

is therefore much stronger than the complexity results that we presented in Section

5.1 (and, in fact, all the complexity results present in the consistent query answering

literature [CLR03a, CM05]). In the literature, a class C is said to be coNP-hard if there

is at least one query q ∈ C such that CONSISTENT(q, Σ) is a coNP-hard problem. Under

such a definition, it suffices to exhibit just one intractable query in order to conclude

that the entire class is coNP-complete. In contrast, in this section we will present a class

of queries such that for every query q in the class, CONSISTENT(q, Σ) is coNP-complete.

We will focus on conjunctive queries without repeated relation symbols and all of

whose nonkey-to-key joins are full. Within this class, there are some queries for which

the existence of a cycle is not a sufficient condition for intractability. Consider, for

example, the query q = ∃x, y.R1(x, y) ∧ R2(x, y). The join graph of this query is not a

forest; yet, it can be rewritten as follows:

∃x, y.R1(x, y) ∧ R2(x, y) ∧ ∀y′.(R1(x, y′) → y′ = y) ∧ ∀y′.(R2(x, y′) → y′ = y)
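The rewriting above can be compared against the definition of consistent answers on small instances. The Python sketch below evaluates the rewritten formula directly and, separately, checks every pair of repairs by brute force; the first attribute of R1 and R2 is taken as the key, and the function names and sample data are assumptions made for this illustration.

```python
from itertools import product


def repairs(relation):
    """All repairs of a relation whose key is the first attribute."""
    by_key = {}
    for tup in relation:
        by_key.setdefault(tup[0], []).append(tup)
    return [set(choice) for choice in product(*by_key.values())]


def consistent_brute_force(r1, r2):
    """True iff every repair contains some (x, y) in both R1 and R2."""
    return all(rep1 & rep2 for rep1 in repairs(r1) for rep2 in repairs(r2))


def consistent_rewritten(r1, r2):
    """Evaluate the first-order rewriting on the (possibly inconsistent) instance."""

    def unique(rel, x, y):
        # forall y'. (rel(x, y') -> y' = y)
        return all(y2 == y for x2, y2 in rel if x2 == x)

    return any(
        (x, y) in r2 and unique(r1, x, y) and unique(r2, x, y)
        for x, y in r1
    )


# Key 1 is clean in both relations, so the answer is true in every repair.
a1, a2 = [(1, "a"), (2, "b"), (2, "c")], [(1, "a"), (2, "b")]
# Key 1 is in conflict in R1, so some repair falsifies the query.
b1, b2 = [(1, "a"), (1, "b")], [(1, "a")]
for r1, r2 in ((a1, a2), (b1, b2)):
    print(consistent_brute_force(r1, r2), consistent_rewritten(r1, r2))
```

The rewritten formula is evaluated on the original instance only, without enumerating repairs, which is the point of a first-order rewriting.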

Recall that the problem of computing consistent answers is intractable for the query

qnk = ∃x, x′, y.R1(x, y) ∧ R2(x′, y). Notice that qnk and q have exactly the same join graph.

The only difference between them is that in qnk, the two literals are related exclusively

by a nonkey-to-nonkey join; whereas in q, they are related by both a key-to-key and a

nonkey-to-nonkey join. Our intuition is that a query with a cyclic join graph may be

tractable only if there are literals related by more than one type of join (e.g., nonkey-

to-nonkey and key-to-key). We formalize this intuition with the definition of a class C∗, which essentially “separates” the different types of joins of the query. In C∗, every pair of literals can be related by at most one type of join (i.e., key-to-key, nonkey-to-nonkey,

and nonkey-to-key).

Definition 5.3. Let q be a conjunctive query without repeated relation symbols and all

of whose nonkey-to-key joins are full. We say that q is in class C∗ if for every pair R

and R′ of literals of q at most one of the following conditions holds:

• there is a key-to-key join between R and R′.

• there is a nonkey-to-nonkey join between R and R′.

• there are literals R1 . . . Rm in q such that there is a nonkey-to-key join from R to

R1, from Rm to R′, and from Ri to Ri+1, for every i such that 1 ≤ i < m.


Notice that C∗ is a fairly broad class of queries. For example, it includes the class

of queries that have exclusively nonkey-to-key joins. In general, the only queries that

are outside C∗ are the ones that have a pair of literals related by more than one type of

join. As anecdotal evidence of the practicality of the class, the only query in the TPC-H
benchmark [TPC03] that has nonkey-to-nonkey joins (Query 5) is in C∗. From the results
of this chapter, we can immediately conclude that the problem of computing consistent
answers for this query is not first-order rewritable, unless P = NP.

We will consider a class, called Chard, of all queries of C∗ that are not in Cforest. The

main result of this chapter, Theorem 5.5, proves that the problem of computing the

consistent answers for every query of Chard is coNP-complete.

Definition 5.4. We say that a query q is in class Chard if q ∈ C∗ and q ∉ Cforest.

Theorem 5.5. Let q be a query such that q ∈ Chard. Then, CONSISTENT(q, Σ) is coNP-

complete in data complexity.

Our motivation to provide a dichotomy for C∗ is the following. First, for a fairly broad

class of queries we can test in polynomial time if the problem of computing consistent

answers is tractable. Second, our results are an initial step towards proving a dichotomy

for the larger class of all conjunctive queries. Indeed, as a result of our work, future

efforts for finding dichotomy results for conjunctive queries need to focus only on queries

whose literals are related by more than one type of join.1

In general, by Ladner’s Theorem [Lad75], there are classes of coNP problems for

which there is no dichotomy between P and coNP-complete problems. However, this

is not the case for the class of queries that is the focus of this section. In fact, as a

corollary of Theorems 3.5 and 5.5, we get a dichotomy between membership in P and

coNP-completeness. Notice that, given a query q such that q ∈ C∗, it can be decided in

polynomial time on which side of the dichotomy the query q falls.

Corollary 5.6. Let q be a query such that q ∈ C∗. Then, CONSISTENT(q, Σ) is either in
P or coNP-complete.

Under a complexity-theoretic assumption, we also get a dichotomy between first-order

rewritability and first-order inexpressibility for the class C∗. That is, for all the queries

of C∗ that are not in Chard, we can produce a first-order rewriting using our algorithm

1 Since C∗ intersects, but does not contain, Cforest, we know that there are queries outside C∗ for which the problem of computing consistent answers is tractable.


RewriteForest. For the queries of Chard, since the problem of obtaining consistent an-

swers is coNP-complete, there is no first-order rewriting, unless P=NP (which is unlikely).

Corollary 5.7. Let q be a query such that q ∈ C∗. Assuming P ≠ NP, the problem

CONSISTENT(q, Σ) is first-order rewritable iff q ∈ Cforest.

Tractable but not First-Order Rewritable Queries

An interesting question is whether there are queries for which the problem of computing

consistent answers is tractable, yet not first-order rewritable. Although this remains

open for conjunctive queries without inequalities, we now show that there are tractable

conjunctive queries with inequalities that are not first-order rewritable.

Consider a schema with one binary relation R(E, S). Assume that E is the key of

the relation. Consider the following query q:

q = ∃e1, e2, s : R(e1, s) ∧ R(e2, s) ∧ e1 ≠ e2

In order to find the consistent answers for q, we construct a graph of the inconsistent

database instance as follows.2 Let I be a database instance with one binary relation

R(E, S). The graph G of I is a bipartite graph with partitions E and S. Partitions

E and S have one vertex for each value in the active domain of attributes E and S,

respectively. The set of edges of G consists of all tuples (e, s) of R.

We use the graph of I to introduce the following necessary and sufficient condition

for consistentΣ(q, I) = false.

Lemma 5.8. Let I be a database with one binary relation R(E, S), possibly inconsistent

wrt a functional dependency Σ = {E → S}. Then, consistentΣ(q, I) = false iff the

graph G of I has a perfect matching.

Proof. ⇐ Assume that G has a perfect matching M. We can build an instance J by
creating a tuple in J for each edge in M. Since M is a matching, each vertex of G is
incident to at most one edge of M. In particular, no key value appears in two tuples of J,
so J satisfies Σ; and no value of S appears in two tuples of J, so J ⊭ q. Also, since the
matching is perfect, every key of I appears in J. Consequently, J is a maximal consistent
subset of I, and therefore it is a repair of I wrt Σ.

2 Notice that unlike the join graph of a query, this graph is constructed from a database instance, not a query.


⇒ Assume that consistentΣ(q, I) = false. Then, there must exist a repair J of I
wrt Σ such that J ⊭ q. We can construct a graph G′ by selecting the edges of G that
correspond to tuples of J. Since J is a repair, every key value is incident to exactly one
edge of G′; since J ⊭ q, no two edges of G′ share a vertex of S. Hence, G′ is a perfect
matching of G.

There are a number of algorithms in the literature for deciding the existence of a
perfect bipartite matching. For example, one of the best known is given by Hopcroft and
Karp [HK75], and runs in O(n^2.5) time. Therefore, q is a tractable query. We now show
that no approach based on first-order query rewriting works for q.
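The matching criterion of Lemma 5.8 can be sketched in a few lines. This is an illustrative sketch, not the procedure used in the thesis: it reads "perfect matching" as a matching that covers every key value in E (an assumption on our part), it uses a simple augmenting-path algorithm rather than Hopcroft and Karp's, and the function names are hypothetical. A brute-force enumeration of repairs is included only as a reference check.

```python
from itertools import product

def has_key_saturating_matching(tuples):
    """Augmenting-path matching on the bipartite graph of R(E, S).

    Returns True iff every key value can be matched to a distinct S value.
    """
    adj = {}
    for e, s in tuples:
        adj.setdefault(e, []).append(s)
    match = {}  # S value -> key value currently matched to it

    def augment(e, seen):
        for s in adj[e]:
            if s in seen:
                continue
            seen.add(s)
            if s not in match or augment(match[s], seen):
                match[s] = e
                return True
        return False

    return all(augment(e, set()) for e in adj)

def consistent_answer(tuples):
    """True iff q holds in every repair, using the matching criterion."""
    return not has_key_saturating_matching(tuples)

def consistent_answer_bruteforce(tuples):
    """Reference check: enumerate every repair (one tuple per key value)."""
    groups = {}
    for e, s in tuples:
        groups.setdefault(e, []).append(s)
    keys = list(groups)
    for choice in product(*(groups[k] for k in keys)):
        if len(set(choice)) == len(choice):  # no two keys share an S value
            return False                     # this repair falsifies q
    return True

# Key e1 has two conflicting tuples; choosing (e1, b) avoids sharing a with e2.
r1 = [("e1", "a"), ("e1", "b"), ("e2", "a")]
# Both keys are forced to share the value a, so q holds in the only repair.
r2 = [("e1", "a"), ("e2", "a")]
print(consistent_answer(r1), consistent_answer(r2))
```

The brute-force variant is exponential in the number of conflicting keys, which is why the polynomial-time matching test is the interesting one.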

Theorem 5.9. There is no first-order rewriting Q of q such that consistentΣ(q, I) =

Q(I) for every instance I.

Proof. Let A1, . . . , An be a family of sets. A system of distinct representatives
[Ost70] of A1, . . . , An is a sequence of n distinct elements a1, . . . , an with
ai ∈ Ai, 1 ≤ i ≤ n. Let R be a binary relation that encodes A1, . . . , An as follows:
R(i, x) iff x ∈ Ai. Let I be the database instance that consists of relation R, and let G
be the graph of I as constructed above. Clearly, G has a perfect matching iff
A1, . . . , An has a system of distinct representatives. By Lemma 5.8,
consistentΣ(q, I) = false iff G has a perfect matching.

Assume that there is a first-order query Q such that I ⊭ Q iff consistentΣ(q, I) = false.
Then, Q can test whether A1, . . . , An has a system of distinct representatives. But it is
known in the literature [LW95] that relational algebra, with an appropriate encoding of
sets, cannot test whether a family of sets has a system of distinct representatives;
contradiction.

5.2.2 Basic Intractable Cases

The intractability of all queries in Chard will be shown as follows. First, we show in

Lemma 5.10 that the problem of computing consistent answers for conjunctive queries

is in coNP. This is a result known in the literature, but we briefly give a proof for our

setting. For hardness, we will use a reduction from the problem of computing consistent

answers for one of two particular queries to the problem of computing consistent answers

for q. One of these specific queries is the query qnk = ∃x, x′, y.S1(x, y) ∧ S2(x′, y). This

query has a nonkey-to-nonkey join, and was shown to be intractable in Lemma 5.1. The

other query has a cycle of nonkey-to-key joins, and is shown to be intractable in Lemma

5.11.


The next lemma shows that the problem of computing consistent answers for con-

junctive queries is in coNP.

Lemma 5.10. Let q be a conjunctive query. The problem CONSISTENT(q, Σ) is in coNP.

Proof. Let I be an instance. In order to decide whether ~t ∉ consistentΣ(q, I), it suffices
to exhibit a repair J of I such that J ⊭ q[~t]. The size of J is polynomially bounded by the
size of I; in particular, by Proposition 3.6, J ⊆ I. Furthermore, J ⊭ q[~t] can be checked
in polynomial time in the size of the instance, since q is a fixed conjunctive query.
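The certificate check in this proof can be sketched as follows. The sketch assumes a single binary relation R(E, S) with key E, encodes instances as Python sets of pairs, and represents the query as a callable; these choices and all names are illustrative assumptions, not part of the text.

```python
def is_repair(instance, candidate):
    """Check that candidate is a repair of instance w.r.t. the key E of R(E, S).

    Both arguments are sets of (key, value) pairs.
    """
    if not candidate <= instance:            # repairs are subsets of the instance
        return False
    keys = [k for k, _ in candidate]
    if len(keys) != len(set(keys)):          # key constraint violated
        return False
    return {k for k, _ in instance} == set(keys)  # maximal: every key is kept

def refutes(instance, candidate, query):
    """A certificate is valid if it is a repair on which the query is false."""
    return is_repair(instance, candidate) and not query(candidate)

def q(rel):
    # Example Boolean conjunctive query: exists e such that R(e, 'a').
    return any(s == "a" for _, s in rel)

instance = {(1, "a"), (1, "b"), (2, "c")}
print(refutes(instance, {(1, "b"), (2, "c")}, q))  # a repair that falsifies q
print(refutes(instance, {(1, "a"), (2, "c")}, q))  # a repair that satisfies q
```

Both checks run in time polynomial in the size of the instance, which is the point of the membership argument.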

In the next lemma, we show the coNP hardness of computing consistent answers for

one of the two particular queries that will be used in Lemma 5.14. The coNP hardness

of the other query was proven in Lemma 5.1.

Lemma 5.11. Let q = ∃x, y.T1(x, y) ∧ T2(y, x). Then, the problem CONSISTENT(q, Σ) is

coNP-hard.

Proof. We will prove hardness by reduction from MONOTONE-3SAT. Let Φ = Φ1∧ · · · ∧Φm

be a monotone 3CNF formula. We shall build an instance I as follows:

• For each atom z, let Φi1 , . . . , Φin be the positive clauses where z occurs. Add tuples

T1(< Φi1 , . . . , Φin >, z) and T2(z, < Φi1 , . . . , Φin >) to I.

• For each atom z, let Φi1 , . . . , Φin be the negative clauses where z occurs. Add tuples

T1(< Φi1 , . . . , Φin >, z) and T2(z, < Φi1 , . . . , Φin >) to I.

We now show that consistentΣ(q, I) = false iff Φ is satisfiable.

(⇒) Since consistentΣ(q, I) = false, there exists a repair J of I such that J ⊭ q.
Assume towards a contradiction that there are tuples T1(c, z) ∈ J and T1(c′, z) ∈ J such
that c ≠ c′. By construction of I, if T2(z, d) ∈ I, then d = c or d = c′. By Propositions
3.6 and 3.7, either T2(z, c) ∈ J or T2(z, c′) ∈ J. Thus, J |= q; contradiction.

We now build a valuation v for the variables of Φ as follows. For each variable z,
we let v(z) = true if there is some c such that T1(c, z) ∈ J and c is a list of positive
clauses; and we let v(z) = false if there is some c such that T1(c, z) ∈ J and c is a list
of negative clauses. By the argument above, these two cases are mutually exclusive, so v
is well defined. It is easy to see that v is a truth valuation that satisfies Φ.

(⇐) Assume that Φ is satisfiable. Let v be a truth assignment for the variables of Φ
that satisfies Φ. We shall build a repair J as follows. For each positive clause Φi, select a
variable z that appears in Φi and such that v(z) = true. Add T1(c, z) to J, where c is a
list of positive clauses. For each negative clause Φi, select a variable z that appears in Φi
and such that v(z) = false. Add T1(c, z) to J, where c is a list of negative clauses. For
each variable z, if v(z) = false, add T2(z, c) to J, where c is a list of positive clauses; if
v(z) = true, add T2(z, c) to J, where c is a list of negative clauses. It is easy to see that
J ⊭ q.

We now give some auxiliary results before proving Lemma 5.14. The next lemma

generalizes Lemma 5.11 from cycles of length two to the case of cycles of arbitrary length.

Lemma 5.12. Let q be the query ∃w1, . . . , wm.S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm).
Let q′ = ∃x, y.T1(x, y) ∧ T2(y, x). Then, there is a polynomial-time reduction from the
problem CONSISTENT(q′, Σ′) to the problem CONSISTENT(q, Σ).

Proof. Let I ′ be an instance over the schema of q′. We shall build an instance I over the

schema of q as follows:

for each valuation νq′ for the variables of q′ such that I ′ |= T1(x, y) ∧ T2(y, x)[νq′ ] do

Let νq(wm) = νq′(x)

Let νq(w1) = νq′(y)

Create a new constant cnew

for i := 2 to m− 1 do

Let νq(wi) = cnew

end for

Add the tuples of S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm)[νq] to I

end for
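The construction above can be made concrete with a short sketch. It assumes that T1 and T2 are given as Python sets of pairs and that fresh constants are represented as tagged tuples; the function name and these encodings are hypothetical and serve only to illustrate the loop.

```python
from itertools import count

def build_cycle_instance(t1, t2, m):
    """Build I over S1..Sm from I' over T1, T2, following the loop above.

    t1, t2: sets of pairs for T1 and T2; m >= 2 is the cycle length.
    Returns a dict mapping the relation index i (1-based) to a set of pairs.
    """
    fresh = count()
    instance = {i: set() for i in range(1, m + 1)}
    for (x, y) in sorted(t1):
        if (y, x) not in t2:
            continue                        # valuation must satisfy T1(x, y) and T2(y, x)
        c_new = ("new", next(fresh))        # fresh constant for this valuation
        w = {m: x, 1: y}                    # w_m takes the value of x, w_1 that of y
        for i in range(2, m):
            w[i] = c_new                    # w_2 .. w_{m-1} all map to c_new
        instance[1].add((w[m], w[1]))       # S1(w_m, w_1)
        for i in range(2, m + 1):
            instance[i].add((w[i - 1], w[i]))  # Si(w_{i-1}, w_i)
    return instance

inst = build_cycle_instance({("a", "b"), ("c", "d")}, {("b", "a")}, 4)
for i in sorted(inst):
    print(i, sorted(inst[i], key=repr))
```

Only the pair (a, b) contributes tuples here, because T2(d, c) is absent and so the valuation for (c, d) does not satisfy the body of q′.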

We claim that consistentΣ(q′, I ′) = true iff consistentΣ(q, I) = true.

(⇒) Let J be a repair of I over the schema of q. We shall build a repair J′ over the
schema of q′ as follows:

for each tuple S1(cm, c1) of J do
  Add a tuple T1(cm, c1) to J′
  for each cnew such that S2(c1, cnew) ∈ J and Sm(cnew, cm) ∈ J do
    Add a tuple T2(c1, cm) to J′
  end for
end for

Since consistentΣ(q′, I′) = true, J′ |= q′. Thus, there is a valuation νq′ such that
J′ |= T1(x, y) ∧ T2(y, x)[νq′]. Let cm = νq′(x), c1 = νq′(y). Since T2(c1, cm) ∈ J′, there
exists cnew such that S2(c1, cnew) ∈ J and Sm(cnew, cm) ∈ J. Let νq be a valuation for the
variables of q such that:

• νq(wm) = cm

• νq(w1) = c1

• νq(wi) = cnew, for 1 < i < m

Since T1(cm, c1) ∈ J′, S1(cm, c1) ∈ J. By construction of νq, S2(c1, cnew) ∈ J and
Sm(cnew, cm) ∈ J. For 2 < i ≤ m, notice that by construction of I, there are no tuples
Si(ci, di) and Si(ci, d′i) in I such that di ≠ d′i. Therefore, by Propositions 3.6 and 3.7,
every tuple in the extension of Si in I appears in the extension of Si in J. By construction
of I, Si(cnew, cnew) ∈ I, for 3 ≤ i ≤ m − 1. Thus, Si(cnew, cnew) ∈ J. We conclude that
J |= S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm)[νq]. Thus, J |= q.

(⇐) Let J′ be a repair of I′. We shall build an instance J as follows.

for each tuple T1(cm, c1) of J′ do
  Add a tuple S1(cm, c1) to J
  Let cnew be a constant such that S2(c1, cnew) ∈ I and Sm(cnew, cm) ∈ I
  Add a tuple S2(c1, cnew) to J
  for i := 3 to m − 1 do
    Add a tuple Si(cnew, cnew) to J
  end for
  Add a tuple Sm(cnew, cm) to J
end for

It is easy to see that J is a repair of I. Since consistentΣ(q, I) = true, J |= q.
Thus, there exists some valuation νq such that
J |= S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm)[νq]. Let νq′ be such that:

• νq′(x) = νq(wm)

• νq′(y) = νq(w1)

It is easy to see that J′ |= T1(x, y) ∧ T2(y, x)[νq′]. Thus, J′ |= q′.


5.2.3 Generalizing the Basic Cases

Our strategy for proving the dichotomy will be to show that if q has a subquery q′ that

is known to be intractable (in particular, a cycle), then q is not tractable. This does not

hold in general, but as we show with the next auxiliary result, it holds for the queries in

C∗.

Lemma 5.13. Let q be a Boolean query such that q ∈ C∗. Let R1(~x1, ~y1), . . . ,

Rn(~xn, ~yn) be the literals of q. Let q′ be a Boolean query. Let S1(x1, y1), . . . ,

Sm(xm, ym) be the literals of q′, where m ≤ n. Assume that the join graph of q′ is a cycle.

Let L = {x1, y1, . . . , xm, ym}. Assume that:

• xi occurs in ~xi, for 1 ≤ i ≤ m, and

• yi occurs in ~yi, for 1 ≤ i ≤ m, and

• for 1 ≤ i ≤ m, if w ∈ L and w occurs in Ri, then w occurs in Si.

Then, there is a polynomial-time reduction from the problem CONSISTENT(q′, Σ′) to

CONSISTENT(q, Σ).

Proof. Let F = {w : w occurs in Ri, and 1 ≤ i ≤ m}−L. Let U = {w : w occurs in q}−F − L.

Let I ′ be an instance over the schema of q′. We shall build an instance I over the

schema of q as follows:

for each variable w such that w ∈ F do

Create a new constant cnew

Let νF (w) = cnew

end for

for each valuation νq′ for the variables of q′ such that I ′ |= S1(x1, y1)∧· · ·∧Sm(xm, ym)[νq′ ]

do

for each variable w such that w ∈ F do

Let νq(w) = νF (w)

end for

for each variable w such that w ∈ U do

Create a new constant cnew

Let νq(w) = cnew


end for

for i := 1 to m do

Let νq(xi) = νq′(xi)

Let νq(yi) = νq′(yi)

end for

Add the tuples of R1(~x1, ~y1) ∧ · · · ∧Rn(~xn, ~yn)[νq] to I

end for

We claim that consistentΣ(q′, I ′) = true iff consistentΣ(q, I) = true.

(⇒) Let J be a repair of I over the schema of q. We shall build an instance J′ over
the schema of q′ as follows.

for i := 1 to m do
  for each tuple Ri(~ci, ~di) of J do
    Let ci be the constant that appears in ~ci at the position of one of the occurrences of xi in ~xi
    Let di be the constant that appears in ~di at the position of yi in ~yi
    Add Si(ci, di) to J′
  end for
end for

We make the following observations with respect to the construction of J′. By
construction of I, if Ri(~ci, ~di) ∈ I, the same constant appears in ~ci at all the positions
where xi appears in ~xi. By Proposition 3.6, J ⊆ I. Thus, in the construction of J′, it
suffices to choose the constant that occurs in ~ci at any of the positions where xi occurs
in ~xi.

Assume that J′ is not a repair of I′. Then, there are constants ci, di and d′i such
that di ≠ d′i, Si(ci, di) ∈ J′ and Si(ci, d′i) ∈ J′. By construction of J′, there are tuples
Ri(~ci, ~di) ∈ J and Ri(~c′i, ~d′i) ∈ J such that ci appears in ~ci and ~c′i at all the positions
where xi appears in ~xi; and di and d′i appear in ~di and ~d′i, respectively, at the position
of yi in ~yi. Clearly, ~di ≠ ~d′i. By construction of I, if w is a variable such that w ∉ L,
w is assigned the value νF(w) in every tuple of I. By Proposition 3.6, J ⊆ I. Thus,
~ci = ~c′i. Since ~di ≠ ~d′i, J does not satisfy the key constraints of Σ. Thus, J is not a
repair; contradiction. We conclude that J′ is a repair of I′.

Since consistentΣ(q′, I′) = true, J′ |= q′. Thus, there is some valuation νq′ such
that J′ |= S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′]. Let νm be a valuation for the variables of
R1, . . . , Rm such that:

• νm(xi) = νq′(xi), for 1 ≤ i ≤ m

• νm(yi) = νq′(yi), for 1 ≤ i ≤ m

• νm(w) = νF(w) if w ∈ F

Let w be a variable that appears in Ri, for 1 ≤ i ≤ m. If w ∈ L and w occurs in
Ri, by hypothesis, w occurs in Si. If w ∉ L, then w ∈ F, by definition of F. Since
J′ |= S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′], and νm(w) = νF(w) if w ∈ F, we conclude that
J |= R1(~x1, ~y1) ∧ · · · ∧ Rm(~xm, ~ym)[νm].

By construction of I, there is a valuation νq for the variables of q such that:

• νm(w) = νq(w) if w appears in Ri, for 1 ≤ i ≤ m; and

• I |= R1(~x1, ~y1) ∧ · · · ∧ Rn(~xn, ~yn)[νq].

Let Ri(~xi, ~yi) be a literal of q such that i > m. Notice that we assume that the join
graph of q′ is a cycle. Since q is in C∗, there exists some variable w such that w occurs in
~xi and w does not occur in any of R1, . . . , Rm. Thus, w ∈ U. Since the variables of U are
assigned a distinct constant in every iteration of the algorithm that constructs I, if two
tuples Ri(~ci, ~di) and Ri(~c′i, ~d′i) are added at different iterations, then ~ci ≠ ~c′i. Therefore,
by Propositions 3.6 and 3.7, every tuple in the extension of Ri in I is in the extension of
Ri in J. Therefore, J |= R1(~x1, ~y1) ∧ · · · ∧ Rn(~xn, ~yn)[νq].

(⇐) Let J′ be a repair of I′. We shall build an instance J as follows.

for i := 1 to m do
  for each tuple Si(ci, di) of J′ do
    Let Ri(~ci, ~di) be a tuple of I such that ci appears in ~ci at all the positions of xi in ~xi, and di appears in ~di at the position of yi in ~yi
    Add Ri(~ci, ~di) to J
  end for
end for
for i := m + 1 to n do
  for each tuple Ri(~ci, ~di) in I do
    Add Ri(~ci, ~di) to J
  end for
end for

We will now show that J is a repair of I. Towards a contradiction, assume that J is
not a repair of I. Then, there are values ~ci, ~di, and ~d′i such that ~di ≠ ~d′i, Ri(~ci, ~di) ∈ J,
and Ri(~ci, ~d′i) ∈ J.

First, assume that 1 ≤ i ≤ m. For every variable w such that w ∉ L and w occurs
in Ri, w ∈ F. Thus, w is assigned the same constant νF(w) in every tuple of I. By
construction, J ⊆ I. Therefore, there are constants ci, di and d′i such that di ≠ d′i, ci
appears in ~ci at the positions of xi in ~xi, and di and d′i appear in ~di and ~d′i, respectively,
at the position of yi in ~yi. By construction of J, there are tuples Si(ci, di) and Si(ci, d′i)
in J′. Since di ≠ d′i, J′ does not satisfy the key constraints of Σ′. Thus, J′ is not a repair;
contradiction.

Now, assume that m < i ≤ n. Notice that we assume that the join graph of q′ is
a cycle. Since q is in C∗, there exists some variable w such that w occurs in ~xi and
w does not occur in any of R1, . . . , Rm. Thus, w ∈ U. Since the variables of U are
assigned a different constant in every iteration of the algorithm that constructs I, if two
tuples Ri(~ci, ~di) and Ri(~c′i, ~d′i) are added at different iterations, then ~ci ≠ ~c′i. Therefore,
the extension of Ri in I satisfies the key dependencies of Σ. Thus, by construction of
J, the extension of Ri in J satisfies the key constraints of Σ. This contradicts the
assumption that J contains two tuples of Ri that violate the key.

We conclude that J is a repair of I. Since consistentΣ(q, I) = true, J |= q. Thus,
there exists some valuation νq such that J |= R1(~x1, ~y1) ∧ · · · ∧ Rn(~xn, ~yn)[νq]. Let νq′ be
a valuation for the variables of q′ such that, for 1 ≤ i ≤ m:

• νq′(xi) = νq(xi)

• νq′(yi) = νq(yi)

It is easy to see that J′ |= S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′]. Thus, J′ |= q′.

We are now ready to prove Lemma 5.14, which gives a polynomial-time reduction

from the problem of computing consistent answers for the queries of Lemmas 5.1 or 5.11

to every query in Chard. From this, Theorem 5.5 follows directly.


Lemma 5.14. Let q be a query such that q ∈ Chard. Then, there is a polynomial-time

reduction from CONSISTENT(q′, Σ′) to CONSISTENT(q, Σ), where q′ is one of the following

queries:

• ∃x, x′, y.S1(x, y) ∧ S2(x′, y)

• ∃x, y.T1(x, y) ∧ T2(y, x)

Proof. Let G be the join graph of q. Let G′ be an induced subgraph of G such that:

• G′ is connected, and

• G′ is not a tree, and

• if G′′ is a proper induced subgraph of G′, and G′′ is connected, then G′′ is a tree.

Let P = 〈R1, R2, R1〉 be a cycle of G′. Let R1(~x1, ~y1) and R2(~x2, ~y2) be the literals in

G′. Assume that there is some variable y such that y occurs in ~y1 and ~y2. By Definition

of C∗, there is no key-to-key join between R1 and R2. Therefore, there exists a variable

x such that x occurs in ~x1, and x does not occur in ~x2; and a variable x′ such that x′

occurs in ~x2 and x′ does not occur in ~x1. Let q′ = S1(x, y) ∧ S2(x′, y). By Lemma 5.13,

there is a polynomial-time reduction from CONSISTENT(q′, Σ′) to CONSISTENT(q, Σ).

Let P = 〈R1, . . . , Rm, R1〉 be a cycle of G′. Let R1(~x1, ~y1),. . . , Rm(~xm, ~ym) be the

literals of P. Let w1, w2, . . . , wm be variables such that wi occurs in ~yi and in R(i mod m)+1,
for every 1 ≤ i ≤ m. Assume that there is some wi such that 1 ≤ i ≤ m and wi occurs in
some literal Rj of q such that j ≠ i and j ≠ (i mod m)+1. Then {R1, . . . , Ri, Rj, . . . , R1}
is a cycle. Therefore G′ contains a proper induced subgraph G′′ such that G′′ is connected,
and G′′ is not a tree; contradiction. Let q′′ = S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm).

It can be checked that q and q′′ satisfy the conditions of Lemma 5.13. Consequently,

there is a polynomial-time reduction from CONSISTENT(q′′, Σ′′) to CONSISTENT(q, Σ). Let

q′ = ∃x, y.T1(x, y)∧T2(y, x). By Lemma 5.12, there is a polynomial-time reduction from

CONSISTENT(q′, Σ′) to CONSISTENT(q′′, Σ′′).

Finally, we give the proof for Theorem 5.5, the main result of this chapter.

Theorem 5.5. Let q be a query such that q ∈ Chard. Then, CONSISTENT(q, Σ) is coNP-

complete in data complexity.


Proof. By Lemma 5.10, CONSISTENT(q, Σ) is in coNP. In order to prove hardness, let q′

be one of the following queries:

• ∃x, x′, y.S1(x, y) ∧ S2(x′, y)

• ∃x, y.T1(x, y) ∧ T2(y, x)

By Lemma 5.14, there is a polynomial-time reduction from CONSISTENT(q′, Σ′) to

CONSISTENT(q, Σ). By Lemmas 5.1 and 5.11, CONSISTENT(q′, Σ′) is coNP-hard. Thus,

CONSISTENT(q, Σ) is coNP-hard.

5.3 Related Work

Chomicki and Marcinkowski [CM05] and Calì, Lembo and Rosati [CLR03a] thoroughly

study the decidability and complexity of consistent query answering for several classes

of queries and integrity constraints. In order to show intractability of a class, they

take the usual approach of exhibiting one query of the class for which the problem is

intractable. To the best of our knowledge, the result that we present in Section 5.2 is the

first dichotomy result in the area of consistent query answering.

Both Chomicki and Marcinkowski and Calì, Lembo and Rosati show that the problem

of obtaining consistent answers for conjunctive queries under primary key constraints is

coNP-complete. Chomicki and Marcinkowski also show an example of a query with just

one literal but two key dependencies for which the problem is coNP-complete. This gives

further support for our decision of considering exactly one key dependency per relation.

Calì, Lembo and Rosati show the undecidability of the problem of obtaining consistent
answers when the set of constraints contains primary keys and arbitrary inclusion

dependencies. They also show the problem becomes decidable for foreign key constraints

(it is coNP-complete). Chomicki and Marcinkowski study the same problem but under

a semantics where only tuple deletion is allowed (i.e., repairs are always subsets of the

inconsistent database). In this case, the problem is Π^p_2-complete, and becomes
coNP-complete if the inclusion dependencies are restricted to be acyclic.

Chapter 6

ConQuer: System Implementation and SQL Rewritings

In this chapter, we present ConQuer, a system for querying inconsistent databases.

We demonstrated this system at the International Conference on Very Large Databases

(VLDB) [FFM05b]. In Section 6.1, we describe the system implementation and a typical

scenario where it can be used. Then, in Sections 6.2 and 6.3, we present the SQL rewrit-

ings that are at the core of ConQuer’s approach. In Section 6.4, we show how, if desired,

ConQuer can process the database offline in order to improve the performance of the

queries. Finally, in Section 6.5, we review other systems that are related to ConQuer.

6.1 System Implementation

ConQuer is implemented in Java and follows a modular architecture. It consists of the

following components:

• Query Rewriting Module. It rewrites an input SQL query into another SQL

query that computes the consistent answers. The details of the rewritings are

presented in Sections 6.2 to 6.4. The SQL queries are parsed using javacc.

• Query Execution Engine. The rewritten queries are executed using IBM DB2

UDB Version 8.2. The connection with the database is done through JDBC.

• Conflict Resolution Module. Provides a tracing facility to find the data that

leads to differences between the answer to the original query and the consistent

answer. This module also permits a user to update the database to correct errors.


Chapter 6. ConQuer: System Implementation and SQL Rewritings 102

Figure 6.1: Interface for entering hypothetical primary key constraints in ConQuer

• User Interface. Query results are displayed using a Web-accessible interface that

is implemented in PHP.

We illustrate a typical use case of ConQuer on a database with information about

airports. The user first specifies a set of primary key constraints using the interface shown

in Figure 6.1. These are the constraints that should hold on a consistent database, but

may be violated by the actual database that is being queried. Notice that for the same

schema and database, there is the flexibility of running queries under different sets of

potentially violated primary key constraints. Then, the user writes a SQL query within

the interface. In Figure 6.2, we show a query where the user is asking for all the countries

that have airports located north of parallel 63N. The result of the query is shown in Figure

6.3. The consistent answers are shown in bold, and the “potential answers” (i.e., possible

answers that are not consistent answers) are shown in italics. For example, in this case

“Italy” is a potential answer.

While consistent answers are best suited for decision making, potential answers can be

used to understand the reasons why a database is inconsistent. In this case, the user could

click on “Italy” and obtain an explanation, which is shown in Figure 6.4. The explanation

is the lineage (or why-provenance) [BKT01, CW03] of the result, i.e., the tuples in the

database that contribute to the answer. According to the explanation, Italy is a potential

answer because it has one airport that appears as satisfying the query (parallel 63) in
one tuple, and violating it (parallel 45) in another. Notice that in the comment to the
query, the user wrote “select countries that are located north of Trondheim”. Trondheim
is a Norwegian city, and the user may have background knowledge telling that all Italian
cities are south of Norwegian cities. Thus, the user could use the explanation obtained
from ConQuer in order to remove the tuple for the Italian airport located on parallel 63.

Figure 6.2: Interface for entering queries in ConQuer

6.2 ConQuer Rewritings for Queries without Aggregation

In this section, we present the SQL rewritings produced by ConQuer for a class of Select-

Project-Join (SPJ) queries with set semantics. We delay the treatment of conjunctive

queries that return duplicates until the next section, where the number of duplicates

returned by the queries can be counted with the count(*) aggregate function. We first

give the query rewriting algorithm, and then we illustrate it with a number of examples.

6.2.1 Rewriting Algorithm

We now present a SQL rewriting algorithm for SPJ queries that are equivalent to a

conjunctive query in the class Cforest, introduced in Definition 3.4, which we repeat next.


Figure 6.3: Query results in ConQuer

Figure 6.4: Query explanation in ConQuer


Definition 3.4. Let q be a conjunctive query without repeated relation symbols and all of

whose nonkey-to-key joins are full. Let G be the join graph of q. We say that q ∈ Cforest

if G is a forest (i.e., every connected component of G is a tree).

The above definition requires three conditions on the conjunctive query. First, that

the query has no repeated relation symbols. For an SPJ SQL query, this means that each
relation can be used at most once in the from clause. Second, that all its nonkey-to-key

joins must be full. For an SPJ query, this means that if an attribute of a key of a relation

r1 is equated in the where clause with a nonkey attribute of another relation r2, then all

the attributes of the key of r1 are equated to nonkey attributes of r2. Finally, the join

graph of q must be a forest. The notion of a join graph is introduced in Definition 3.1,

and we repeat it next.

Definition 3.1 (join graph). Let q be a conjunctive query. The join graph G of q is a

directed graph such that:

• the vertices of G are the literals of q;

• there is an arc from literal Ri to literal Rj if i ≠ j, and there is some variable w

such that w is existentially-quantified in q, w occurs at the position of a nonkey

attribute in Ri, and w occurs in Rj.
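Definition 3.1 and the forest condition of Definition 3.4 can be sketched as follows. The encoding of a literal as a (name, key variables, nonkey variables) triple is a hypothetical one, and the forest test reads "forest" as a directed forest in which every vertex has at most one incoming arc and there are no cycles; both are assumptions made for this illustration rather than statements of the text.

```python
def join_graph(literals, existential):
    """Arcs (i, j) of the join graph, following the two conditions above.

    literals: list of (name, key_vars, nonkey_vars) triples;
    existential: set of existentially quantified variables.
    """
    arcs = set()
    for i, (_, _, nonkey_i) in enumerate(literals):
        for j, (_, key_j, nonkey_j) in enumerate(literals):
            if i == j:
                continue
            # An existential variable at a nonkey position of Ri that occurs in Rj.
            shared = set(nonkey_i) & existential & (set(key_j) | set(nonkey_j))
            if shared:
                arcs.add((i, j))
    return arcs

def is_forest(n, arcs):
    """Directed forest: every vertex has at most one parent and no cycles."""
    parent = {}
    for i, j in arcs:
        if j in parent:
            return False        # two incoming arcs
        parent[j] = i
    for v in range(n):
        seen = set()
        while v in parent:      # walk up; revisiting a vertex means a cycle
            if v in seen:
                return False
            seen.add(v)
            v = parent[v]
    return True

# R1(x, y) AND R2(y, z): a single nonkey-to-key join from R1 to R2.
chain = [("R1", ("x",), ("y",)), ("R2", ("y",), ("z",))]
# T1(x, y) AND T2(y, x): a cycle of nonkey-to-key joins.
cycle = [("T1", ("x",), ("y",)), ("T2", ("y",), ("x",))]

print(is_forest(2, join_graph(chain, {"x", "y", "z"})))
print(is_forest(2, join_graph(cycle, {"x", "y"})))
```

Under these assumptions, the chain query has a forest as its join graph, while the cyclic query does not.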

An analogous definition can be given for the join graph of an SPJ SQL query. The
vertices of the graph will be the relation symbols in the from clause of the query.
Furthermore, there will be an arc from relation ri to relation rj if there is an attribute A
in ri such that (1) A is not in the key of ri (it is a nonkey attribute), (2) A does not
appear in the select clause of the query, and A is not equated to any attribute B such
that B appears in the select clause of the query (this corresponds to the notion of
an existentially-quantified variable for conjunctive queries); and (3) there is some
equality in the where clause relating A to some attribute B of rj (i.e., a nonkey-to-key or
nonkey-to-nonkey join).1

We can now give a definition analogous to Cforest for SPJ SQL queries. A query q is

in class Csqlforest if no relation appears twice in the from clause of q, all the nonkey-to-key

joins of q are full, and the join graph of q is a forest.

1 This definition works for repeated relation symbols as well. In such case, we assume that if a relation appears more than once in the from clause, then it is aliased to a new name using the as operator.


We are now ready to give ConQuer’s rewriting algorithm for SPJ queries in Csqlforest.

The algorithm is called RewriteForestSQL and is shown in Figure 6.5. The algorithm

takes as input a SQL query q in Csqlforest and a set of key constraints (one per relation of

the schema), and returns a SQL rewriting Q of q.

In the rewriting Q, the attributes of the relations in q play different roles. In par-

ticular, we will distinguish the attributes that the query projects on (i.e., that appear

in the select clause), and the attributes that appear in the key of a relation that is

at the root of some tree in the join graph of q. In the rest of the discussion, we will

call these attributes projecting attributes, and key-root attributes, respectively. The for-

mer are denoted in Figure 6.5 with the symbols S1, . . . , Sl; the latter are denoted with

K1, . . . , Kn.

The rewriting Q has three subqueries, specified using a with clause:
candidatesSubQuery, countViolSubQuery and countProjSubQuery. The purpose of
candidatesSubQuery is to prune the number of values for the key-root attributes that should be

considered by the other subqueries. In particular, candidatesSubQuery applies the

selection conditions of the original query q, and projects on its key-root attributes. These

attributes are used to perform an inner join in the next subquery (countViolSubQuery).

If the selectivity of q is low (i.e., few tuples satisfy its conditions), and the query optimizer

pushes down the selection conditions of candidatesSubQuery in the query plan, we would

expect the rewriting to have a low overhead with respect to the original query. We validate

this conjecture in Section 7.2.

Let CONDS be the list of conditions in the where clause of q. In the select clause

of countViolSubQuery, we count the number of tuples that violate the conditions of

CONDS, we group by the key-root attributes, and keep the result in an attribute called

countViol as follows:

sum(case when CONDS then 0 else 1 end)

over (partition by K1, . . . , Kn)

as countViol

Notice the use of the partition by clause. This clause (introduced in the OLAP

Amendment to SQL [ISO01]) differs from the typical group by clause in that it permits

grouping by a set of attributes that may not include all the attributes in the select

clause. This is useful here because we “partition by” the key-root attributes, but the

select clause of countViolSubQuery also includes the projecting attributes of the query.

In the main body of the query, we filter out the tuples whose key-root attributes are

involved in a violation of CONDS by checking the condition countViol=0.

The from clause of subquery countViolSubQuery is obtained by calling a procedure

called GetJoinsExpression (shown in Figure 6.6), with the join graph of q and the list

of conditions CONDS as parameters. This procedure consists of two parts. In the first

part, an inner join is computed for the key-to-key joins of relations that are at the root

of some tree of the join graph. Notice that since these relations are in distinct connected

components of the join graph, they are not related by a nonkey-to-key join. In the second

part, the procedure produces a left outer join expression for each tree of the join graph.

This is done by recursively calling the procedure GetTreeJoinsExpression for the nodes

of each tree (also shown in Figure 6.6). The expression returned by GetTreeJoinsExpres-

sion is a left outer join of all relations in the input tree, listed in an order corresponding

to a preorder traversal of the trees.
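The two procedures can be sketched in a few lines. This is a hedged Python reading of Figure 6.6, not ConQuer's code: the function names, the children dictionary, and the join_conds dictionary (which maps a pair of relations to the SQL text of the equalities between them, already extracted from CONDS) are assumptions. We also assume that the quoted “RJOINS and TJOINS” in the figure denotes concatenation of the two expressions.

```python
def get_tree_joins_expression(root, children, join_conds):
    """Left outer joins for the tree rooted at `root`.

    children:   dict relation -> list of child relations in the join graph
    join_conds: dict (parent, child) -> SQL text of the equalities joining them
    """
    kids = children.get(root, [])
    # First join every child of the root, then recurse into each subtree,
    # in the same order as the two loops of the figure.
    parts = [f"left outer join {c} on {join_conds[(root, c)]}" for c in kids]
    for c in kids:
        sub = get_tree_joins_expression(c, children, join_conds)
        if sub:
            parts.append(sub)
    return " ".join(parts)

def get_joins_expression(roots, children, join_conds):
    """From-clause expression for a forest with the given root relations."""
    rjoins = roots[0]
    for prev, cur in zip(roots, roots[1:]):
        # Roots are related only by key-to-key joins, so an inner join suffices.
        rjoins += f" join {cur} on {join_conds[(prev, cur)]}"
    tjoins = [get_tree_joins_expression(r, children, join_conds) for r in roots]
    return " ".join([rjoins] + [t for t in tjoins if t])

# Join graph of query q3 (Example 6.3): employee -> dept, employee -> city -> prov.
q3_children = {"employee": ["dept", "city"], "city": ["prov"]}
q3_join_conds = {
    ("employee", "dept"): "employee.deptFKey=dept.deptKey",
    ("employee", "city"): "employee.cityFKey=city.cityKey",
    ("city", "prov"): "city.provFKey=prov.provKey",
}
q3_from = get_joins_expression(["employee"], q3_children, q3_join_conds)
```

With these inputs, q3_from is the from clause of countViolSubQuery in the rewriting of Example 6.3.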

We will illustrate shortly (in Example 6.4) the rewriting for queries where some of

the key-root attributes do not appear in the select clause (that is, some key-root

attributes are not projecting attributes). We will argue that in such cases, we would

like to count the number of distinct values for the projecting attributes, grouping by

the key-root attributes. We will also show how to do this by using the max aggregate

function (with a partition by clause) and the rank OLAP function. In the algorithm

RewriteForestSQL of Figure 6.5, the rank function is used in the subquery

countViolSubQuery, and the max function is used in the subquery countProjSubQuery.

The result of this aggregation is kept in an attribute called countProjection, which

keeps the count of distinct values for each instantiation of the key-root attributes. This

attribute is used in the main body of the rewriting, where we check countProjection=1.

In the subqueries, we project not only on the projecting attributes S1, . . . , Sl, but

also on the key-root attributes K1, . . . , Kn. However, in the main query of the rewriting

we project only on the attributes S1, . . . , Sl. In this way, the rewritten query Q and the

input query q return tuples for the same set of attributes.

For the sake of clarity, we omitted the order by clause in the query q. However,

dealing with ordering in the rewriting is quite easy. We just need to add the attributes

of the order by of q to the select clause of the subqueries, and include them in the

order by clause of the main body of the rewriting.

Algorithm RewriteForestSQL(q, Σ)

Input: q, a SQL query in Csqlforest of the form
           select <list of attributes>
           from <list of relations>
           where <list of conditions>
       Σ, a set of key constraints (one per relation)
Output: Q, a SQL query that computes consistentΣ(q, I), for every database I

Let S1, . . . , Sl be the attributes in the select clause of q
Let G be the join graph (forest) of q
Let r1, . . . , rm be the relations at the root of all trees of G
Let K1, . . . , Kn be the attributes in the keys of r1, . . . , rm
Let CONDS be the list of conditions in the where clause of q
Let JOINS be the expression obtained by calling the procedure
    GetJoinsExpression(G, CONDS) of Figure 6.6
Let Q be the following SQL query:

    with candidatesSubQuery as (
        select K1 as cK1, . . . , Kn as cKn
        from <list of relations in q>
        where CONDS ),
    countViolSubQuery as (
        select K1, . . . , Kn, S1, . . . , Sl,
            rank() over (partition by K1, . . . , Kn
                         order by S1, . . . , Sl) as rankProjection,
            sum(case when CONDS then 0 else 1 end)
                over (partition by K1, . . . , Kn) as countViol
        from JOINS
        where exists (select * from candidatesSubQuery
                      where K1 = cK1 and . . . and Kn = cKn) ),
    countProjSubQuery as (
        select K1, . . . , Kn, S1, . . . , Sl,
            max(rankProjection) over (partition by K1, . . . , Kn)
                as countProjection,
            countViol
        from countViolSubQuery )
    select distinct S1, . . . , Sl
    from countProjSubQuery
    where countProjection = 1 and countViol = 0

return Q

Figure 6.5: SQL query rewriting algorithm for SPJ queries in Csqlforest
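To make the template of Figure 6.5 concrete, the following Python sketch instantiates it as a string and runs the result on SQLite. It is an illustration under stated assumptions, not ConQuer's implementation: the function name rewrite_forest_sql and its parameters are hypothetical, the caller supplies JOINS directly, and the output columns are given aliases k1, k2, ... and s1, s2, ... (not present in the figure) so that countProjSubQuery can refer to them by name. Window functions require SQLite 3.25 or later.

```python
import sqlite3

def rewrite_forest_sql(select_attrs, key_root_attrs, relations, conds, joins):
    """Fill in the template of Figure 6.5 and return the rewriting as a string."""
    ks = list(enumerate(key_root_attrs, 1))
    ss = list(enumerate(select_attrs, 1))
    keys = ", ".join(key_root_attrs)
    order = ", ".join(select_attrs)
    rels = ", ".join(relations)
    cand_cols = ", ".join(f"{k} as cK{i}" for i, k in ks)
    match = " and ".join(f"{k} = cK{i}" for i, k in ks)
    # Assumption: alias the output columns so later subqueries can name them.
    viol_cols = ", ".join([f"{k} as k{i}" for i, k in ks]
                          + [f"{s} as s{j}" for j, s in ss])
    k_out = ", ".join(f"k{i}" for i, _ in ks)
    s_out = ", ".join(f"s{j}" for j, _ in ss)
    return f"""
with candidatesSubQuery as (
    select {cand_cols} from {rels} where {conds} ),
countViolSubQuery as (
    select {viol_cols},
           rank() over (partition by {keys} order by {order}) as rankProjection,
           sum(case when {conds} then 0 else 1 end)
               over (partition by {keys}) as countViol
    from {joins}
    where exists (select * from candidatesSubQuery where {match}) ),
countProjSubQuery as (
    select {k_out}, {s_out},
           max(rankProjection) over (partition by {k_out}) as countProjection,
           countViol
    from countViolSubQuery )
select distinct {s_out}
from countProjSubQuery
where countProjection = 1 and countViol = 0
"""

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany("insert into employee values (?, ?)",
                 [("John", 1000), ("John", 2000), ("Mary", 1000)])

# q1: select distinct emplKey from employee where salary <= 1000
q1_sql = rewrite_forest_sql(["employee.emplKey"], ["employee.emplKey"],
                            ["employee"], "employee.salary <= 1000", "employee")
q1_rows = conn.execute(q1_sql).fetchall()

# q4: select distinct salary from employee (no conditions, written as 1 = 1)
q4_sql = rewrite_forest_sql(["employee.salary"], ["employee.emplKey"],
                            ["employee"], "1 = 1", "employee")
q4_rows = conn.execute(q4_sql).fetchall()
```

On this instance John has one salary that violates the condition of q1 and two distinct salaries, so only Mary (for q1) and the salary 1000 (for q4) are returned.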

6.2.2 Examples

We now present some examples to illustrate the use of the RewriteForestSQL algorithm.

In the examples, we first show the first-order rewriting that we obtain with the algorithms

of Chapter 3, and then we present the actual SQL query produced by ConQuer.

Selection

In the next example, we illustrate ConQuer’s SQL rewritings with a simple query that

has one selection condition.

Example 6.1. Let R be a schema with our standard employee(emplKey, salary)

relation. Consider a SQL query q1 that retrieves the names of all employees

whose salary is less than or equal to 1000.

q1: select distinct emplKey

from employee

where salary <= 1000

Using the notation for conjunctive queries, q1 can be written as follows:

q1(e) = ∃s.employee(e, s) ∧ s ≤ 1000

A first-order query rewriting that computes the consistent answers to q1 can be

obtained with the algorithms of Chapter 3. In particular, the rewriting returned by

RewriteForest(q1, Σ) is the following:

Q1(e) = ∃s.employee(e, s) ∧ s ≤ 1000 ∧ ∀s′.(employee(e, s′) → s′ ≤ 1000)

Notice that the first and second conjuncts of the first-order rewriting Q1 actually

correspond to the original query q1. Thus, the rewriting starts with a subquery called

candidatesSubQuery that retrieves the employee names that satisfy q1 (and are thus

candidates to be consistent answers).

Algorithm GetJoinsExpression(G, CONDS)

Input: G, a join graph that forms a forest
       CONDS, a list of conditions of the form xθy,
           where θ is some binary comparison operator such as =, ≠, <, etc.
Output: a subexpression of a SQL query

Let r1, . . . , rm be the relations at the root of all trees of G
Initialize RJOINS as the string “r1”
for i := 2 to m do
    Let IJOINS be the conjunction of all join conditions (i.e., equalities) between attributes of ri−1 and ri
    Concatenate “join ri on IJOINS” to RJOINS
end for
Initialize TJOINS as an empty expression
Let T1, . . . , Tm be the trees of G rooted at r1, . . . , rm
for i := 1 to m do
    Concatenate the expression returned by GetTreeJoinsExpression(Ti, CONDS) to TJOINS
end for
return “RJOINS and TJOINS”

Algorithm GetTreeJoinsExpression(T, CONDS)

Input: T, a join graph that forms a tree
       CONDS, a list of conditions of the form xθy,
           where θ is some binary comparison operator such as =, ≠, <, etc.
Output: a subexpression of a SQL query

Initialize LOJOINS as an empty string
if T consists of more than one node r then
    Let r1, . . . , rm be the relations whose root is a child of r
    for i := 1 to m do
        Let IJOINS be the conjunction of all join conditions (i.e., equalities) between attributes of r and ri
        Concatenate “left outer join ri on IJOINS” to LOJOINS
    end for
    for i := 1 to m do
        Let Ti be the subtree of T rooted at ri
        Concatenate the expression returned by GetTreeJoinsExpression(Ti, CONDS) to LOJOINS
    end for
end if
return LOJOINS

Figure 6.6: Procedures to obtain an expression for the joins of a query

with candidatesSubQuery as (

select emplKey

from employee

where salary <= 1000)

Since emplKey is a key of the relation employee, in the repairs, each employee name

will be associated with exactly one salary. However, in the inconsistent database, an

employee name may appear with several different salaries. Thus, the rewriting must

ensure that the employee names in the consistent answers are associated with salaries

satisfying the selection condition of the input query q1 (i.e., that the salary is less than

or equal to 1000) in every tuple of the inconsistent relation employee where the employee

name appears. This is done in Q1 with the expression ∀s′.(employee(e, s′) → s′ ≤ 1000).

It is straightforward to translate this expression into SQL using nested queries and the

not exists construct. However, from our empirical observations in the context of DB2,

we have noticed that such constructs lead in many cases to inefficient queries. Thus,

for the sake of efficiency, the rewritings produced by ConQuer avoid the not exists

construct. One way of doing this is to count, for each employee, the number of salaries

in the inconsistent database that violate the selection condition of q1. If there are no

violations (i.e., the number of salaries violating the condition for the employee is zero),

then the employee name satisfies the selection condition in every tuple of the inconsistent

relation. This can be achieved with the following subquery.

with countViolSubQuery as (

select emplKey,

sum(case

when salary <= 1000 then 0 else 1 end) as countViol

from employee

where exists (select *

from candidatesSubQuery C where

C.emplKey=employee.emplKey)

group by emplKey)

In the above subquery, we count the number of violations for each employee. We keep

this count in an attribute called countViol. The final result of the query consists of the

employee names for which there are no violations (countViol = 0). In the subquery,

for each tuple of employee, we compute a case statement. If the salary in the tuple

is less than or equal to 1000 (i.e., it satisfies the selection condition of q1) we output

a value of zero (meaning no violation). Otherwise, we output 1 (meaning a violating

tuple). The query aggregates these values, summing them up for each employee name.

If the sum for an employee name is zero, that means that there are no violating tuples

involving that employee name. Otherwise, we get the number of violating tuples (hence

the name, countViol). In the main body of the query (which we give below), we return

all employee names that are not involved in any violation.

select emplKey

from countViolSubQuery

where countViol = 0
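The pieces of Example 6.1 can be assembled and run end to end. The sketch below uses Python's sqlite3 module rather than DB2, and the sample instance (John with two conflicting salaries, Mary, and Ann) is an assumption chosen for illustration; the SQL text follows the subqueries given above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany(
    "insert into employee values (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000), ("Ann", 3000)],
)

Q1_ORIGINAL = "select distinct emplKey from employee where salary <= 1000"

Q1_REWRITTEN = """
with candidatesSubQuery as (
    select emplKey from employee where salary <= 1000 ),
countViolSubQuery as (
    select emplKey,
           sum(case when salary <= 1000 then 0 else 1 end) as countViol
    from employee
    where exists (select * from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey )
select emplKey from countViolSubQuery where countViol = 0
"""

# John qualifies in the inconsistent database, but not in the repair
# that keeps the tuple (John, 2000).
original_answers = sorted(r[0] for r in conn.execute(Q1_ORIGINAL))
consistent_answers = sorted(r[0] for r in conn.execute(Q1_REWRITTEN))
```

Here the original query returns John and Mary, while the rewriting returns only Mary, the one employee whose salary satisfies the condition in every tuple where she appears.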

Join

We now present two examples to illustrate the rewriting of queries that contain join

conditions. In the first example, we show the rewriting for a query that has one join

condition. In the second example, we show the rewriting for a query with a more complex

join graph.

Example 6.2. Let R be a schema with relations employee(emplKey, deptFKey), and

dept(deptKey,mgrName). Consider a SQL query q2 that retrieves the names of all

employees whose department appears in the dept relation:

q2: select distinct emplKey

from employee,dept

where employee.deptFKey= dept.deptKey

Notice that q2 has an inner join specified with the condition employee.deptFKey=

dept.deptKey of its where clause. In conjunctive query notation, q2 can be written as

follows.

q2(e) = ∃d,m.employee(e, d) ∧ dept(d,m)

It can be easily checked that q2 is in the class Cforest of conjunctive queries. The

first-order query rewriting obtained by applying the algorithm RewriteForest(q2, Σ) is

the following:

Q2(e) = ∃d,m.employee(e, d) ∧ dept(d,m) ∧ ∀d.(employee(e, d) → ∃m.dept(d,m))

We could translate Q2 to SQL using a not exists construct to achieve the effect of

the universal quantifier. Although this may be a reasonable strategy for a simple query

like q2, we will show in the next example that it leads to deeply nested rewritings when

the original queries have several joins.

We now illustrate how to avoid the not exists construct in the rewritings. As in

the previous example, we can count, for each employee, the number of tuples violating

the conditions of the input query (in this case, the join condition). In order to detect

violations of the join condition employee.deptFKey=dept.deptKey, we need to check

whether there is a tuple in the employee relation whose department is not in the dept re-

lation. This can be achieved by performing a left outer join between the relations as

follows:

with candidatesSubQuery as (

select emplKey

from employee,dept

where employee.deptFKey= dept.deptKey ),

countViolSubQuery as (

select emplKey,

sum(case

when employee.deptFKey=dept.deptKey then 0 else 1 end)

as countViol

from employee left outer join dept

on employee.deptFKey=dept.deptKey

where exists (select *

from candidatesSubQuery C where

C.emplKey=employee.emplKey)

group by emplKey )

select emplKey

from countViolSubQuery

where countViol = 0

Notice that there is a subquery called countViolSubQuery, specified using a with

clause. In this subquery, we count the number of violations for each employee. We keep

this count in an attribute called countViol. The final result of the query consists of the

employee names for which there are no violations (countViol = 0). In the computa-

tion of countViol, we use a case statement. If there is a join with some tuple of the

dept relation, we output a value of zero (meaning no violation). Otherwise, we output 1

(meaning a violating tuple). Notice that we can detect the violations of the (inner) join

of the input query q2 because we are performing a left-outer join in the rewritten query

Q2. Had we performed an inner join in Q2, the tuples that do not join on the department

would have never been “seen” by the case statement.

As in the previous example, the query aggregates the values for countViol, summing

them up for each employee name. If the sum for an employee name is zero, there are no

violating tuples involving that employee name. Otherwise, we get the number of violating

tuples.
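The role of the left outer join can be checked on a small instance. The sketch below is an illustration on SQLite with hypothetical data (employee e1 is assigned to two departments, one of which does not appear in dept); it runs the rewriting of Example 6.2 and, for contrast, a variant in which the left outer join is replaced by an inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (emplKey text, deptFKey text)")
conn.execute("create table dept (deptKey text, mgrName text)")
conn.executemany("insert into employee values (?, ?)",
                 [("e1", "d1"), ("e1", "d9"), ("e2", "d1"), ("e3", "d9")])
conn.execute("insert into dept values ('d1', 'Ann')")

Q2_REWRITTEN = """
with candidatesSubQuery as (
    select emplKey
    from employee, dept
    where employee.deptFKey = dept.deptKey ),
countViolSubQuery as (
    select emplKey,
           sum(case when employee.deptFKey = dept.deptKey
                    then 0 else 1 end) as countViol
    from employee left outer join dept
         on employee.deptFKey = dept.deptKey
    where exists (select * from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey )
select emplKey from countViolSubQuery where countViol = 0
"""

consistent_answers = sorted(r[0] for r in conn.execute(Q2_REWRITTEN))

# With an inner join the dangling tuple (e1, d9) is dropped before the
# case expression is evaluated, so the violation goes unnoticed.
inner_join_variant = Q2_REWRITTEN.replace("left outer join", "join")
inner_join_answers = sorted(r[0] for r in conn.execute(inner_join_variant))
```

Only e2 is a consistent answer; the inner-join variant wrongly returns e1 as well.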

We just illustrated how we can avoid the use of not exists in the SQL rewritings

by performing a left outer join. In the next example, we show why we adopt this strategy

in ConQuer: a naive translation may lead to a deeply nested query, where the level of

nesting may be as large as the number of relations in the from clause of the query.

Example 6.3. Let R be a schema with relations employee(emplKey, deptFKey, cityFKey),

dept(deptKey,mgrName), city(cityKey, provFKey), and prov(provKey, countryName).

Consider a SQL query q3 that retrieves the names of all employees that are located in

Canada and whose manager is Peter:

q3: select distinct emplKey

from employee, city, prov, dept

where employee.cityFKey=city.cityKey

and city.provFKey=prov.provKey

and employee.deptFKey=dept.deptKey

and prov.countryName= "Canada"

and dept.mgrName="Peter"

In conjunctive query notation, q3 can be written as follows.

q3(e) = ∃d, c, m, p.employee(e, d, c) ∧ city(c, p) ∧ prov(p, Canada) ∧ dept(d, Peter)

Figure 6.7: Join graph of query q3.

It can be checked that q3 is in class Cforest. In particular, notice that the join graph of q3

(given in Figure 6.7) is a tree. As shown in Chapter 3, a first-order rewriting of q3 can

be obtained by recursively traversing its join graph. The first-order query rewriting Q3

obtained by applying RewriteForest(q3, Σ) is the following:

Q3(e) = ∃d, c, m, p.employee(e, d, c) ∧ dept(d, m) ∧ city(c, p) ∧ prov(p, Canada) ∧ Q′(e)

where:

Q′(e) = ∃d, c.employee(e, d, c) ∧ ∀d, c.(employee(e, d, c) → (Q′′(c) ∧ QIV(d)))

Q′′(c) = ∃p.city(c, p) ∧ ∀p.(city(c, p) → Q′′′(p))

Q′′′(p) = prov(p, Canada) ∧ ∀w′.(prov(p, w′) → w′ = Canada)

QIV(d) = dept(d, Peter) ∧ ∀u′.(dept(d, u′) → u′ = Peter)

The universal quantifiers can be translated to SQL using the not exists construct.

However, this may lead to an inefficient query. First, because it would have four self

joins (since each relation appears twice in the rewriting). Second, because each recursive

invocation of the algorithm produces a new universal quantifier, and a new subquery

within its scope. For example, Q′′ is under the scope of a universal quantifier for variable

c in Q′, and Q′′′ is under the scope of another universal quantifier (for variable p) in Q′′.

As a consequence, the level of nesting of the SQL rewriting Q3 would be three, which

corresponds to the height of the join graph.

As we showed in the previous example, in ConQuer we avoid using the not exists

construct by performing a left-outer join of the relations in each tree of the join graph.

The SQL rewriting produced by ConQuer in this case is the following:

with candidatesSubQuery as (

select emplKey

from employee,city, prov,dept

where employee.cityFKey=city.cityKey

and city.provFKey=prov.provKey

and employee.deptFKey=dept.deptKey

and prov.countryName= "Canada"

and dept.mgrName="Peter" ),

countViolSubQuery as (

select emplKey,

sum(case

when employee.cityFKey=city.cityKey

and city.provFKey=prov.provKey

and employee.deptFKey=dept.deptKey

and prov.countryName= "Canada"

and dept.mgrName="Peter"

then 0 else 1 end) as countViol

from employee left outer join dept on employee.deptFKey=dept.deptKey

left outer join city on employee.cityFKey=city.cityKey

left outer join prov on city.provFKey=prov.provKey

where exists (select *

from candidatesSubQuery C where

C.emplKey=employee.emplKey)

group by emplKey )

select emplKey

from countViolSubQuery

where countViol = 0

It is important to note that the SQL rewriting has only two subqueries, even though

q3 has four relations, and a join graph with a tree of depth three.
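As a sanity check, the rewriting of q3 can be run on a small inconsistent instance. The sketch below uses SQLite and hypothetical data: e1 is clean, e2 is assigned to two departments (one managed by Mary), e3 works outside Canada, and e4 works in a city whose province has two conflicting country values. String constants are written with single quotes, which is the portable SQL form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table employee (emplKey text, deptFKey text, cityFKey text);
create table dept (deptKey text, mgrName text);
create table city (cityKey text, provFKey text);
create table prov (provKey text, countryName text);
insert into employee values
    ('e1', 'd1', 'Toronto'), ('e2', 'd1', 'Toronto'), ('e2', 'd2', 'Toronto'),
    ('e3', 'd1', 'Boston'), ('e4', 'd1', 'Montreal');
insert into dept values ('d1', 'Peter'), ('d2', 'Mary');
insert into city values ('Toronto', 'ON'), ('Boston', 'MA'), ('Montreal', 'QC');
insert into prov values ('ON', 'Canada'), ('MA', 'USA'),
                        ('QC', 'Canada'), ('QC', 'France');
""")

CONDS = """employee.cityFKey = city.cityKey
       and city.provFKey = prov.provKey
       and employee.deptFKey = dept.deptKey
       and prov.countryName = 'Canada'
       and dept.mgrName = 'Peter'"""

Q3_REWRITTEN = f"""
with candidatesSubQuery as (
    select emplKey from employee, city, prov, dept where {CONDS} ),
countViolSubQuery as (
    select emplKey,
           sum(case when {CONDS} then 0 else 1 end) as countViol
    from employee
         left outer join dept on employee.deptFKey = dept.deptKey
         left outer join city on employee.cityFKey = city.cityKey
         left outer join prov on city.provFKey = prov.provKey
    where exists (select * from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey )
select emplKey from countViolSubQuery where countViol = 0
"""

q3_answers = sorted(r[0] for r in conn.execute(Q3_REWRITTEN))
```

Employee e4 shows that a violation two levels down the join graph (in prov) is still caught by the single flat countViolSubQuery: e4 is a candidate, but the tuple prov(QC, France) contributes a violation.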

Projection and the Need for OLAP Functions

In Example 6.1, we dealt with a query that projects on the key attribute of the relation

employee. If a query does not project on the key attribute, then special care must be

taken in the rewriting. We illustrate this with the next example.

Example 6.4. Let R be a schema with our standard employee(emplKey, salary) rela-

tion. Let q4 be a query that retrieves all salaries (regardless of the employee name).

q4: select distinct salary

from employee

Comparing q4 to q1, the former query does not project on the key attribute emplKey,

and it has no where clause. In conjunctive query notation, q4 can be written as follows.

q4(s) = ∃e.employee(e, s)

The first-order query rewriting obtained by invoking RewriteForest(q4, Σ) is the

following.

Q4(s) = ∃e.employee(e, s) ∧ ∀s′.(employee(e, s′) → s′ = s)

Again, we would like to avoid the naive (but inefficient) translation of Q4 into SQL

that uses the not exists construct. Intuitively, Q4 returns the salaries s for which there

is at least one employee name that is associated to s and only to s in the tuples of the

inconsistent relation employee. In this way, we ensure that salary s will appear in every

repair. One way of writing Q4 in SQL is the following:

select salary

from employee

where emplKey in (

select emplKey

from employee

group by emplKey

having count(distinct salary)=1 )

In our empirical observations, the self join of the above query sometimes leads to

inefficient queries. The self join is needed because we are not including the salary

attribute in the select clause of the subquery. This is not an arbitrary decision. Rather,

it is forced by the syntax of SQL. In SQL, all the attributes of the select clause that are

not arguments of an aggregate function must appear in the group by clause. If we include salary in the select clause of the

subquery, we must also group by it, and hence we are unable to count the number of

distinct salaries per employee name. We will show shortly how we overcome this problem

in ConQuer’s rewritings.
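The self-join formulation of Q4 can be compared against the definition of consistent answers by brute force. The sketch below is illustrative: it runs the query on SQLite and, separately, enumerates every repair of a small hypothetical instance (a repair keeps exactly one tuple per key value) and intersects the answers of q4 over all repairs. The enumeration is exponential and is meant only as a check on tiny inputs.

```python
import itertools
import sqlite3

rows = [("John", 1000), ("John", 2000), ("Mary", 1000)]

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany("insert into employee values (?, ?)", rows)

Q4_SELF_JOIN = """
select distinct salary
from employee
where emplKey in (
    select emplKey
    from employee
    group by emplKey
    having count(distinct salary) = 1 )
"""
sql_answers = {r[0] for r in conn.execute(Q4_SELF_JOIN)}

# Brute force: group the salaries by key, pick one salary per key for each
# repair, evaluate q4 (the set of salaries) on it, and intersect the results.
by_key = {}
for empl, salary in rows:
    by_key.setdefault(empl, []).append(salary)
repairs = itertools.product(*by_key.values())
brute_force_answers = set.intersection(*(set(repair) for repair in repairs))
```

Both computations return the single salary 1000, which is contributed by Mary in every repair.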

We just argued that there are some query rewritings for which there is no obvious way

of avoiding self joins, and that this is caused by the syntax of the group by clause. This

problem was addressed in the OLAP Amendment to the SQL standard [ISO01], which

introduces aggregate functions with a partition by clause. The OLAP Amendment to

the standard has been implemented by all major database vendors. In particular, for

DB2, the standard has been supported since Version 7 (we are using Version 8.2).

The partition by clause is more flexible than group by for two reasons. First, there

can be one partition by clause for each aggregate function, whereas there can only be

one group by for the entire query. Second, unlike group by, the attributes of the select

clause are not required to appear in the partition by clauses of the query. We illustrate

the use of the partition by clause with the next example.

Example 6.5. Consider the following SQL query:

select emplKey,salary,

sum(salary) over (partition by emplKey)

as countProjection

from employee

The query returns triples of values. The first two values of each triple correspond to

employee names and salaries in the relation employee. The last attribute is the sum of

the salaries for the employee name in the tuple. Notice that the attribute emplKey is

in the partition by clause, but the salary attribute is not. So we are projecting on

two attributes (emplKey and salary), but considering only one of them for grouping the

results of the aggregate function. This cannot be done with a group by clause.

Let us finish this example by showing an application of the query to an actual

database. Consider the database I = {employee(John, 1000), employee(John, 2000),

employee(Mary, 1000)}. The result of applying the SQL query above to I is the following:

{(John, 1000, 3000), (John, 2000, 3000), (Mary, 1000, 1000)}.
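The result of Example 6.5 can be reproduced with any engine that supports window functions. The sketch below uses Python's sqlite3 module (window functions require SQLite 3.25 or later) instead of DB2; the order by clause is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany(
    "insert into employee values (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000)],
)

# salary stays in the select list although only emplKey is used to form
# the partitions over which sum(salary) is computed.
triples = conn.execute(
    """
    select emplKey, salary,
           sum(salary) over (partition by emplKey) as countProjection
    from employee
    order by emplKey, salary
    """
).fetchall()
```

Every input tuple is preserved in the output, and each one carries the aggregate of its partition.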

In the next example, we show how the partition by clause could be used in order

to avoid self joins in the rewritings.

Example 6.4. (continued) Recall that we had obtained a rewriting of query q4 that

performs a self join on the employee relation. We can write an equivalent query without

a self join by taking advantage of the partition by clause.

with countProjSubQuery as (

select emplKey,

salary,

count(distinct salary) over (partition by emplKey) as countProj

from employee )

select salary

from countProjSubQuery

where countProj = 1

In the subquery countProjSubQuery, we obtain the number of distinct salaries for

each employee name (which we keep in an attribute called countProj). The rewriting then

returns the salaries of employees for which there is exactly one salary in the database

(countProj = 1).

The query rewriting that we just obtained avoids the use of a self join by using the

partition by clause. Unfortunately, though, this is not the end of the story. The

version of DB2 that we use in ConQuer currently supports the partition by clause for

a variety of aggregate functions (such as sum, min, max, count(*), and avg), but it does

not support the count(distinct) function. Nevertheless, the effect of count(distinct)

can be obtained by combining the use of the max aggregation function (with a partition

by clause) and an OLAP function called rank() as follows.

with rankProjSubQuery as (

select emplKey, salary,

rank() over (partition by emplKey order by salary)

as rankProjection

from employee ),

countProjSubQuery as (

select emplKey, salary,

max(rankProjection) over (partition by emplKey)

as countProjection

from rankProjSubQuery )

select distinct salary

from countProjSubQuery

where countProjection = 1

First, let us explain the use of the rank() function. The syntax of rank() is the

following:

rank() over

(partition by <partition attributes> order by <order attributes>)

The function creates groups for each tuple of values (instantiation) of the attributes

in the partition by clause, as we discussed before for other functions. The tuples of

each group are ordered according to the attributes in the order by clause, and assigned

a number according to their position in this ordering. If there is a tie (in our example,

two tuples with the same employee name and salary), the tuples are mapped to the same

number.

Let us illustrate the semantics of the rank() function in the context of our example

rewriting. Consider a database I = {employee(John, 1000), employee(John, 2000),

employee(Mary, 1000)}. Then, the function rank() over (partition by emplKey

order by salary) would map (John, 1000) to 1, (John, 2000) to 2, and (Mary, 1000)

to 1.

Now consider the instance I as an inconsistent database with respect to Σ (which

contains a constraint stating that emplKey is the key of the employee relation). In

the subquery rankProjSubQuery of the rewritten query, we compute the ranking func-

tion for each tuple and keep the value in an attribute called rankProjection. Then,

in the subquery countProjSubQuery, we obtain the maximum value of the attribute

rankProjection for each employee name, and keep it in an attribute called count-

Projection. Notice that the “grouping” is done by employee names since the attribute

emplKey is in the partition by clause of the max aggregate function. In our example,

we would obtain {(John, 1000, 2), (John, 2000, 2), (Mary, 1000, 1)}. In the final result,

we would like to get salary 1000 because it appears associated with Mary in every re-

pair, but not 2000 because it does not appear in all repairs. We obtain this in the query

rewriting by checking the condition countProjection=1.
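The rank/max rewriting of q4 can be executed as is on SQLite (3.25 or later), which, like the DB2 version discussed above, does not accept distinct inside a window aggregate. The sketch below is an illustration on the instance I; the extra tuples for a hypothetical employee Ann are added at the end to show one caveat. Since rank() leaves gaps after ties, max(rankProjection) coincides with the number of distinct salaries only when a partition has no duplicate (emplKey, salary) pairs; the test countProjection = 1 is unaffected, because the maximum rank is 1 exactly when all salaries in the partition are equal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany(
    "insert into employee values (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000)],
)

SUBQUERIES = """
with rankProjSubQuery as (
    select emplKey, salary,
           rank() over (partition by emplKey order by salary)
               as rankProjection
    from employee ),
countProjSubQuery as (
    select emplKey, salary,
           max(rankProjection) over (partition by emplKey)
               as countProjection
    from rankProjSubQuery )
"""

intermediate = conn.execute(
    SUBQUERIES + "select emplKey, salary, countProjection "
                 "from countProjSubQuery order by emplKey, salary"
).fetchall()

answers = conn.execute(
    SUBQUERIES + "select distinct salary from countProjSubQuery "
                 "where countProjection = 1"
).fetchall()

# Caveat: with duplicate tuples the ranks are 1, 1, 3, so the maximum rank (3)
# exceeds the number of distinct salaries (2).
conn.executemany("insert into employee values (?, ?)",
                 [("Ann", 500), ("Ann", 500), ("Ann", 700)])
ann_max_rank = conn.execute(
    SUBQUERIES + "select distinct countProjection from countProjSubQuery "
                 "where emplKey = 'Ann'"
).fetchall()
```

The intermediate relation matches the one given in the text, and the final answer contains only the salary 1000.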

6.3 ConQuer Rewritings for SPJ Queries with Aggregation

In this section, we present the SQL query rewritings produced by ConQuer for queries

with grouping and aggregation. We first present the algorithm and then illustrate it with

some examples.

6.3.1 Rewriting algorithm

We now present the SQL rewriting algorithm for SPJ queries with aggregation that are

equivalent to the aggregate conjunctive queries in class Caggforest, introduced in Definition

4.1, which we repeat next.

Definition 4.1. Let q be an aggregate conjunctive query. We say that q is in class

Caggforest if q is of the form

select z⃗, [count(*) | F(u)]

from q∗(z⃗, u)

group by z⃗

where q∗ is a conjunctive query in Cforest, and F is one of the aggregation functions

min, max, or sum.

We can now give a definition analogous to Caggforest for SPJ SQL queries with aggre-

gation.

Definition 6.1. We say that query q is in class Csqlaggforest if q is of the form

select S1, . . . , Sl,[count(*)],F1(A1), . . . , Fu(Au)

from <list of relations>

where <list of conditions>

group by S1, . . . , Sl

where S1, . . . , Sl, A1, . . . , Au are attributes of the relations in the from clause, and

F1, . . . , Fu may be any of the aggregation functions min, max, and sum.

We are now ready to give ConQuer’s rewriting for queries in Csqlaggforest. The algorithm

is called RewriteAggSQL, and is shown in Figure 6.8. It takes as input a SQL query q in

class Csqlaggforest and a set of key constraints (one per relation of the schema), and returns

a SQL rewriting Q of q.

In the rewriting Q, the attributes of the relations in q play different roles. As in

the algorithm RewriteForestSQL for queries without aggregation, we have projecting

and key-root attributes. The former are the attributes that q projects on (i.e., that

appear in its select clause), and the latter are the attributes that appear in the key

of a relation that is at the root of some tree in the join graph of q. In addition, in

RewriteAggSQL, we have aggregation attributes, that is, the attributes that appear as

arguments of some aggregation function of q. In Figure 6.8, we denote the projecting

attributes with the symbols S1, . . . , Sl; the key-root attributes with K1, . . . , Kn; and the

aggregation attributes with A1, . . . , Au.

We denote the aggregation functions of q with F1, . . . , Fu. In the figure, we assume

that the 0-ary function count(*) is present in the query (but during the explanation it

will be easy to see what can be dropped if count(*) is not present).

The rewriting Q has five subqueries, specified using a with clause: candidatesSub-

Query, countViolSubQuery, contribAllSubQuery, contribConsistentSubQuery, and

contribNonConsistentSubQuery.

As in the algorithm RewriteForestSQL, the purpose of candidatesSubQuery is to

determine the values for the key-root attributes that should be considered by the other

subqueries. The subquery countViolSubQuery has the same purpose (counting the num-

ber of violations per key-root value) as the subquery of the same name in the rewrit-

ing RewriteForestSQL. One difference is that here we need to compute the attribute

satConds which keeps track of whether each tuple satisfies the conditions of the query

(denoted as CONDS). The other difference is that in the select clause of the subquery,

we must project on the aggregation attributes since their values are needed to perform

aggregation in the rest of the rewriting.

The other three subqueries are used to compute the “contributions” to the lower and

upper bounds of each aggregate result. The subquery contribAllSubQuery computes,

for each instantiation of the key-root and projecting attributes, the minimum and max-

imum value for each aggregation attribute. In particular, in the subquery we group by

K1, . . . , Kn, S1, . . . , Sl (the key-root and projecting attributes), and for each aggregation

Fi(Ai) in the select clause of q, we compute attributes bottomAi and topAi as min(Ai)

and max(Ai), respectively. We also compute an attribute countProjection, to keep

track of the projection on nonkey attributes.

The subqueries contribConsistentSubQuery and contribNonConsistentSubQuery

compute the contribution of the “consistent” and “nonconsistent” tuples to the aggre-

gation. The former are the tuples whose key-root values satisfy the following two con-

ditions. First, they have the same value for the projecting attributes in every tuple

where they appear (checked with condition countProjection = 1). Second, they are

not involved in a violation of the selection conditions CONDS in any of the tuples where

they appear (checked with condition countViol=0). The tuples that violate at least

one of these conditions are considered “nonconsistent” and dealt with in the subquery

contribNonConsistentSubQuery.

For the “consistent” tuples, the contributions computed in contribConsistentSub-

Query correspond to the bottom and top values from contribAllSubQuery. That is,

the attributes bottomAi and topAi of contribAllSubQuery appear in the select clause

of contribConsistentSubQuery. The computation of the contributions of the “noncon-

sistent” tuples is more involved. In contribNonConsistentSubQuery, the expression of

the select clause that handles the contributions is obtained by calling the procedure

GetBoundsNonConsistent given in Figure 6.9. Notice in the figure that the contributions

are different depending on the aggregation function. The rationale and correctness proof

for these contributions were given in Chapter 4. In the figure, we do not include the 0-ary

operator count(*). For this operator, we need to return the attributes bottomCount and

topCount with values of zero and one, respectively.

In the subqueries, we project not only on the projecting attributes S1, . . . , Sl but

also on the key-root attributes K1, . . . , Kn. However, in the main query of the rewriting


we project and group by only the attributes S1, . . . , Sl (i.e., we project out the key-root

attributes). In this way, the rewritten query Q and the input query q return tuples

for the same set of attributes. We also compute the greatest lower bound (glbAi) and

lowest upper bound (lubAi) for each tuple of values for the projecting attributes. This

is obtained by performing the corresponding aggregation function (min, max, or sum) on

the top and bottom values computed in the previous subqueries. For the 0-ary func-

tion count(*), the bounds are computed by summing up the values of the attributes

bottomCount and topCount from the previous subqueries. Notice that there is also a

condition having sum(bottomCount) > 0. This is included in order to ensure that the

tuples for the projecting attributes are consistent answers.

For the sake of clarity, we omitted the order by clause in the query q. However,

dealing with ordering in the rewriting is quite easy. We just need to add the attributes

of the order by clause of q to the select clause of the subqueries, and finally add an

order by clause to the main query. The only special case that must be considered

is when an aggregate attribute appears in the order by clause. Since for each aggregate

attribute of q we have two attributes in the rewritten query (one for each bound), we

must (arbitrarily) decide whether the ordering will be by either the greatest lower or the

lowest upper bound.


Algorithm RewriteAggSQL(q, Σ)

Input: q, a SQL query in Csqlaggforest of the form

    select <list of attributes>, <list of aggregation functions>
    from <list of relations>
    where <list of conditions>
    group by <list of attributes>

  Σ, a set of key constraints (one per relation)

Output: Q, a SQL query that computes aggconsistentΣ(q, I) for every database I

Let F1(A1), . . . , Fu(Au) be the aggregation function applications in the select clause
  of the query, where each Fi is an aggregation function, and each Ai is an attribute
  from a relation that appears in the from clause
Let S1, . . . , Sl be the attributes in the select clause of q (by definition of Csqlaggforest,
  these are the attributes in the group by clause as well)
Let G be the join graph (forest) of q
Let r1, . . . , rm be the relations at the root of some tree of G
Let K1, . . . , Kn be the attributes in the keys of r1, . . . , rm
Let CONDS be the list of conditions in the where clause
Let JOINS be the expression obtained by calling the procedure
  GetJoinsExpression(G, CONDS) of Figure 6.6
Let Q be the following SQL query:

with candidatesSubQuery as (
  select K1 as cK1, . . . , Kn as cKn
  from <list of relations in q>
  where CONDS ),

countViolSubQuery as (
  select K1, . . . , Kn, S1, . . . , Sl, A1, . . . , Au,
    rank() over (partition by K1, . . . , Kn
                 order by S1, . . . , Sl) as rankProjection,
    sum(case when CONDS then 0 else 1 end)
      over (partition by K1, . . . , Kn) as countViol,
    case when CONDS then ‘‘yes’’ else ‘‘no’’ end as satConds
  from JOINS
  where exists (select * from candidatesSubQuery
                where K1 = cK1 and . . . and Kn = cKn) ),


contribAllSubQuery as (
  select K1, . . . , Kn, S1, . . . , Sl,
    min(A1) as bottomA1, max(A1) as topA1, . . . ,
    min(Au) as bottomAu, max(Au) as topAu,
    max(rankProjection) over (partition by K1, . . . , Kn) as countProjection,
    countViol
  from countViolSubQuery
  where satConds=‘‘yes’’
  group by K1, . . . , Kn, S1, . . . , Sl, countViol, rankProjection ),

contribConsistentSubQuery as (
  select K1, . . . , Kn, S1, . . . , Sl,
    bottomA1, topA1, . . . , bottomAu, topAu,
    1 as bottomCount,
    1 as topCount
  from contribAllSubQuery
  where countProjection = 1 and countViol=0 ),

contribNonConsistentSubQuery as (
  select K1, . . . , Kn, S1, . . . , Sl,
    GetBoundsNonConsistent(F1, A1), . . . , GetBoundsNonConsistent(Fu, Au),
    0 as bottomCount,
    1 as topCount
  from contribAllSubQuery
  where countProjection > 1 or countViol >= 1 )

select S1, . . . , Sl,
  F1(bottomA1) as glbA1, F1(topA1) as lubA1, . . . ,
  Fu(bottomAu) as glbAu, Fu(topAu) as lubAu,
  sum(bottomCount) as glbCount, sum(topCount) as lubCount
from
  ( select * from contribConsistentSubQuery
    union all
    select * from contribNonConsistentSubQuery ) q
group by S1, . . . , Sl
having sum(bottomCount)>0

return Q

Figure 6.8: SQL query rewriting algorithm for SPJ queries in Csqlaggforest


Algorithm GetBoundsNonConsistent

Input: Fi, one of the aggregation functions sum, min, max
       Ai, an attribute

Output: a subexpression of a SQL query

if Fi = sum then
    return “case when bottomAi < 0 then bottomAi else 0 end as bottomAi,
            case when topAi > 0 then topAi else 0 end as topAi”
end if
if Fi = min then
    return “bottomAi, 0 as topAi”
end if
if Fi = max then
    return “0 as bottomAi, topAi”
end if

Figure 6.9: Algorithm to obtain the bottom and top contributions of “nonconsistent”

tuples
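The procedure of Figure 6.9 is essentially string generation, and can be sketched in a few lines of Python. This is a minimal illustration and not ConQuer's actual code: the function name get_bounds_non_consistent, its string-based interface, and the convention that attr is a plain identifier are all assumptions made for the example.

```python
def get_bounds_non_consistent(agg_fn: str, attr: str) -> str:
    """Return the select-clause fragment for "nonconsistent" tuples (cf. Figure 6.9).

    agg_fn is one of "sum", "min", "max"; attr is the aggregated attribute name.
    Hypothetical interface, used only to illustrate the case analysis.
    """
    bottom, top = f"bottom{attr}", f"top{attr}"
    if agg_fn == "sum":
        # A nonconsistent tuple may contribute nothing in some repair, so the
        # bounds are clamped at zero: min(bottom, 0) and max(top, 0).
        return (f"case when {bottom} < 0 then {bottom} else 0 end as {bottom}, "
                f"case when {top} > 0 then {top} else 0 end as {top}")
    if agg_fn == "min":
        return f"{bottom}, 0 as {top}"
    if agg_fn == "max":
        return f"0 as {bottom}, {top}"
    raise ValueError(f"unsupported aggregation function: {agg_fn}")
```

For instance, calling get_bounds_non_consistent("max", "Salary") yields the fragment "0 as bottomSalary, topSalary", which would be spliced into the select clause of contribNonConsistentSubQuery.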

6.3.2 Examples

We next illustrate the rewriting for a query that uses the count aggregation function.

Example 6.6. Let R be a schema with relation employee(emplKey, salary, age). Con-

sider a SQL query q5 that, for each age in the database, gives the number of occurrences

of the age on tuples for employees whose salary is less than or equal to 1000.


q5: select age, count(*)

from employee

where salary <= 1000

group by age

In the aggregate conjunctive query notation introduced in Chapter 4, q5 can be written

as follows.

q5(a, cnt) = select a, count(*)

from employee(e, s, a) ∧ s ≤ 1000

group by a

The above query is in the class Caggforest for which we gave a query rewriting algorithm

in Chapter 4. A key idea of that algorithm is to first produce a first-order rewriting for

a conjunctive query, and then perform aggregation on the result of the first-order query.

For our example, this conjunctive query is q′(e, a) = ∃s.employee(e, s, a) ∧ s ≤ 1000. Let
QConsistent(e, a) denote the result of invoking RewriteForest(q′, Σ) (the algorithm

introduced in Chapter 3).

Let Q5 be the query rewriting for q5 obtained by invoking RewriteCount(q5, Σ) (the

algorithm of Figure 4.1 of Chapter 4). In that rewriting, the greatest lower bound is

obtained as follows:

QGlb(a, glb) = select a, count(*)
from QConsistent(e, a)
group by a

Notice that aggregation is performed on the result of the first-order query QConsistent(e, a).

Thus, for computing the greatest lower bound in the SQL rewriting, we can reuse the al-

gorithm RewriteForestSQL introduced in Section 6.2. In particular, we will use the next

two subqueries, which are similar to those that would be produced by RewriteForestSQL(q′, Σ)

(we will show the differences next).

with candidatesSubQuery as (

select emplKey

from employee

where salary <= 1000 )


with countViolSubQuery as (

select emplKey,age,

rank() over (partition by emplKey

order by age) as rankProjection,

sum(case when salary <= 1000 then 0 else 1 end)

over (partition by emplKey) as countViol,

case when salary <= 1000 then ‘‘yes’’ else ‘‘no’’ end

as satConds

from employee

where exists (select *

from candidatesSubQuery C where

C.emplKey=employee.emplKey) )

with contribAllSubQuery as (

select emplKey,age,

max(rankProjection) over (partition by emplKey)

as countProjection,

countViol

from countViolSubQuery

where satConds=‘‘yes’’

group by emplKey,age,countViol,rankProjection )

The above subqueries differ from the ones that would be produced by Rewrite-

ForestSQL in the following aspects. In countViolSubQuery, we compute an attribute

satConds that keeps track of whether each tuple satisfies or violates the selection con-

dition of q5 (i.e., that the salary is less than or equal to 1000). This is different from

the attribute countViol because countViol counts the violations for all tuples where a

key value (employee name, in this case) appears, whereas satConds may take different

values on different tuples of the same employee, depending on the salary that appears in

the tuple. The third subquery corresponds to the subquery countProjSubQuery of the


algorithm RewriteForestSQL, but it has a different name here (contribAllSubQuery)

because, as we will show shortly, it is used to compute the “contribution” of each tuple

to the lower and upper bounds of count(*). In this subquery, we check the condition

satConds=‘‘yes’’. The intuitive reason is that the tuples that do not satisfy the conditions
of q5 (and hence have satConds = ‘‘no’’) contribute neither to the lower nor
to the upper bound of count(*), and should thus be filtered out.

Let us now consider the computation of the lowest upper bound. In the query Q5

returned by RewriteCount, this bound is obtained as follows:

QLub(a, lub) = select a, count(*)

from q′(e, a) ∧ (∃e.QConsistent(e, a))

group by a

In this case, aggregation is done on the result of the following first-order expression:

q′(e, a)∧(∃e.QConsistent(e, a)). The naive way of writing this expression in SQL may be

inefficient because QConsistent already contains q′ as a subexpression. A more efficient

way of writing Q5 in SQL involves computing the “contributions” of each tuple to the

value of count(*), with the two subqueries shown next.

One of the subqueries (called contribConsistentSubQuery) computes the contribu-

tion of the “consistent” tuples. These are the tuples for employees that (1) have the

same age (the attribute in the select clause of q5) in every tuple where they appear;

and (2) are not involved in a violation of the conditions of q5 in any of the tuples where

they appear (i.e., their salary is always less than or equal to 1000). This can be checked

with the condition countProjection = 1 and countViol=0. In addition, the subquery

has attributes bottomCount and topCount that are used in the main body of the query

to combine the contributions of the “consistent” and “nonconsistent” tuples. For the

consistent tuples, the contribution is one to both the lower and upper bounds.

with contribConsistentSubQuery as (

select emplKey,age,
1 as bottomCount,
1 as topCount

from contribAllSubQuery

where countProjection = 1 and countViol=0 )


The other subquery (called contribNonConsistentSubQuery) computes the contri-

butions of the “nonconsistent” tuples. We give this name to the tuples that are not

in the consistent answer of q′, but do satisfy q′. These tuples do not contribute to

the greatest lower bound of count(*), but they may contribute to the lowest upper

bound. In the SQL rewriting, the nonconsistent tuples are captured with the condition

countProjection > 1 or countViol >= 1. In addition, the subquery has attributes

bottomCount and topCount that are used in the main body of the query to combine

the contributions of the “consistent” and “nonconsistent” tuples. For the nonconsistent

tuples, the contribution is zero to the lower bound and one to the upper bound (compare

this to the consistent tuples, which contribute one to both bounds).

with contribNonConsistentSubQuery as (

select emplKey,age,

0 as bottomCount,

1 as topCount

from contribAllSubQuery

where countProjection > 1 or countViol >= 1 )

Finally, the main body of the rewriting sums up the contributions of each tuple to the

lower and upper bounds, and projects out the attribute emplKey. The condition having

sum(bottomCount)>0 is used to ensure that we return ages that are consistent answers.

As we mentioned before, this corresponds to checking the condition ∃e.QConsistent(e, a).

select age,

sum(bottomCount) as glb,

sum(topCount) as lub

from

( select * from contribConsistentSubQuery

union all

select * from contribNonConsistentSubQuery ) q

group by age

having sum(bottomCount)>0
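To see the pieces of Example 6.6 working together, the subqueries can be assembled into a single statement and run on a small instance. The sketch below uses Python's sqlite3 module instead of DB2, which is an assumption made for convenience (it requires an SQLite build with window-function support, version 3.25 or later). The toy instance, the helper name run_q5_rewriting, the use of a single with clause, and the plain SQL quotes around 'yes' and 'no' are likewise choices of the sketch.

```python
import sqlite3

# The rewriting of q5, assembled from the subqueries of Example 6.6.
REWRITING_Q5 = """
with candidatesSubQuery as (
  select emplKey from employee where salary <= 1000 ),
countViolSubQuery as (
  select emplKey, age,
    rank() over (partition by emplKey order by age) as rankProjection,
    sum(case when salary <= 1000 then 0 else 1 end)
      over (partition by emplKey) as countViol,
    case when salary <= 1000 then 'yes' else 'no' end as satConds
  from employee
  where exists (select * from candidatesSubQuery C
                where C.emplKey = employee.emplKey) ),
contribAllSubQuery as (
  select emplKey, age,
    max(rankProjection) over (partition by emplKey) as countProjection,
    countViol
  from countViolSubQuery
  where satConds = 'yes'
  group by emplKey, age, countViol, rankProjection ),
contribConsistentSubQuery as (
  select emplKey, age, 1 as bottomCount, 1 as topCount
  from contribAllSubQuery
  where countProjection = 1 and countViol = 0 ),
contribNonConsistentSubQuery as (
  select emplKey, age, 0 as bottomCount, 1 as topCount
  from contribAllSubQuery
  where countProjection > 1 or countViol >= 1 )
select age, sum(bottomCount) as glb, sum(topCount) as lub
from ( select * from contribConsistentSubQuery
       union all
       select * from contribNonConsistentSubQuery ) q
group by age
having sum(bottomCount) > 0
"""

# Hypothetical instance: e2 and e3 violate the key constraint on emplKey.
TOY_EMPLOYEE = [
    ("e1", 900, 30),                     # consistent, satisfies salary <= 1000
    ("e2", 800, 30), ("e2", 1200, 30),   # one of the two tuples violates the condition
    ("e3", 500, 40), ("e3", 700, 50),    # two different ages for the same key
    ("e4", 2000, 40),                    # never satisfies the condition
]

def run_q5_rewriting(rows):
    """Load rows into an in-memory employee table and run the rewriting."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table employee (emplKey text, salary integer, age integer)")
    conn.executemany("insert into employee values (?, ?, ?)", rows)
    result = conn.execute(REWRITING_Q5).fetchall()
    conn.close()
    return result
```

On this instance, run_q5_rewriting(TOY_EMPLOYEE) returns only age 30, with bounds 1 and 2: e1 always counts, e2 counts only in the repairs that keep its 800 tuple, and ages 40 and 50 are dropped by the having clause because neither is guaranteed to appear in every repair.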

In the next example, we illustrate the rewriting for a query that has the sum aggre-

gation function. The rewritings for the min and max aggregation functions are similar.


Example 6.7. Consider the same schema as in the previous example. Let q6 be a SQL

query that, for each age in the database, gives the sum of all salaries in the database

that are less than or equal to 1000.

q6: select age, sum(salary)

from employee

where salary <= 1000

group by age

The SQL rewriting of q6 is computed by ConQuer along the same lines of the rewriting

for query q5 of the previous example. As in that example, the rewriting starts with three

subqueries: candidatesSubQuery, countViolSubQuery and contribAllSubQuery. The

subquery countViolSubQuery counts the number of violations of the selection condition

for each key value (emplKey), and is the same as in the previous example, except that it

includes the attribute salary in its select clause. The subquery contribAllSubQuery

computes the contribution of all key values to the final result. The only difference with

the previous example is that here we compute the minimum and maximum salary for

each employee (attributes bottomSalary and topSalary). This was not necessary in

the previous example since count(*) is a 0-ary function, whereas sum is a unary function

(in this case, taking the argument salary).

with candidatesSubQuery as (

select emplKey

from employee

where salary <= 1000 )

with countViolSubQuery as (

select emplKey,age,salary,

rank() over (partition by emplKey

order by age) as rankProjection,

sum(case when salary <= 1000 then 0 else 1 end)

over (partition by emplKey) as countViol,

case when salary <= 1000 then ‘‘yes’’ else ‘‘no’’ end

as satConds

from employee


where exists (select *

from candidatesSubQuery C where

C.emplKey=employee.emplKey) )

with contribAllSubQuery as (

select emplKey,age,

min(salary) as bottomSalary,

max(salary) as topSalary,

max(rankProjection) over (partition by emplKey)

as countProjection,

countViol

from countViolSubQuery

where satConds=‘‘yes’’

group by emplKey,age,countViol,rankProjection )

Then, as in the previous example, the rewriting computes the contributions from the

“consistent” and “nonconsistent” tuples. For clarity of presentation, we will assume that

all salaries are positive values (but in the general algorithm we deal with the case of

negative values as well). For the “consistent tuples” (whose contributions are computed

in contribConsistentSubQuery), the bottom and top salaries computed in contribAll-

SubQuery contribute to the greatest lower bounds and lowest upper bounds, respectively.

The top salary also contributes to the lowest upper bound of the “nonconsistent” tuples

(whose contributions are computed in contribNonConsistentSubQuery). However, as

we explained in Chapter 4, the bottom salary does not contribute to the greatest lower

bound. Therefore, the attribute bottomSalary of contribNonConsistentSubQuery gets

a value of zero.

with contribConsistentSubQuery as (

select emplKey,age,

bottomSalary,

topSalary,


1 as bottomCount

from contribAllSubQuery

where countProjection = 1 and countViol=0 )

with contribNonConsistentSubQuery as (

select emplKey,age,

0 as bottomSalary,

topSalary,

0 as bottomCount

from contribAllSubQuery

where countProjection > 1 or countViol >= 1 )

Finally, the main body of the rewriting sums up the contributions of each tuple

to the lower and upper bounds, and projects out the emplKey attribute. Notice that

as in the rewriting for query q5 of the previous example, we have a condition having

sum(bottomCount)>0. This is done because, again, we want to report only the ages that

appear for sure in every repair.

select age,

sum(bottomSalary) as glbSalary,

sum(topSalary) as lubSalary

from

( select * from contribConsistentSubQuery

union all

select * from contribNonConsistentSubQuery ) q

group by age

having sum(bottomCount)>0
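The bounds produced for q6 can be sanity-checked on small instances by enumerating the repairs directly, that is, by keeping exactly one tuple per key value in every possible way. The following Python sketch does this by brute force. It is not part of ConQuer, it is exponential in the number of violated keys, and the toy instance and function names are assumptions introduced for the illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical instance of employee(emplKey, salary, age); e2 and e3 violate the key.
TOY_EMPLOYEE = [
    ("e1", 900, 30),
    ("e2", 800, 30), ("e2", 1200, 30),
    ("e3", 500, 40), ("e3", 700, 50),
    ("e4", 2000, 40),
]

def repairs(rows):
    """Yield every repair: one tuple chosen for each distinct key value."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[0]].append(row)
    for choice in product(*by_key.values()):
        yield list(choice)

def q6(rows):
    """Evaluate q6 on a consistent instance: sum of salaries <= 1000 per age."""
    sums = {}
    for _, salary, age in rows:
        if salary <= 1000:
            sums[age] = sums.get(age, 0) + salary
    return sums

def range_consistent_sum(rows):
    """Return {age: (glb, lub)} for the ages that appear in q6 on every repair."""
    results = [q6(r) for r in repairs(rows)]
    ages = set(results[0])
    for res in results[1:]:
        ages &= set(res)
    return {age: (min(r[age] for r in results), max(r[age] for r in results))
            for age in sorted(ages)}
```

On the toy instance the only age present in every repair is 30, with bounds 900 and 1700. This agrees with the rewriting: e1 is "consistent" and contributes 900 to both bounds, while e2 is "nonconsistent" and contributes 0 to the lower bound and 800 to the upper bound.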

6.4 Exploiting Precomputed Annotations

The main focus of the thesis is on query processing directly on the inconsistent database.

However, in some circumstances, it may be advantageous to process the database offline

in order to materialize data structures with information about constraint violations. This


precomputed data could then be exploited during online query answering to improve the

performance of the queries.

In this section, we will present a simple offline precomputation scheme, and show the

rewritings that ConQuer produces in order to exploit it. The scheme is based on annota-

tions attached to each tuple. The annotation consists of just one bit that states whether

the tuple satisfies or violates a given key constraint. If annotations are present, then
ConQuer can produce a rewriting that exploits them. We call such a rewriting annotation-aware.
In the next example, we illustrate the annotation-aware rewritings. In the next

section, we will identify the scenarios where it is desirable to exploit the annotations, and

we will empirically validate the effectiveness of the annotation-aware rewritings.

Example 6.8. Let R be a schema with relations employee(emplKey, deptFKey) and

dept(deptKey, mgrName). We will give an example based on an SPJ query without aggregation.
However, the example shows all the ingredients of the rewritings on annotated
databases, and extending it to queries with aggregation is straightforward.

Consider a SQL query q7 that retrieves the names of all employees whose department

manager is Peter:

q7: select distinct emplKey

from employee,dept

where employee.deptFKey= dept.deptKey and dept.mgrName=‘‘Peter’’

Consider the database I = {employee(John, Sales), employee(Mary,Engineering),

dept(Sales, Peter), dept(Sales, Tom), dept(Engineering, Peter)}. Suppose that we in-

struct ConQuer to process the database offline and annotate each tuple with a bit stating

whether it satisfies or violates the constraints of Σ. Assume that ConQuer augments the

set of attributes of each relation with an attribute called cons that stores the annotation.

The “annotated database” produced by ConQuer would then be the following.

employee                              dept
emplKey   deptFKey      cons          deptKey       mgrName   cons
John      Sales         y             Sales         Peter     n
Mary      Engineering   y             Sales         Tom       n
                                      Engineering   Peter     y
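The offline annotation step itself is simple: a tuple gets ‘‘y’’ when its key value occurs exactly once in its relation, and ‘‘n’’ otherwise. A minimal Python sketch of this step follows. It assumes a single-attribute key stored at a known position of each tuple, and the function name annotate is hypothetical; how ConQuer actually materializes the cons attribute is not specified here.

```python
from collections import Counter

def annotate(rows, key_index=0):
    """Append 'y' to tuples whose key value is unique in the relation, else 'n'."""
    key_counts = Counter(row[key_index] for row in rows)
    return [row + ("y" if key_counts[row[key_index]] == 1 else "n",)
            for row in rows]

# The database I of Example 6.8, before annotation.
employee = [("John", "Sales"), ("Mary", "Engineering")]
dept = [("Sales", "Peter"), ("Sales", "Tom"), ("Engineering", "Peter")]
```

Applying annotate to employee and dept reproduces the cons columns of the annotated database: both employee tuples are marked ‘‘y’’, the two Sales tuples are marked ‘‘n’’, and the Engineering tuple is marked ‘‘y’’.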


Note that the tuple for Mary in relation employee, and the tuple for Engineering in

relation dept have a value of ‘‘y’’ in their cons attributes, meaning that they do not

violate any constraint. If we join these tuples, we get a tuple that satisfies query q7.

Furthermore, it is easy to see that this will be the only tuple in the result for Mary.

Thus, it must be a consistent answer.

In general, the join of consistent tuples (i.e., tuples where cons = ‘‘y’’) produces

a consistent answer. For such tuples, it suffices to check whether the conditions of the

original query are satisfied (in this example, check that they satisfy q7). In this way, we

can avoid the possibly costly operations of the rewritings produced by the algorithms

RewriteForestSQL and RewriteAggSQL. In the rewriting, we capture these tuples in a

subquery called allConsistentSubQuery (allConsistent because they come from the

join of tuples all of which are consistent). The subquery consists of the input query and a

filter that requires every tuple in the join to have a value of ‘‘y’’ in the cons attribute.

with allConsistentSubQuery as (

select distinct emplKey

from employee,dept

where employee.deptFKey= dept.deptKey and dept.mgrName=‘‘Peter’’

and employee.cons=‘‘y’’ and dept.cons=‘‘y’’ )

Now, note that the tuple for John also satisfies the constraints and has a value of

‘‘y’’ in its cons attribute. However, this tuple joins with the tuples for the Sales

department, which violate the key constraint of their relation (they are annotated with

‘‘n’’). If we join the tuple for John with the tuple dept(Sales, Peter), the result satisfies

q7. But if we join with dept(Sales, Tom), the result does not satisfy the query. Thus,

John is not a consistent answer to q7.

To keep track of the join of tuples that may violate a constraint, we produce a rewrit-

ing that is similar to the one that would be produced by RewriteForestSQL, the only

difference being that we augment the candidatesSubQuery subquery of the rewriting

with a condition checking whether the cons attribute of at least one of the joined tu-

ples is set to ‘‘n’’. In our example, we check the condition employee.cons=‘‘n’’ or

dept.cons=‘‘n’’. The result obtained from these tuples is kept in a subquery called

someNonConsistentSubQuery (the name comes from the fact that some of the tuples of

the join may not be consistent).


with candidatesSubQuery as (

select distinct emplKey

from employee,dept

where employee.deptFKey= dept.deptKey and dept.mgrName=‘‘Peter’’

and (employee.cons=‘‘n’’ or dept.cons=‘‘n’’) )

with countViolSubQuery as (

select emplKey,
sum(case
when employee.deptFKey=dept.deptKey
and dept.mgrName=‘‘Peter’’ then 0 else 1 end)
as countViol
from employee left outer join dept
on employee.deptFKey=dept.deptKey
where exists (select * from candidatesSubQuery C where C.emplKey=employee.emplKey)
group by emplKey )

with someNonConsistentSubQuery as (

select emplKey

from countViolSubQuery

where countViol = 0)

Finally, the main body of the query takes the union of the tuples obtained with the

subqueries allConsistentSubQuery and someNonConsistentSubQuery.

select emplKey from
( select emplKey from allConsistentSubQuery
union all
select emplKey from someNonConsistentSubQuery ) q

Notice that this rewriting is correct even if annotations incorrectly mark a consistent

tuple as inconsistent. Hence, when deleting or updating a tuple, it is not mandatory to

update annotations.
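The annotation-aware rewriting of Example 6.8 can be tried out on the annotated database I. The sketch below again uses Python's sqlite3 module as a stand-in for DB2, which is an assumption of the example. It also writes the subqueries under a single with clause, uses plain SQL quotes, and adds an order by only to make the output deterministic. The helper name run_q7_rewriting is hypothetical.

```python
import sqlite3

# Annotation-aware rewriting of q7, assembled from the subqueries of Example 6.8.
ANNOTATION_AWARE_Q7 = """
with allConsistentSubQuery as (
  select distinct emplKey
  from employee, dept
  where employee.deptFKey = dept.deptKey and dept.mgrName = 'Peter'
    and employee.cons = 'y' and dept.cons = 'y' ),
candidatesSubQuery as (
  select distinct emplKey
  from employee, dept
  where employee.deptFKey = dept.deptKey and dept.mgrName = 'Peter'
    and (employee.cons = 'n' or dept.cons = 'n') ),
countViolSubQuery as (
  select emplKey,
    sum(case when employee.deptFKey = dept.deptKey
              and dept.mgrName = 'Peter' then 0 else 1 end) as countViol
  from employee left outer join dept
    on employee.deptFKey = dept.deptKey
  where exists (select * from candidatesSubQuery C
                where C.emplKey = employee.emplKey)
  group by emplKey ),
someNonConsistentSubQuery as (
  select emplKey from countViolSubQuery where countViol = 0 )
select emplKey from
  ( select emplKey from allConsistentSubQuery
    union all
    select emplKey from someNonConsistentSubQuery ) q
order by emplKey
"""

# The annotated database of Example 6.8.
EMPLOYEE = [("John", "Sales", "y"), ("Mary", "Engineering", "y")]
DEPT = [("Sales", "Peter", "n"), ("Sales", "Tom", "n"), ("Engineering", "Peter", "y")]

def run_q7_rewriting(employee_rows, dept_rows):
    """Load the annotated relations into SQLite and run the rewriting."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table employee (emplKey text, deptFKey text, cons text)")
    conn.execute("create table dept (deptKey text, mgrName text, cons text)")
    conn.executemany("insert into employee values (?, ?, ?)", employee_rows)
    conn.executemany("insert into dept values (?, ?, ?)", dept_rows)
    result = [row[0] for row in conn.execute(ANNOTATION_AWARE_Q7)]
    conn.close()
    return result
```

On the database above the result is just Mary. She is produced by allConsistentSubQuery, whereas John becomes a candidate but is discarded because his join with dept(Sales, Tom) gives countViol = 1.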


6.5 Related Work

In this section, we review systems for managing inconsistent databases that are related

to ConQuer. Hippo [CMS04b, CMS04a] is a system that produces consistent answers

for unions of quantifier-free conjunctive queries (that is, unions of queries in the class

presented by Arenas, Bertossi, and Chomicki [ABC99]). Hippo does not consider queries

with aggregation, grouping or bag semantics. Apart from the class of queries that it can

handle, Hippo differs from ConQuer in the fact that it is not based on query rewriting.

Rather, Hippo takes the more procedural approach of producing a Java program which

computes the consistent answers. Although the program does interact with an RDBMS

back-end, most of the processing is done by processing an (in-memory) conflict graph

data structure that contains all the tuples that violate the constraints. The system may

not be able to operate on databases where this data structure does not fit in memory.

Hippo has been shown to scale to databases of up to 300,000 tuples [CMS04b].

There are a number of systems for consistent query answering that rewrite queries into

powerful logics [CB00, LLR02, EFGL03, CB05]. Infomix [EFGL03] is a notable example

of such an approach. In Infomix, queries are rewritten into disjunctive logic programs.

Such programs are computationally more expensive than SQL, but also more expressive

and permit rewritings over a very rich class of query constraints. For example, Infomix

considers general functional, inclusion, and exclusion query constraints. These systems

focus on expressiveness, more than efficiency and scalability, and therefore address a

different design point than the one we are considering. To give an idea of the scale of

the difference, one of the few experimental studies available in the literature [EFGL03]

reports results for databases with at most 100 tuples violating primary key constraints

(over a database of 50,000 tuples). In contrast, the largest database that we used in the

experiments reported in the next chapter has 8.6 million inconsistent tuples (over a total

of 172 million tuples).

Chapter 7

Experimental Analysis

In this chapter, we validate the efficiency of ConQuer’s rewritings using IBM DB2 UDB

Version 8.2 (from now on, referred to as just DB2). In Section 7.1, we give a detailed

description of the experimental framework. Then, in Section 7.2, we report and analyze

the experimental results obtained within this framework.

7.1 Experimental Framework

7.1.1 System and Database Manager Configuration

The experiments were performed on a Sun v40z server class computer with 4 processors

and 8 GB of RAM, running RedHat Linux AS 4 kernel Version 2.6.9. The relational

database management system used to run the queries was IBM DB2 UDB Version 8.2.

We now describe some important parameters in the database configuration. The

buffer pool size was deliberately kept considerably below the system’s available memory.

This is because our aim is to test the overhead of the queries in environments where the

amount of primary memory is small compared to database size. In particular, the buffer

pool size was restricted to 400 MB (whereas the size of the largest database reported

here is 20 GB).

In order to reduce the number of variables to consider when comparing running

times, the query optimizer was set to use a degree of intra-parallelism (parameter DFT_DEGREE)
of 1, meaning that the query plan always chooses to use one processor, even

though there are four available in the system. The query optimization level, which dic-

tates the amount of time that the query optimizer may spend to produce a query plan,


was set to its highest value (parameter DFT_QUERYOPT was set to 9) since the time to

produce a plan is always negligible with respect to the time to execute the fairly complex

queries that we use in our experiments.

For all databases, statistics were created by running the DB2 RUNSTATS command.

The parameters for statistics gathering were set as follows: the number of “most frequent”

values to be collected from each table (parameter NUM_FREQVALUES) was set to 10;
and the number of quantiles for the distributions (parameter NUM_QUANTILES) was

set to 20.

We created clustered indices for the (potentially violated) primary key attributes.

Notice that these indices cannot be declared as “unique” since the database may be

inconsistent. With respect to the annotations introduced in Section 6.4, we added an

attribute called cons to each table, and used it to keep track of whether each tuple satisfies

or violates the primary key constraints. For each relation, we declared a secondary index

on the attributes of the key plus the cons attribute. The values for the cons attributes

are computed offline. However, it is important to point out that in the experimental

results that we report here, this attribute is used only where we explicitly say that the

rewritings are annotation-aware. By default, we assume that the rewritings work on the

inconsistent database without exploiting precomputed information.

Regarding the indices of the database, we considered a worst-case and a typical sce-

nario. In the worst-case scenario, the only indices in the database are those for the key

attributes and the annotations. We also considered a more typical scenario, where several

indices are declared. In particular, we created all indices suggested by DB2’s Configu-

ration Advisor. In each database, the size of the indices proposed by the Configuration

Advisor corresponds to a third of the size of the database. The indices are shown in

Appendix B.

7.1.2 Inconsistent Database Instances

For the inconsistent databases, we employed the schema and data of TPC-H, the standard

benchmark for decision support systems. The schema is shown in Figure 7.1. The sizes

of the tables are also shown in Figure 7.1 (under their names), and are given in number

of tuples for a 1 GB instance. For example, the relation lineitem has 6 million tuples on

a 1 GB instance. As per the TPC-H standard, all tables except nation and region are

scaled proportionally to the size of the database (this is indicated with SF in the figure).


Figure 7.1: Schema specified in the TPC-H standard (taken from [TPC03])


The parameters used to build the databases are the following:

• The size s of the database. We considered databases of various sizes, up to 20

GB (172 million tuples). Notice that this size is 50 times larger than the size of the

buffer pool of the database (whose size is 400 MB).

• The percentage p of the database that is inconsistent. For example on a 1 GB

instance (8.6 million tuples) where p is 25%, there are 2.15 million tuples that

violate the key constraints of the schema. We created the databases in such a way

that every relation has the same value of p as the entire database. We experimented

with values of p ranging from 0% (totally consistent database) to 25%.

• The number of tuples n that share a common key value (and hence violate a key

constraint), for every key value in the inconsistent portion of the database. For

example, if n = 2, then every key value in the inconsistent portion of the database

appears in exactly two tuples. The value is fixed for every tuple of the inconsistent

portion (i.e., every key value of the database appears exactly one or n times). We

experimented with values of n ranging from 2 to 7.

The TPC Consortium provides a data generator called dbgen that produces database

instances compliant with the standard (see footnote 1). Since the TPC-H standard does not consider

inconsistent databases, dbgen creates instances that do not violate the primary key con-

straints of the schema. For this reason, we modified the source code of dbgen in order to

produce a generator that creates inconsistent databases. The database generator creates

each table as follows. Let l be total number of tuples to be generated in the table. First,

we generate l.(1− p100

+ p100n

) tuples. Second, we randomly select l.p100.n

tuples from them.

Third, for each selected tuple $\vec{t}$, we generate n − 1 additional tuples by invoking the tuple

generation functions of dbgen. We replace the key values of the n − 1 generated tuples

with the key value of $\vec{t}$.
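The three generation steps can be checked with a short calculation. The sketch below is only illustrative: the function name `generation_plan` and its dictionary keys are hypothetical and are not part of dbgen. It uses exact fractions and assumes that l, p, and n are integers.

```python
from fractions import Fraction


def generation_plan(l, p, n):
    """Tuple counts for the three generation steps.

    l: target number of tuples in the table
    p: percentage of the table that is inconsistent (0..100)
    n: number of tuples sharing each inconsistent key value
    """
    # Step 1: tuples generated with distinct key values.
    base = Fraction(l) * (1 - Fraction(p, 100) + Fraction(p, 100 * n))
    # Step 2: tuples selected to become inconsistent.
    selected = Fraction(l) * Fraction(p, 100 * n)
    # Step 3: n - 1 additional tuples per selected tuple, reusing its key value.
    extra = selected * (n - 1)
    return {
        "base": base,
        "selected": selected,
        "extra": extra,
        "total": base + extra,
        "inconsistent": selected * n,
    }


plan = generation_plan(8_600_000, 25, 2)
```

For the 1 GB instance mentioned earlier (8.6 million tuples, p = 25, n = 2), the steps give 7,525,000 base tuples, 1,075,000 selected tuples, and 1,075,000 additional tuples. The total is again 8.6 million, of which 2.15 million tuples violate a key constraint.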

7.1.3 Workload

The experiments were performed using queries specified in the TPC-H standard. There

are twenty-two queries in the standard, twelve of which are aggregate conjunctive queries,

the type of queries that we handle in this work. The other ten queries have features

1 The database generator can be obtained from the TPC Consortium’s website at http://www.tpc.org


that are beyond aggregate conjunctive queries, such as aggregation in nested subqueries

(Queries 2, 11, 15, 17, 18 and 20 of the specification), left outer joins (Query 13), and

negation (Queries 16, 21, and 22).

In our experiments, we focus on eleven queries from the TPC-H specification

(Queries 1, 3, 4, 6, 7, 8, 9, 10, 12, 14, and 19). The original TPC-H queries together with

their rewritings are given in Appendix A. Notice that, of the twelve aggregate conjunctive

queries, we rule out only one: Query 5 of the specification, which contains

a nonkey-to-nonkey join that our query rewriting algorithm cannot handle.

(Following the results of Chapter 5, Query 5 is in class C∗ and thus has no query rewriting

into SQL.) Of the eleven queries that we consider, six are strictly in class Csqlaggforest

(Queries 3, 4, 6, 9, 10, and 12), and the other five can be handled with our rewriting

algorithm RewriteAggSQL with little or no modification for the following reasons. First,

Queries 7 and 8 have repeated relation symbols that appear at leaf nodes of the join

graph. The algorithm RewriteAggSQL can handle this case, since the nonkey variables of

these repeated relation symbols are not involved in any join. Second, Queries 7 and 19

have disjunction involving equalities of attributes to constants. We showed in Chapter

3 that it is quite easy to extend the algorithm that produces a first-order rewriting to

handle this case, and the SQL rewriting algorithm RewriteAggSQL of this chapter can

be used for such cases without modification (the disjunction is considered part of the

selection conditions in the expression CONDS of Figure 6.8). Finally, Queries 8 and

14 perform an arithmetic operation (division) on the result of two aggregate operators,

and Query 1 computes an average. In such cases, we give bounds that are sound, but

not tight.2
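To see why bounds on a quotient of two aggregates can be sound without being tight, consider the following sketch. It assumes, purely for illustration, that the bounds on the quotient are obtained by interval arithmetic from the ranges of the two aggregates; this passage does not state how the bounds are actually derived, and the function `ratio_bounds` is hypothetical.

```python
def ratio_bounds(num_range, den_range):
    """Sound bounds for num / den, given (low, high) ranges for each operand.

    Assumes a non-negative numerator range and a strictly positive
    denominator range, so the quotient is monotone in each operand.
    """
    num_lo, num_hi = num_range
    den_lo, den_hi = den_range
    if num_lo < 0 or den_lo <= 0:
        raise ValueError("expected num >= 0 and den > 0")
    return (num_lo / den_hi, num_hi / den_lo)


# Two hypothetical repairs: (numerator, denominator) aggregate values.
repairs = [(10, 20), (20, 40)]
num_range = (min(a for a, _ in repairs), max(a for a, _ in repairs))
den_range = (min(b for _, b in repairs), max(b for _, b in repairs))

bounds = ratio_bounds(num_range, den_range)
actual = sorted({a / b for a, b in repairs})
```

Here the quotient is 0.5 in both repairs, yet the interval computed from the two ranges is (0.25, 1.0). The interval contains every actual value (it is sound), but it is wider than necessary (not tight) because the numerator and denominator vary together across repairs.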

In Figure 7.2, we summarize the main characteristics of the eleven queries used in

the experiments. For each query, we give the number of relations in the from clause,

the number of selection conditions in the where clause (this excludes join conditions),

the selectivity (as the percentage of joined tuples that satisfy the selection conditions of

the query), the number of projecting attributes in the select clause, and the number of

aggregate functions in the select clause. The queries in the TPC-H specification are

parameterized, and the standard suggests values for these parameters. In the experiments,

we used the suggested values in all the queries. The selectivities reported in Figure 7.2

are based on these parameters.

2 For the queries with the sum operator, all ranges are tight since the queries in the TPC-H standard only aggregate over attributes with positive values.


Query  Relations  Selection   Selectivity  Projecting  Aggregation
                  conditions  (in %)       attributes  functions
Q1     1          1           98.56        2           8
Q3     3          3           0.51         3           1
Q4     2          3           2.35         1           1
Q6     1          4           1.91         0           1
Q7     5          4           0.10         3           1
Q8     7          4           0.04         1           2
Q9     6          1           5.13         2           1
Q10    4          3           1.87         7           1
Q12    2          5           0.51         2           2
Q14    2          2           1.23         0           2
Q19    2          24          0.001        0           1

Figure 7.2: TPC-H queries used in the experiments

7.2 Experimental Results

In this section, we report the results of the experiments that we performed in order to

quantify the overhead of the rewritings produced by ConQuer.

7.2.1 Scalability

In this subsection, we study the scalability of ConQuer’s approach. In particular, we

show the effect of the size of the inconsistent databases on the overhead of the rewritten

queries. In Figure 7.3, we report the overhead of the eleven rewritten queries on a number

of databases where we fix the degree of inconsistency to 5% of the database (p = 5%), and

2 conflicts per inconsistent key value (n = 2). The size of the databases (reported on the

x-axis) ranges from 1 GB to 20 GB (that is, from 8.6 million tuples to 172 million tuples).

The databases are generated independently of each other, and correspond to the scenario

where indices are created only for the key attributes. On the y-axis, we report the

overhead of the rewritten queries, computed as the ratio of the running time of

the rewritten query to the running time of the original (non-rewritten) query. The

rewritings reported in the figure do not exploit annotations (i.e., they are unaware of

annotations, if any, computed as explained in Section 6.4).


For presentation purposes, we split the queries into three graphs. The queries are

grouped based on the behaviour of the overhead as the size of the databases increases.

The graph at the top shows queries where the overhead initially increases, but then

remains constant or decreases (Queries 1, 7, 12, 14). The graph in the middle shows

queries where the overhead increases monotonically with the size of the database (Queries

3, 8, 10). The rest of the queries are shown in the graph at the bottom (Queries 4, 6, 9,

19).

We identified two factors that have a significant impact on the overhead of the

rewritings: the selectivity of the original queries, and the query plans chosen by DB2’s

optimizer. Let us start with the selectivity of the queries. To understand its effect, recall

that in the SQL rewriting algorithm RewriteAggSQL of Figure 6.8, there is a subquery

called candidatesSubQuery that is designed to exploit the selectivity of the original

queries. In particular, this subquery returns only the values for the root-key attributes

that satisfy the conditions of the original query. More specifically, let q be a query,

K1, . . . , Kn be the attributes that appear at some root of the join graph of q, and CONDS

be the selection conditions of q. Then, the rewriting produced by RewriteAggSQL(q, Σ)

has a subquery of the following form:

with candidatesSubQuery as (

select K1 as cK1,. . . ,Kn as cKn

from <list of relations in q>

where CONDS )

Clearly, the lower the selectivity of the original query q, the fewer tuples are returned

by candidatesSubQuery. The rest of the rewriting operates on the result of the following

subquery called countViolSubQuery.

with countViolSubQuery as (
  select K1, . . . , Kn, S1, . . . , Sl, A1, . . . , Au,
         rank() over (partition by K1, . . . , Kn
                      order by S1, . . . , Sl) as rankProjection,
         sum(case when CONDS then 0 else 1 end)
             over (partition by K1, . . . , Kn) as countViol,
         case when CONDS then 'yes' else 'no' end as satConds
  from JOINS
  where exists (select * from candidatesSubQuery
                where K1 = cK1 and . . . and Kn = cKn) )

[Three plots of Overhead (y-axis) against Size in GB (x-axis). Top: Queries 1, 7, 12, 14. Middle: Queries 3, 8, 10. Bottom: Queries 4, 6, 9, 19.]

Figure 7.3: Size of the inconsistent database vs. overhead (running time of rewritten

query over running time of original query) for p = 5% and n = 2

Notice that the where clause of the subquery restricts the focus to the tuples that

join with those returned by candidatesSubQuery. Since all further processing in the

rewriting is done on the result of countViolSubQuery, the selectivity of the original

query q significantly affects the running time of the rewriting.
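The interplay of the two subqueries can be mimicked in a few lines of Python. This is a deliberately simplified sketch, not the RewriteAggSQL algorithm: it assumes a single relation with a one-attribute key, a query that returns only the key, and no aggregation, so rankProjection is not needed. The function name `consistent_keys` and the toy data are hypothetical.

```python
from collections import defaultdict


def consistent_keys(rows, cond):
    """Keys all of whose tuples satisfy cond.

    rows: list of (key, value) tuples; cond: predicate on (key, value).
    """
    # Analogue of candidatesSubQuery: keys with at least one satisfying tuple.
    candidates = {key for key, value in rows if cond(key, value)}

    # Analogue of countViolSubQuery: restricted to candidate keys, count the
    # tuples per key that do NOT satisfy the conditions.
    count_viol = defaultdict(int)
    for key, value in rows:
        if key in candidates:
            count_viol[key] += 0 if cond(key, value) else 1

    # A key is a consistent answer if none of its tuples violates the conditions.
    return sorted(key for key in candidates if count_viol[key] == 0)


rows = [
    (1, 2000),             # key 1: a single tuple that satisfies the condition
    (2, 1500), (2, 500),   # key 2: conflicting tuples, one fails the condition
    (3, 3000), (3, 1200),  # key 3: conflicting tuples, both satisfy it
    (4, 100),              # key 4: never a candidate
]
result = consistent_keys(rows, lambda key, value: value > 1000)
```

With the condition value > 1000, keys 1, 2, and 3 are candidates, but key 2 has one violating tuple, so the result is [1, 3]. Key 4 is filtered out by the candidate step and is never examined again, which is the effect that a selective original query has on the cost of the rewriting.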

We can see in Figure 7.2 that the selectivity of Query 1 is much higher than the

selectivity of all the other queries. More specifically, Query 1 has a selectivity of 98.5%,

whereas the highest selectivity of the other ten queries is 5.1% (Query 9). This explains

the high overhead of the rewriting of Query 1, which goes up to 5.8 times the running

time of the original query on the 20 GB database. The selectivity also explains the low

overhead of Query 19. In this case, the overhead of the rewriting goes up to just 1.2

times the running time of the original query on the 20 GB instance. Notice in Figure 7.2

that this query has a considerably lower selectivity than all other queries: 0.001%. Thus, in

effect, the computation of candidatesSubQuery accounts for most of the running time

of the rewriting, with the computation of the other subqueries having a negligible cost.

We also observed that the query plans selected by DB2 have an effect on the

overhead. For example, all queries involve lineitem, the largest relation of the TPC-H

database, which contains 70% of all tuples in the database. Except for Queries 4 and

10, the running time of all queries (and their rewritings) is dominated by the size of the

lineitem relation. In particular, for all those queries, DB2 selects plans that involve a

costly table scan of the lineitem relation. In contrast, for Queries 4 and 10 (and their

rewritings), the running time is dominated by the size of the smaller relation orders

for the following reasons. First, the plans involve a table scan of relation orders, with

the access to lineitem being done through its clustered index. Second, a low selectivity

predicate is applied on the tuples retrieved from orders, which are then joined with

those coming from lineitem. Thus, only a very small fraction of the tuples of lineitem

are actually accessed. We conjecture that for this reason most of the processing of both

the original and rewritten queries can be done in main memory, hence the low overhead

of the rewritings of Queries 4 and 10.

The low overhead of Query 6 (with a maximum of 2.1 on the 10 GB instance) can be

explained in terms of the shape of its rewriting. Notice in Figure 7.2 that this is a query

on one relation (hence no joins), and with a relatively low selectivity. Furthermore, it


does not perform any grouping (it has no projecting attributes) and computes just one

aggregate function. This results in a simpler and more efficient rewriting. In particular,

the attributes countProjection and rankProjection of the rewriting do not need to

be computed.

In Figure 7.3, we can observe three trends in the growth of the overhead as we increase

the size of the instances. For some queries, the overhead increases slowly with the size

of the instances (Queries 4, 6, 10, and 19). These are the low-overhead rewritings, and

thus the processing of both the original queries and their rewritings can be done mostly

in main memory. For others, the overhead increases monotonically at a relatively high

rate (Queries 3 and 8). A possible explanation for this behaviour is that the original

queries can do most of their processing in main memory, whereas this is not the case for

the more costly rewritings. Finally, for another group of queries (Queries 1, 7, 9, 12, and

14), the overhead grows initially, and then either remains constant or decreases. The

reason is that as the size of the databases grows, the amount of available main memory

becomes small not only for the rewritten queries but also for the original queries. Hence,

the rate of growth of the ratio between them diminishes.

For Query 9, we slightly modified the query rewriting produced by RewriteAggSQL (the

modified rewriting is equivalent to the one produced by RewriteAggSQL). The reason for

this is that for the rewriting obtained with RewriteAggSQL, DB2 was producing a very

inefficient query plan. For example, on a 2 GB database, the running time of the rewriting

was 28 times the running time of the original query.

We detected that the problem of the rewriting produced by RewriteAggSQL was in

the subquery candidatesSubQuery. To understand the reason, let us show a simplified

version of Query 9:

select n_name as nation,
       l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity
from part, supplier, lineitem, partsupp, orders, nation
where s_suppkey = l_suppkey
  and ps_suppkey = l_suppkey
  and ps_partkey = l_partkey
  and p_partkey = l_partkey
  and o_orderkey = l_orderkey
  and s_nationkey = n_nationkey
  and p_name like '%green%'

The subquery candidatesSubQuery produced by RewriteAggSQL is the following:

with candidatesSubQuery as (
  select l_orderkey, l_linenumber
  from part, supplier, lineitem, partsupp, orders, nation
  where s_suppkey = l_suppkey
    and ps_suppkey = l_suppkey
    and ps_partkey = l_partkey
    and p_partkey = l_partkey
    and o_orderkey = l_orderkey
    and s_nationkey = n_nationkey
    and p_name like '%green%' )

An important observation is that if we modify candidatesSubQuery, the rewriting

will still be correct (i.e., compute the consistent answers of Query 9) as long as

candidatesSubQuery still returns the tuples that are candidates to be consistent

answers, i.e., the tuples that satisfy the selection conditions of Query 9. Based on this

observation, we modified the candidatesSubQuery subquery produced by RewriteAggSQL, and

found that DB2 then produced a more efficient query plan. In particular, we removed

the relation partsupp from the from clause of candidatesSubQuery and the conditions

ps_suppkey = l_suppkey and ps_partkey = l_partkey from its where clause.
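The modification just described can be written out mechanically. The helper below is a hypothetical illustration (it is not part of ConQuer): it assembles the text of candidatesSubQuery for the simplified Query 9 from a list of relations and conditions, optionally dropping one relation and every condition that starts with a given attribute prefix.

```python
RELATIONS = ["part", "supplier", "lineitem", "partsupp", "orders", "nation"]
CONDITIONS = [
    "s_suppkey = l_suppkey",
    "ps_suppkey = l_suppkey",
    "ps_partkey = l_partkey",
    "p_partkey = l_partkey",
    "o_orderkey = l_orderkey",
    "s_nationkey = n_nationkey",
    "p_name like '%green%'",
]


def build_candidates(relations, conditions, dropped=None, dropped_prefix=None):
    """Return the SQL text of candidatesSubQuery.

    dropped: relation to remove from the from clause (or None).
    dropped_prefix: conditions starting with this prefix are removed (or None).
    """
    rels = [r for r in relations if r != dropped]
    conds = [
        c for c in conditions
        if dropped_prefix is None or not c.startswith(dropped_prefix)
    ]
    return (
        "with candidatesSubQuery as (\n"
        "  select l_orderkey, l_linenumber\n"
        f"  from {', '.join(rels)}\n"
        "  where " + "\n    and ".join(conds) + " )"
    )


original = build_candidates(RELATIONS, CONDITIONS)
modified = build_candidates(RELATIONS, CONDITIONS,
                            dropped="partsupp", dropped_prefix="ps_")
print(modified)
```

The modified subquery joins only part, supplier, lineitem, orders, and nation. Because it has fewer join conditions, it can only return more candidate keys than the original subquery, never fewer, which is consistent with the observation that the rewriting stays correct.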

The overhead reported in Figure 7.3 corresponds to the modified rewriting. Notice

that we do not provide a value for the 20 GB database. The reason is that the execution

of the original Query 9 on the 20 GB database timed out in our experiments because

DB2 came up with a particularly inefficient plan, different from the one chosen for the

other instances.

Besides the peculiarities of each query, an important conclusion of these experiments

is that the query rewritings can scale to large database instances. Even for an instance

of 20 GB (172 million tuples) the overhead of the queries ranges from 1.2 (Query 19) to

5.8 (Query 1). This is remarkable if we take into account that the semantics of consistent

query answering is much more involved than the semantics of traditional query answering.

Let us now consider the rewritings that exploit annotations, as explained in Section

6.4. In our experiments, the only rewriting that benefited substantially from the

annotations was the one on Query 1. The other queries do not benefit from annotations due to

their low selectivity. Recall that Query 1 has a high selectivity of 98.5%, as opposed to

all other queries, whose selectivity is at most 5.1% (Query 9). Since the annotations (in

particular the cons attribute) are used in the where clause of one of the subqueries of the

annotation-aware rewriting, they are in effect reducing the selectivity of the rewriting,

thereby having a more significant impact on the queries with high selectivity.

[Plot of Overhead (y-axis) against Size in GB (x-axis), with two series: Query 1 with annotations and Query 1 without annotations.]

Figure 7.4: Size of the inconsistent database vs. overhead of the rewritings that exploit

and do not exploit annotations for Query 1 (for an instance where p = 5% and n = 2).

In Figure 7.4, we focus on Query 1, and we compare the overhead of the annotation-

aware rewriting with the rewriting which does not exploit annotations. As in the previous

figure, we fix the degree of inconsistency to 5% of the database (p = 5%) and the number

of conflicts per inconsistent key value to 2 (n = 2). The size of the databases (reported on

the x-axis) ranges, as before, from 1 to 20 GB. The databases correspond to the scenario

where indices are created for the key attributes and the annotations. On the y-axis, we

report the overhead of the queries, computed as we explained above.

It can be observed that we get a substantial gain by exploiting the annotations. For

example, on the 20 GB instance, the overhead of the rewriting which does not exploit

annotations is 5.8, whereas the overhead of the annotation-aware rewriting is 3.3. That

is, exploiting the annotations reduces the running time of the rewriting to about 57% of its value without annotations (a reduction of about 43%).
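The relation between the two overheads can be made explicit with a short calculation. The sketch assumes that the running time of the original Query 1 is the same in both scenarios, so that the ratio of the overheads equals the ratio of the running times of the two rewritings. The function name is hypothetical.

```python
def relative_running_time(overhead_with, overhead_without):
    """Running time with annotations as a fraction of the time without them.

    Assumes both overheads are measured against the same original running time.
    """
    return overhead_with / overhead_without


ratio = relative_running_time(3.3, 5.8)
percent_of_original_rewriting = round(ratio * 100)   # share of time that remains
percent_saved = round((1 - ratio) * 100)             # share of time that is saved
```

With the overheads reported for the 20 GB instance, the annotation-aware rewriting takes about 57% of the time of the rewriting without annotations, which corresponds to a saving of about 43%.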

Finally, we performed experiments on databases where indices are created by following

the suggestions of DB2’s Configuration Advisor, in addition to the indices on the

key attributes. In Figure 7.5, we report the overhead of the eleven rewritten queries on

a number of databases where we fix the degree of inconsistency to 5% of the database

(p = 5%), and 2 conflicts per inconsistent key value (n = 2). The size of the databases

(reported on the x-axis) ranges from 1 to 20 GB. On the y-axis, we report the overhead

of the rewritten queries, computed as explained above. The rewritings do not exploit

annotations. The indices suggested by the Configuration Advisor are shown in the

appendix.

For presentation purposes, we present the queries in three graphs. Notice the different

(linear) scales of the graphs. The graphs at the top and center show queries with low

overhead, whereas the one at the bottom shows queries where the overhead is much

higher.

In the graph at the bottom of Figure 7.5, we can observe a sharp spike in the overhead

of Query 14 on the 5 GB database. The overhead jumps from 2.1 on the 3 GB database

to 25.5 on the 5 GB database; and then decreases to 3.2 on the 10 GB database. This is

due to an index of the 5 GB database that is particularly beneficial to the original query:

an index on attributes (l_shipdate, l_discount,

l_extendedprice, l_partkey) of the lineitem relation. The index is not present on any of the other databases.

There is a similar situation for Query 6. In this case, the overhead jumps from 5.1 on the

2 GB database to 31.2 on the 3 GB database. The overhead stays high on the 5 and 10

GB databases (28.3 and 33.5, respectively) and finally decreases sharply on the 20 GB

database to a value of 2.8.

For Queries 8, 9, and 19, we observe the opposite behaviour: the overhead lies below

one. That is, the indices benefit the rewritten query considerably more than the

original query. This is most noticeable for Query 9, whose overhead is 0.05 on the 2 GB

database, and 0.04 on the 5 GB database. Notice that the overhead behaves differently

on the 1, 3, and 10 GB databases, where the original query runs faster than the rewritten

query (the overhead is above 1). We do not report the overhead for the 20 GB database

since, as occurred in the scenario with only key constraints, the original query times out.

Excluding the above exceptions, the overhead of all queries is comparable with the

overhead in the scenario where there are indices only for the key constraints. For example,

on the 20 GB database, the overhead of all queries ranges from 1.2 (Query 19) to 5.8

(Query 1) on the databases with just key constraints; and from 1.06 (Query 19) to 5.4

(Query 1) on the databases with indices suggested by the Configuration Advisor.

[Plots of Overhead (y-axis) against Size in GB (x-axis). Legible panel legends: Queries 3, 4, 7, 8; Queries 1, 6, 14.]

Figure 7.5: Size of the inconsistent database vs. overhead (running time of rewritten

query over running time of original query) for p = 5% and n = 2 using indices suggested

by Configuration Advisor


7.2.2 Effect of Degree of Inconsistency

In this subsection, we study the effect of the degree of inconsistency of the databases on

the performance of ConQuer’s rewritings. We consider the two parameters that determine

the degree of inconsistency: the percentage of the database being inconsistent (p), and

the number of conflicts per inconsistent key value (n).

In Figure 7.6, we report the overhead of the eleven queries on a number of databases

where we fix the size to 3 GB and the number of inconsistencies per key value to 2 (n = 2).

The percentage of inconsistency of the databases (reported on the x-axis) ranges from

0 (totally consistent database) to 25% (a quarter of the database being inconsistent).

On the y-axis, we report the overhead of the rewritten queries, computed as the ratio

of the running time of the rewritten query to the running time of the original

(non-rewritten) query. The rewritings reported in the figure do not exploit annotations

(i.e., they are unaware of annotations, if any). All the databases correspond to the

scenario where indices are created only for the key attributes.

We observed that the overhead is not considerably influenced by the percentage of

inconsistency. This is reasonable since in the rewriting we do not make a distinction

between tuples that violate or satisfy the constraints. In the figure, we can see an

anomaly for Query 14, with its overhead sharply decreasing from 0 to 1%, and then

sharply increasing from 1 to 5%. The reason for this is that, for the rewritten query

and the 1% inconsistent database, DB2 chooses a different plan. In particular, for the

rewritten query on all databases except the 1% inconsistent, DB2 chooses a plan that

includes one table scan of the lineitem relation and a join that accesses lineitem

through its clustered index. For the 1% inconsistent database, DB2 chooses a different

plan that involves two table scans of lineitem and the application of a low selectivity

predicate in each case. In this case, the alternative plan turns out to be a good choice:

the overhead becomes lower than in the other cases.

In Figure 7.7, we turn our attention to the number of conflicts per inconsistent key

value. In particular, we report the overhead of the eleven queries on a number of databases

where we fix the size to 1 GB and the percentage of inconsistency to 5% (p = 5%). The

number of conflicts per inconsistent key value (reported on the x-axis) ranges from 1

(totally consistent database) to 7. On the y-axis, we report the overhead of the rewritten

queries, computed as in the other figures. The rewritings considered in the figure do not

exploit annotations.

[Three plots of Overhead (y-axis) against Percentage of inconsistency (x-axis). Panel legends: Queries 1, 3, 4, 6; Queries 7, 8, 9, 10; Queries 12, 14, 19.]

Figure 7.6: Percentage of inconsistency vs. overhead (running time of rewritten query

over running time of original query) for instances of 3 GB and n = 2


As with the percentage of inconsistency, we observed that the number of conflicts per

key value does not have a considerable effect on the overhead of the rewritten queries.

The only exception is Query 9, where the overhead decreases significantly as the number

of conflicts increases. We detected that this is because DB2’s optimizer was choosing

different plans on different instances. In particular, the plan chosen for the original

query on the database where n = 7 is so inefficient that it runs more slowly than the

corresponding rewritten query (and, hence, the overhead falls below 1).

[Three plots of Overhead (y-axis) against Number of inconsistent tuples per key value (x-axis). Panel legends: Queries 1, 3, 4, 6; Queries 7, 8, 9, 10; Queries 12, 14, 19.]

Figure 7.7: Number of conflicts per inconsistent key value (n) vs. overhead (running

time of rewritten query over running time of original query) for instances of 1 GB and

p = 5%

Chapter 8

Conclusions and Future Work

In this thesis, we presented ConQuer, a system for query answering over inconsistent

databases. We showed the correctness of ConQuer’s rewritings for a broad class of Select-

Project-Join queries with set and bag semantics, and with grouping and aggregation. We

also showed the maximality of the class of queries from a complexity-theoretic point of

view. The efficiency and scalability of the approach were empirically validated with an

extensive set of experiments on a commercial database system.

The assumptions of our work can be relaxed in different directions. For example,

we assumed that the set of constraints that might be violated consists exclusively of

key dependencies. It would be interesting to consider foreign key dependencies as well.

In this way, we would be covering the most common constraints that are supported by

commercial database systems. We are also interested in other constraints, for example

constraints arising from business rules (e.g., a rule saying that a car insurance policy

cannot be held by people who are younger than 18 years old). Regarding the data

model, ConQuer currently works on relational databases. An obvious extension is to

provide support for semi-structured data, such as XML documents. With respect to

queries, we would like to support more expressive query languages, where queries may

have disjunction and negation. We note that this direction of research has been started

recently by Lembo, Rosati and Ruzzi [LRR06], who extend our class Cforest to consider

unions of conjunctive queries.

In this work, we provide exact algorithms that compute all the consistent answers

to a query. We would also like to explore approximation algorithms [Vaz01]. For

example, we could compute results where some consistent answers may be missing. For

Select-Project-Join queries, we could give a formal guarantee on the number of

potentially missing tuples. For queries with aggregation, we could also give formal guarantees

about the ranges for the aggregate functions. An interesting question is whether the

query rewriting algorithms used by ConQuer can be used as a building block of the

approximation algorithms.

It is easy to see that, in general, queries under the consistent answers semantics do

not compose. That is, the consistent answers of a first query cannot be used to compute

the consistent answers of other queries. However, it may be possible to produce auxiliary

data when executing the first query that could be used in turn to obtain the result of other

queries. We would like to characterize what kind of auxiliary information is necessary for

the composition of different classes of queries. One application of these results would be

for OLAP queries [CD97], where the computation of, e.g., roll-up operations is usually

done by composing queries.
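A tiny example makes the lack of composition concrete. The sketch below computes consistent answers by brute force, enumerating every repair of a relation whose key is its first attribute (a repair keeps exactly one tuple per key value). This enumeration is exponential and is used only to illustrate the semantics; it is unrelated to the query rewriting approach of this thesis, and all names in it are hypothetical.

```python
from itertools import product


def repairs(rows):
    """Yield every repair of rows, where the key is the first column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    # One tuple is chosen per key value.
    for choice in product(*groups.values()):
        yield set(choice)


def consistent_answers(query, rows):
    """Answers that query returns on every repair of rows."""
    results = [query(repair) for repair in repairs(rows)]
    return set.intersection(*results)


def q1(relation):
    """select k, a from R"""
    return set(relation)


def q2(relation):
    """select k from (input)"""
    return {row[0] for row in relation}


R = [(1, "x"), (1, "y")]  # two tuples that conflict on key value 1

# Consistent answers of q1, then q2 evaluated on top of them.
step_by_step = q2(consistent_answers(q1, R))
# Consistent answers of the composed query q2(q1(R)).
composed = consistent_answers(lambda rel: q2(q1(rel)), R)
```

Neither tuple of R is a consistent answer of q1, so evaluating q2 on the consistent answers of q1 yields the empty set. The composed query, however, returns key value 1 in every repair, so its consistent answer is {1}.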

ConQuer currently deals with inconsistencies that occur after the source data has

been transformed to conform to the schema of the integrated database. The problem of

creating the integrated database is called data exchange, and has recently been formalized

by Fagin, Kolaitis, Miller, and Popa [FKMP05]. In this framework, we are given a

source schema, a target schema, and a mapping, which is a declarative specification of

a transformation. Mappings are unidirectional in the data exchange framework, going

from the source to the target schema. The goal is, given a source database, to materialize

a target database that satisfies the mapping. We, together with other authors, have

proposed a generalization of the data exchange framework, called peer data exchange

[FKMT06], where the mapping may be bidirectional (source-to-target and target-to-

source). An important problem in the context of peer data exchange is the existence-

of-solutions problem, which consists of deciding whether it is actually possible to obtain

a target database that satisfies the mapping. Interestingly, the problem of computing

consistent answers under key constraints can be reduced to the existence-of-solutions

problem in the context of peer data exchange, where the key constraints are encoded in

the mapping [Fux04]. This reduction may contribute to the potential application of the

techniques presented in this thesis to the context of peer data exchange.

ConQuer provides an interface that enables the user to gradually clean the database.

In particular, when a query is submitted, the system shows the clean answers together

with a query explanation. The explanation can be extremely valuable, since it often

points to underlying errors in the database that require attention from the user. For key

constraints, the only actions that a user may perform are deleting or modifying tuples.


However, if other constraints are covered in the future, the explanations could trigger

more complex transformations on the database. There are interesting questions as to

how to specify such transformations using, for example, Extract-Transform-Load tools.

Nowadays, there are mature data integration and database management products

in the market. In our opinion, these products should be tightly coupled, with data

integration tools producing databases that are potentially inconsistent and precisely

characterizing the inconsistency, and database systems exploiting the knowledge about the

inconsistencies to produce better answers. We expect the results in this thesis to be an

initial step in this direction.

Bibliography

[ABC99] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in incon-

sistent databases. In Symposium on Principles of Database Systems (PODS),

pages 68–79, 1999.

[ABC00] M. Arenas, L. Bertossi, and J. Chomicki. Specifying and querying database

repairs using logic programs with exceptions. In International Conference

on Flexible Query Answering Systems, pages 27–41, 2000.

[ABC03a] M. Arenas, L. Bertossi, and J. Chomicki. Answer sets for consistent query

answering in inconsistent databases. Theory and Practice of Logic Program-

ming, 3(4-5):392–424, 2003.

[ABC+03b] M. Arenas, L. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. Spinrad.

Scalar Aggregation in Inconsistent Databases. Theoretical Computer Science,

296:405–434, 2003.

[AD98] S. Abiteboul and O. M. Duschka. Complexity of answering queries using ma-

terialized views. In Symposium on Principles of Database Systems (PODS),

pages 254–263, 1998.

[AFM06] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty

databases: a probabilistic approach. In International Conference on Data

Engineering (ICDE), 2006. Paper 30.

[AHV95] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-

Wesley, 1995.

[AKG87] S. Abiteboul, P. Kanellakis, and G. Grahne. On the representation and

querying of sets of possible worlds. In ACM International Conference on the

Management of Data (SIGMOD), pages 34–48, 1987.

160

Bibliography 161

[AKWS95] S. Agarwal, A. Keller, G. Wiederhold, and K. Saraswat. Flexible relation:

An approach for the integration of data from multiple, possible inconsistent

databases. In International Conference on Data Engineering (ICDE), pages

495–504, 1995.

[ATMS04] P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. Limbo: Scalable

clustering of categorical data. In International Conference on Extending

Database Technology (EDBT), pages 123–146, 2004.

[Bal91] B. Balzer. Tolerating inconsistency. In International Conference on Software

Engineering (ICSE), pages 158–165, 1991.

[BB03a] P. Barcelo and L. Bertossi. Logic programs for querying inconsistent

databases. In International Symposium on Practical Aspects of Declarative

Languages, pages 208–222, 2003.

[BB03b] L. Bravo and L. Bertossi. Logic programs for consistently querying data

integration systems. In International Joint Conference on Artificial Intelligence

(IJCAI), pages 10–15, 2003.

[BBFL05] L. Bertossi, L. Bravo, E. Franconi, and A. Lopatenko. Fixing numerical

attributes under integrity constraints. In International Symposium on Database

Programming Languages (DBPL), pages 262–278, 2005.

[BC03] L. Bertossi and J. Chomicki. Logics for Emerging Applications of Databases,

chapter Query Answering in Inconsistent Databases, pages 43–83. Springer,

2003.

[Ber06] L. Bertossi. Consistent query answering in databases. ACM SIGMOD

Record, 35(2):68–76, 2006. Database Principles column.

[BKT01] P. Buneman, S. Khanna, and W. Tan. Why and where: A characterization of

data provenance. In International Conference on Database Theory (ICDT),

pages 316–330, 2001.

[BMFR05] P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and

effective heuristic for repairing constraints by value modification. In ACM

International Conference on the Management of Data (SIGMOD), pages

143–154, 2005.

[BMP92] D. Barbará, H. Garcia-Molina, and D. Porter. The management of

probabilistic data. IEEE Transactions on Knowledge and Data Engineering (TKDE),

4:487–502, 1992.

[CB00] A. Celle and L. Bertossi. Querying inconsistent databases: Algorithms and

implementation. In Computational Logic (CL), pages 942–956, 2000.

[CB05] M. Caniupan and L. Bertossi. Optimizing repair programs for consistent

query answering. In International Conference of the Chilean Computer

Science Society, pages 3–12, 2005.

[CD97] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP

technology. ACM SIGMOD Record, 26(1):65–74, 1997.

[CLR03a] A. Calì, D. Lembo, and R. Rosati. On the decidability and complexity of

query answering over inconsistent and incomplete databases. In Symposium

on Principles of Database Systems (PODS), pages 260–271, 2003.

[CLR03b] A. Calì, D. Lembo, and R. Rosati. Query rewriting and answering under

constraints in data integration systems. In International Joint Conference

on Artificial Intelligence (IJCAI), pages 16–21, 2003.

[CM77] A. Chandra and P. Merlin. Computable queries for relational databases. In

ACM Symposium on the Theory of Computing (STOC), pages 77–90, 1977.

[CM05] J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance

using tuple deletions. Information and Computation, 197(1-2):90–121, 2005.

[CMS04a] J. Chomicki, J. Marcinkowski, and S. Staworko. Computing Consistent

Query Answers using Conflict Hypergraphs. In International Conference

on Information and Knowledge Management (CIKM), pages 417–426, 2004.

[CMS04b] J. Chomicki, J. Marcinkowski, and S. Staworko. Hippo: A System for

Computing Consistent Answers to a Class of SQL Queries. In International

Conference on Extending Database Technology (EDBT), pages 841–844, 2004.

[CNS99] S. Cohen, W. Nutt, and A. Serebrenik. Rewriting aggregate queries using

views. In Symposium on Principles of Database Systems (PODS), pages

155–166, 1999.

[CNS03] S. Cohen, W. Nutt, and Y. Sagiv. Containment of aggregate queries. In

International Conference on Database Theory (ICDT), pages 111–125, 2003.

[CP87] R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In

International Conference on Very Large Databases (VLDB), pages 71–81,

1987.

[CV93] S. Chaudhuri and M. Vardi. Optimization of real conjunctive queries. In

Symposium on Principles of Database Systems (PODS), pages 59–70, 1993.

[CW03] Y. Cui and J. Widom. Lineage tracing for general data warehouse transfor-

mations. Very Large Databases (VLDB) Journal, 12(1):41–58, 2003.

[DeM89] L. DeMichiel. Resolving database incompatibility: An approach to

performing relational operations over mismatched domains. In IEEE Transactions

on Knowledge and Data Engineering (TKDE), pages 485–493, 1989.

[DJ03] T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John

Wiley, 2003.

[DS04] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases.

In International Conference on Very Large Databases (VLDB),

pages 864–875, 2004.

[EFGL03] T. Eiter, M. Fink, G. Greco, and D. Lembo. Efficient Evaluation of Logic

Programs for Querying Data Integration Systems. In International

Conference on Logic Programming (ICLP), pages 163–177, 2003.

[FFM05a] A. Fuxman, E. Fazli, and R. J. Miller. ConQuer: Efficient management of

inconsistent databases. In ACM International Conference on the Management

of Data (SIGMOD), pages 155–166, 2005.

[FFM05b] A. Fuxman, D. Fuxman, and R. J. Miller. ConQuer: A system for

efficient querying over inconsistent databases. In International Conference on Very

Large Databases (VLDB), pages 1354–1357, 2005.

[FFP05] S. Flesca, F. Furfaro, and F. Parisi. Consistent query answers on

numerical databases under aggregate constraints. In International Symposium on

Database Programming Languages (DBPL), pages 279–294, 2005.

[FKMP05] R. Fagin, P. Kolaitis, R. Miller, and L. Popa. Data exchange: semantics and

query answering. Theoretical Computer Science, 336(1):89–124, 2005.

[FKMT06] A. Fuxman, P. Kolaitis, R. J. Miller, and W. Tan. Peer data exchange. ACM

Transactions on Database Systems, 2006. To appear in a special issue with

selected papers from PODS 2005.

[FM05] A. Fuxman and R. J. Miller. First-order query rewriting for inconsistent

databases. In International Conference on Database Theory (ICDT), pages

337–351, 2005.

[FM06] A. Fuxman and R. J. Miller. First-order query rewriting for inconsistent

databases. Journal of Computer and System Sciences (JCSS), 2006. To

appear.

[FPL+01] E. Franconi, A. Laureti Palma, N. Leone, S. Perri, and F. Scarcello. Census

data repair: a challenging application of disjunctive logic programming. In

Logic for Programming, Artificial Intelligence, and Reasoning (LPAR), pages

561–578, 2001.

[FR97] N. Fuhr and T. Rölleke. A probabilistic relational algebra for the

integration of information retrieval and database systems. ACM Transactions on

Information Systems, 15(1):32–66, 1997.

[Fux04] A. Fuxman. A survey of the applications of schema mapping and the certain

answers semantics. Technical Report CSRG-541, University of Toronto, 2004.

Available at ftp://ftp.cs.toronto.edu/cs/ftp/pub/reports/csrg/541.

[GGZ01] G. Greco, S. Greco, and E. Zumpano. A logic programming approach to

the integration, repairing and querying of inconsistent databases. In

International Conference on Logic Programming (ICLP), pages 348–364, 2001.

[GLRR05] L. Grieco, D. Lembo, R. Rosati, and M. Ruzzi. Consistent query

answering under key and exclusion dependencies: Algorithms and experiments.

In International Conference on Information and Knowledge Management

(CIKM), pages 792–799, 2005.

[GM96] S. Grumbach and T. Milo. Towards tractable algebras for bags. In Journal

of Computer and System Sciences (JCSS), volume 52, pages 570–588, 1996.

[GR95] P. Gärdenfors and H. Rott. Handbook of Logic in Artificial Intelligence and

Logic Programming, volume 4, chapter Belief Revision, pages 35–132. Oxford

University Press, 1995.

[GRT99] S. Grumbach, M. Rafanelli, and L. Tininini. Querying aggregate data.

In Symposium on Principles of Database Systems (PODS), pages 174–184,

1999.

[GZ00] S. Greco and E. Zumpano. Querying inconsistent databases. In Logic for

Programming, Artificial Intelligence, and Reasoning (LPAR), pages 308–325,

2000.

[HK75] J. Hopcroft and R. M. Karp. An O(n^2.5) algorithm for maximum matching

in bipartite graphs. SIAM Journal on Computing, 2:225–231, 1975.

[HLNW01] L. Hella, L. Libkin, J. Nurmonen, and L. Wong. Logics with aggregate

operators. Journal of the ACM, 48(4):880–907, 2001.

[IR95] Y. Ioannidis and R. Ramakrishnan. Containment of conjunctive queries:

Beyond relations as sets. ACM Transactions on Database Systems,

20(3):288–324, 1995.

[ISO01] ISO. SQL - part 2: Foundation (SQL/Foundation) - amendment 1: On-line

analytical processing (SQL/OLAP). Technical Report 9075-2-1999/Amd1-

2001, INCITS/ISO/IEC, 2001.

[IvdMV95] T. Imielinski, R. van der Meyden, and K. Vadaparty. Complexity tailored

design: A new design methodology for databases with incomplete information.

Journal of Computer and System Sciences (JCSS), 51(3):405–432, 1995.

[Lad75] R. E. Ladner. On the structure of polynomial time reducibility. Journal of

the ACM, 22(1):155–171, 1975.

[Lev81] H. Levesque. A Formal Treatment of Incomplete Knowledge Bases. PhD

thesis, University of Toronto, 1981.

[Lip79] W. Lipski. On semantic issues connected with incomplete information

databases. ACM Transactions on Database Systems, 4(3):262–296, 1979.

[Lip81] W. Lipski. On databases with incomplete information. Journal of the ACM,

28(1):41–70, 1981.

[LLR02] D. Lembo, M. Lenzerini, and R. Rosati. Source inconsistency and

incompleteness in data integration. In International Workshop on Knowledge

Representation meets Databases (KRDB), 2002.

[LLRS97] L. Lakshmanan, N. Leone, R. Ross, and V. Subrahmanian. Probview: A

flexible probabilistic database system. ACM Transactions on Database Systems,

22(3):419–469, 1997.

[LM96] J. Lin and A. Mendelzon. Merging databases under constraints. International

Journal of Cooperative Information Systems, 7(1):55–76, 1996.

[LRR06] D. Lembo, R. Rosati, and M. Ruzzi. On the first-order reducibility of unions

of conjunctive queries over inconsistent databases. In International

Workshop on Inconsistency and Incompleteness in Databases, pages 17–32, 2006.

[LW95] L. Libkin and L. Wong. On representation and querying incomplete

information in databases with bags. Information Processing Letters, 56(4):209–214,

1995.

[LW97] L. Libkin and L. Wong. Query languages for bags and aggregate functions.

Journal of Computer and System Sciences (JCSS), 55(2):241–272, 1997.

[Moo85] R. Moore. Formal Theories of the Commonsense World, chapter A Formal

Theory of Knowledge and Action, pages 319–358. 1985.

[NER00] B. Nuseibeh, S. Easterbrook, and A. Russo. Leveraging inconsistency in

software development. IEEE Computer, 33(4):24–29, 2000.

[Ost70] P. Ostrand. Systems of distinct representatives. Journal of Mathematical

Analysis and Applications, 32:1–4, 1970.

[TPC03] Transaction Processing Performance Council: TPC. TPC Benchmark H

(Decision Support). Standard Specification Revision 2.1.0, 2003.

[Vaz01] V. Vazirani. Approximation Algorithms. Springer, 2001.

[vdM98] R. van der Meyden. Logical approaches to incomplete information: A survey.

In Logics for Databases and Information Systems, pages 307–356. Kluwer,

1998.

[Wij05] J. Wijsen. Database repairing using updates. ACM Transactions on

Database Systems, 30(3):722–768, 2005.

Appendix A

TPC-H Queries and their Rewritings

The following are the queries from the TPC-H standard [TPC03] that we employed in

our experiments, together with their rewritings.

TPC-H Query 1

select

l_returnflag,

l_linestatus,

sum(l_quantity) as sum_qty,

sum(l_extendedprice) as sum_base_price,

sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,

sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,

avg(l_quantity) as avg_qty,

avg(l_extendedprice) as avg_price,

avg(l_discount) as avg_disc,

count(*) as count_order

from

lineitem

where

l_shipdate <= date('1998-12-01') - 90 DAYS

group by

l_returnflag,

l_linestatus

order by

l_returnflag,

l_linestatus;

Rewritten Query 1

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

lineitem

where

l_shipdate <= date('1998-12-01') - 90 DAYS

),

contribAllSubQuery as (

select

l_returnflag,

l_linestatus,

max(l_quantity) as max_qty,

max(l_extendedprice) as max_extendedprice,

max(l_extendedprice * (1 - l_discount)) as max_disc_price,

max(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as max_charge,

max(l_discount) as max_disc,

min(l_quantity) as min_qty,

min(l_extendedprice) as min_extendedprice,

min(l_extendedprice * (1 - l_discount)) as min_disc_price,

min(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as min_charge,

min(l_discount) as min_disc,

condWhereViol,

condWhereSat,

max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj

from

(select

l_orderkey,

l_linenumber,

l_returnflag,

l_linestatus,

l_quantity,

l_extendedprice,

l_discount,

l_tax,

rank() over (partition by l_orderkey,l_linenumber

order by l_returnflag,l_linestatus)

as rankProj,

sum(case

when l_shipdate <= date('1998-12-01') - 90 days then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as condWhereViol,

case

when l_shipdate <= date('1998-12-01') - 90 days then 1 else 0 end

as condWhereSat

from lineitem li

where

exists (select * from candidatesSubQuery sc

where li.l_orderkey=sc.l_orderkey and

li.l_linenumber=sc.l_linenumber)

) q

where condWhereSat = 1

group by l_orderkey,l_linenumber,l_returnflag,l_linestatus,condWhereViol,

condWhereSat,rankProj),

contribConsistentSubQuery as (

select

l_returnflag,

l_linestatus,

max_qty,

max_extendedprice,

max_disc_price,

max_charge,

max_disc,

min_qty,

min_extendedprice,

min_disc_price,

min_charge,

min_disc,

1 as countConsistent

from

contribAllSubQuery Cand

where condWhereViol = 0 and countProj=1),

contribNonConsistentSubQuery as (

select

l_returnflag,

l_linestatus,

max_qty,

max_extendedprice,

max_disc_price,

max_charge,

max_disc,

0 as min_qty,

0 as min_extendedprice,

0 as min_disc_price,

0 as min_charge,

0 as min_disc,

0 as countConsistent

from

contribAllSubQuery Cand

where condWhereViol >= 1 or countProj > 1)

select

l_returnflag,

l_linestatus,

sum(max_qty) as max_sum_qty,

sum(max_extendedprice) as max_sum_base_price,

sum(max_disc_price) as max_sum_disc_price,

sum(max_charge) as max_sum_charge,

sum(max_qty)/sum(countConsistent) as max_avg_qty,

sum(max_extendedprice)/sum(countConsistent) as max_avg_price,

sum(max_disc)/sum(countConsistent) as max_avg_disc,

count(*) as max_count_order,

sum(min_qty) as min_sum_qty,

sum(min_extendedprice) as min_sum_base_price,

sum(min_disc_price) as min_sum_disc_price,

sum(min_charge) as min_sum_charge,

sum(min_qty)/sum(countConsistent) as min_avg_qty,

sum(min_extendedprice)/sum(countConsistent) as min_avg_price,

sum(min_disc)/sum(countConsistent) as min_avg_disc,

sum(countConsistent) as min_count_order

from

(select * from contribConsistentSubQuery

union all

select * from contribNonConsistentSubQuery) q

group by

l_returnflag,

l_linestatus

having sum(countConsistent)>0

order by

l_returnflag,

l_linestatus;

TPC-H Query 3

select

l_orderkey,

sum(l_extendedprice * (1 - l_discount)) as revenue,

o_orderdate,

o_shippriority

from

customer,

orders,

lineitem

where

c_mktsegment = 'BUILDING'

and c_custkey = o_custkey

and l_orderkey = o_orderkey

and o_orderdate < '1995-03-15'

and l_shipdate > '1995-03-15'

group by

l_orderkey,

o_orderdate,

o_shippriority

order by

revenue desc,

o_orderdate

fetch first 10 rows only;

Rewritten Query 3

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

customer,

orders,

lineitem

where

c_mktsegment = 'BUILDING'

and c_custkey = o_custkey

and l_orderkey = o_orderkey

and o_orderdate < '1995-03-15'

and l_shipdate > '1995-03-15'

),

contribAllSubQuery as (

select

l_orderkey,

l_linenumber,

o_orderdate,

o_shippriority,

min(l_extendedprice * (1 - l_discount)) as min_revenue,

max(l_extendedprice * (1 - l_discount)) as max_revenue,

1 as min_count,

max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj,

cond_viol,

cond_sat

from

(select

l_orderkey,

l_linenumber,

o_orderdate,

o_shippriority,

l_extendedprice,

l_discount,

rank() over (partition by l_orderkey,l_linenumber

order by o_orderdate,o_shippriority)

as rankProj,

sum(case

when c_mktsegment = 'BUILDING'

and c_custkey = o_custkey

and o_orderdate < '1995-03-15'

and l_shipdate > '1995-03-15'

then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as cond_viol,

case when c_mktsegment = 'BUILDING'

and c_custkey = o_custkey

and o_orderdate < '1995-03-15'

and l_shipdate > '1995-03-15'

then 1 else 0 end as cond_sat

from orders o1 JOIN lineitem l ON l_orderkey = o1.o_orderkey

LEFT OUTER JOIN customer ON c_custkey=o1.o_custkey

where

exists (select * from candidatesSubQuery sc

where l.l_orderkey=sc.l_orderkey and l.l_linenumber=sc.l_linenumber)

) q

where cond_sat=1

group by

l_orderkey,

l_linenumber,

o_orderdate,

o_shippriority,

cond_viol,cond_sat,rankProj

),

contribConsistentSubQuery as (

select l_orderkey,

l_linenumber,

o_orderdate,

o_shippriority,

min_revenue,

max_revenue,

min_count

from

contribAllSubQuery Cand

where

countProj = 1 and cond_viol=0),

contribNonConsistentSubQuery as (

select

l_orderkey,

l_linenumber,

o_orderdate,

o_shippriority,

0 as min_revenue,

max_revenue,

0 as min_count

from

contribAllSubQuery Cand

where

countProj > 1 or cond_viol >= 1)

select

l_orderkey,

o_orderdate,

o_shippriority,

sum(min_revenue) as sum_min_revenue,

sum(max_revenue) as sum_max_revenue

from

(select * from contribNonConsistentSubQuery

union all

select * from contribConsistentSubQuery) as q

group by

l_orderkey,

o_orderdate,

o_shippriority

having sum(min_count)>0

order by

sum_min_revenue desc,

o_orderdate

fetch first 10 rows only;

TPC-H Query 4

select

o_orderpriority,

count(*) as order_count

from

orders

where

o_orderdate >= '1993-07-01'

and o_orderdate < date('1993-07-01') + 3 MONTHS

and exists (

select *

from

lineitem

where

l_orderkey = o_orderkey

and l_commitdate < l_receiptdate

)

group by

o_orderpriority

order by

o_orderpriority;

Rewritten Query 4

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

orders, lineitem

where

o_orderdate >= '1993-07-01'

and o_orderdate < date('1993-07-01') + 3 MONTHS

and l_orderkey = o_orderkey

and l_commitdate < l_receiptdate

),

contribAllSubQuery as (

select

l_orderkey,

l_linenumber,

o_orderpriority,

1 as min_count,

max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj,

cond_viol,

cond_sat

from

(select

l_orderkey,

l_linenumber,

o_orderpriority,

rank() over (partition by l_orderkey,l_linenumber

order by o_orderpriority) as rankProj,

sum(case

when l_commitdate < l_receiptdate and

o_orderdate >= '1993-07-01'

and o_orderdate < date('1993-07-01') + 3 MONTHS

then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as cond_viol,

case when l_commitdate < l_receiptdate and

o_orderdate >= '1993-07-01'

and o_orderdate < date('1993-07-01') + 3 MONTHS

then 1 else 0 end as cond_sat

from orders, lineitem li

where

l_orderkey = o_orderkey

and

exists (select * from candidatesSubQuery sc

where li.l_orderkey=sc.l_orderkey and li.l_linenumber=sc.l_linenumber)

) q

where cond_sat=1

group by

l_orderkey,

l_linenumber,

o_orderpriority,

cond_viol,cond_sat,rankProj),

contribConsistentSubQuery as (

select l_orderkey,

l_linenumber,

o_orderpriority,

min_count

from

contribAllSubQuery Cand

where

countProj =1 and cond_viol=0),

contribNonConsistentSubQuery as (

select

l_orderkey,

l_linenumber,

o_orderpriority,

0 as min_count

from

contribAllSubQuery Cand

where

countProj > 1 or cond_viol >= 1)

select

o_orderpriority,

count(*) as max_order_count,

sum(min_count) as min_order_count

from

(select * from contribNonConsistentSubQuery

union all

select * from contribConsistentSubQuery) as q

group by

o_orderpriority

having sum(min_count)>0

order by

o_orderpriority;

TPC-H Query 6

select

sum(l_extendedprice * l_discount) as revenue

from

lineitem

where

l_shipdate >= '1994-01-01'

and l_shipdate < date('1994-01-01') + 1 YEAR

and l_discount >= 0.06 - 0.01

and l_discount <= 0.06 + 0.01

and l_quantity < 24;

Rewritten Query 6

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

lineitem

where

l_shipdate >= '1994-01-01'

and l_shipdate < date('1994-01-01') + 1 YEAR

and l_discount >= 0.06 - 0.01

and l_discount <= 0.06 + 0.01

and l_quantity < 24

),

contribAllSubQuery as (

select l_orderkey,

l_linenumber,

min(l_extendedprice * l_discount) as min_revenue,

max(l_extendedprice * l_discount) as max_revenue,

cond_viol,

cond_sat

from (

select

l_orderkey,

l_linenumber,

l_extendedprice,

l_discount,

sum(case

when l_shipdate >= '1994-01-01'

and l_shipdate < date('1994-01-01') + 1 YEAR

and l_discount >= 0.06 - 0.01

and l_discount <= 0.06 + 0.01

and l_quantity < 24

then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as cond_viol,

case when l_shipdate >= '1994-01-01'

and l_shipdate < date('1994-01-01') + 1 YEAR

and l_discount >= 0.06 - 0.01

and l_discount <= 0.06 + 0.01

and l_quantity < 24

then 1 else 0 end as cond_sat

from

lineitem li

where

exists (select * from candidatesSubQuery sc

where li.l_orderkey=sc.l_orderkey

and li.l_linenumber=sc.l_linenumber) ) q

where cond_sat=1

group by l_orderkey,

l_linenumber, cond_viol,cond_sat),

contribConsistentSubQuery as (

select l_orderkey,

l_linenumber,

min_revenue,

max_revenue

from

contribAllSubQuery Cand

where

cond_viol=0),

contribNonConsistentSubQuery as (

select l_orderkey,

l_linenumber,

0 as min_revenue,

max_revenue

from

contribAllSubQuery Cand

where cond_viol >= 1

)

select

sum(min_revenue) as min_sum_revenue,

sum(max_revenue) as max_sum_revenue

from

(select * from contribNonConsistentSubQuery

union all

select * from contribConsistentSubQuery) as q;
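
The bounds returned by Rewritten Query 6 can be sanity-checked on a toy instance. The following Python sketch is illustrative only and is not part of ConQuer. It assumes a simplified lineitem table with integer columns, a selection condition reduced to l_quantity < 24, non-negative revenue values, and a condensed form of the rewriting that uses plain aggregates instead of window functions. The data, the function names bounds_by_rewriting and bounds_by_repairs, and the condensed SQL are all hypothetical. The sketch compares the result of the condensed rewriting with a brute-force enumeration of all repairs, where a repair keeps exactly one tuple per key (l_orderkey, l_linenumber).

```python
import itertools
import sqlite3

# Toy instance (hypothetical data): (l_orderkey, l_linenumber,
# l_extendedprice, l_discount, l_quantity). Keys (1,1) and (2,1) are violated.
ROWS = [
    (1, 1, 100, 5, 10),
    (1, 1, 200, 5, 10),  # same key, both tuples satisfy the condition
    (2, 1, 100, 2, 10),
    (2, 1, 100, 2, 30),  # same key, this tuple violates l_quantity < 24
    (3, 1, 50, 4, 5),    # consistent key, satisfies the condition
    (4, 1, 70, 3, 40),   # consistent key, violates the condition
]

# Condensed variant of the rewriting (an assumption, not the thesis text):
# a key contributes its minimum to the lower bound only when none of its
# tuples violates the condition; it always contributes its maximum to the
# upper bound.
REWRITING = """
with candidatesSubQuery as (
  select distinct l_orderkey, l_linenumber
  from lineitem
  where l_quantity < 24
),
contribAllSubQuery as (
  select li.l_orderkey, li.l_linenumber,
         min(case when l_quantity < 24
                  then l_extendedprice * l_discount end) as min_revenue,
         max(case when l_quantity < 24
                  then l_extendedprice * l_discount end) as max_revenue,
         sum(case when l_quantity < 24 then 0 else 1 end) as cond_viol
  from lineitem li
  where exists (select 1 from candidatesSubQuery sc
                where li.l_orderkey = sc.l_orderkey
                  and li.l_linenumber = sc.l_linenumber)
  group by li.l_orderkey, li.l_linenumber
)
select sum(case when cond_viol = 0 then min_revenue else 0 end),
       sum(max_revenue)
from contribAllSubQuery
"""


def bounds_by_rewriting(rows):
    """Run the condensed rewriting on an in-memory SQLite database."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table lineitem (l_orderkey int, l_linenumber int, "
        "l_extendedprice int, l_discount int, l_quantity int)"
    )
    con.executemany("insert into lineitem values (?, ?, ?, ?, ?)", rows)
    low, high = con.execute(REWRITING).fetchone()
    con.close()
    return low, high


def bounds_by_repairs(rows):
    """Enumerate every repair (one tuple per key) and evaluate the query."""
    groups = {}
    for row in rows:
        groups.setdefault((row[0], row[1]), []).append(row)
    totals = [
        sum(p * d for (_, _, p, d, q) in repair if q < 24)
        for repair in itertools.product(*groups.values())
    ]
    return min(totals), max(totals)


if __name__ == "__main__":
    print(bounds_by_rewriting(ROWS), bounds_by_repairs(ROWS))
```

On this instance both functions agree. The enumeration is exponential in the number of violated keys, so it is useful only as a check on small inputs.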

TPC-H Query 7

select

supp_nation,

cust_nation,

l_year,

sum(volume) as revenue

from

(

select

n1.n_name as supp_nation,

n2.n_name as cust_nation,

year(l_shipdate) as l_year,

l_extendedprice * (1 - l_discount) as volume

from

supplier,

lineitem,

orders,

customer,

nation n1,

nation n2

where

s_suppkey = l_suppkey

and o_orderkey = l_orderkey

and c_custkey = o_custkey

and s_nationkey = n1.n_nationkey

and c_nationkey = n2.n_nationkey

and (

(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')

or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')

)

and l_shipdate >= '1995-01-01'

and l_shipdate <= '1996-12-31'

) as shipping

group by

supp_nation,

cust_nation,

l_year

order by

supp_nation,

cust_nation,

l_year;

Rewritten Query 7

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

supplier,

lineitem,

orders,

customer,

nation n1,

nation n2

where

s_suppkey = l_suppkey

and o_orderkey = l_orderkey

and c_custkey = o_custkey

and s_nationkey = n1.n_nationkey

and c_nationkey = n2.n_nationkey

and (

(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')

or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')

)

and l_shipdate >= '1995-01-01'

and l_shipdate <= '1996-12-31'

),

contribAllSubQuery as (

select

supp_nation,

cust_nation,

l_year,

min(volume) as low_revenue,

max(volume) as up_revenue,

condWhereSat,

condWhereViol,

max(rankProj) as countProj

from

(

select

l_orderkey,

l_linenumber,

n1.n_name as supp_nation,

n2.n_name as cust_nation,

year(l_shipdate) as l_year,

l_extendedprice * (1 - l_discount) as volume,

rank() over (partition by l_orderkey,l_linenumber

order by n1.n_name,n2.n_name,year(l_shipdate)) as rankProj,

sum(case

when (

s_suppkey = l_suppkey

and o_orderkey = l_orderkey

and c_custkey = o_custkey

and s_nationkey = n1.n_nationkey

and c_nationkey = n2.n_nationkey

and (

(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')

or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')

)

and l_shipdate >= '1995-01-01'

and l_shipdate <= '1996-12-31')

then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as condWhereViol,

case

when (

s_suppkey = l_suppkey

and c_custkey = o_custkey

and s_nationkey = n1.n_nationkey

and c_nationkey = n2.n_nationkey

and (

(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')

or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')

)

and l_shipdate >= '1995-01-01'

and l_shipdate <= '1996-12-31')

then 1 else 0 end as condWhereSat

from

lineitem li JOIN orders ON o_orderkey = l_orderkey

LEFT OUTER JOIN supplier ON s_suppkey = l_suppkey

LEFT OUTER JOIN nation n1 ON s_nationkey = n1.n_nationkey

LEFT OUTER JOIN customer ON c_custkey = o_custkey

LEFT OUTER JOIN nation n2 ON c_nationkey = n2.n_nationkey

where

exists

(select * from candidatesSubQuery sc

where li.l_orderkey=sc.l_orderkey

and li.l_linenumber=sc.l_linenumber)

) q

where condWhereSat=1

group by

l_orderkey,

l_linenumber,

supp_nation,

cust_nation,

l_year,

condWhereSat,condWhereViol),

contribConsistentSubQuery as (

select

supp_nation,

cust_nation,

l_year,

low_revenue,

up_revenue,

1 as countConsistent

from

contribAllSubQuery Cand

where condWhereViol = 0 and countProj = 1),

contribNonConsistentSubQuery as (

select

supp_nation,

cust_nation,

l_year,

0 as low_revenue,

up_revenue,

0 as countConsistent

from

contribAllSubQuery Cand

where condWhereViol >= 1 or countProj >1)

select

supp_nation,

cust_nation,

l_year,

sum(low_revenue) as low_sum_revenue,

sum(up_revenue) as up_sum_revenue

from

(select * from contribConsistentSubQuery

union all

select * from contribNonConsistentSubQuery) q

group by

supp_nation,

cust_nation,

l_year

having sum(countConsistent)>0

order by

supp_nation,

cust_nation,

l_year;

TPC-H Query 8

select

YEAR(o_orderdate) as o_year,

sum(case

when n2.n_name = 'BRAZIL' then l_extendedprice * (1 - l_discount)

else 0

end) / sum(l_extendedprice * (1 - l_discount)) as mkt_share

from

part,

supplier,

lineitem,

orders,

customer,

nation n1,

nation n2,

region

where

p_partkey = l_partkey

and s_suppkey = l_suppkey

and l_orderkey = o_orderkey

and o_custkey = c_custkey

and c_nationkey = n1.n_nationkey

and n1.n_regionkey = r_regionkey

and r_name = 'AMERICA'

and s_nationkey = n2.n_nationkey

and o_orderdate >= '1995-01-01'

and o_orderdate <= '1996-12-31'

and p_type = 'ECONOMY ANODIZED STEEL'

group by

YEAR(o_orderdate)

order by

YEAR(o_orderdate);

Rewritten Query 8

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

part,

supplier,

lineitem,

orders,

customer,

nation n1,

nation n2,

region

where

p_partkey = l_partkey

and s_suppkey = l_suppkey

and l_orderkey = o_orderkey

and o_custkey = c_custkey

and c_nationkey = n1.n_nationkey

and n1.n_regionkey = r_regionkey

and r_name = 'AMERICA'

and s_nationkey = n2.n_nationkey

and o_orderdate >= '1995-01-01'

and o_orderdate <= '1996-12-31'

and p_type = 'ECONOMY ANODIZED STEEL'

),

contribAllSubQuery as (

select

o_year,

min(dividend) as low_dividend,

max(dividend) as up_dividend,

min(divisor) as low_divisor,

max(divisor) as up_divisor,

condWhereSat,

condWhereViol,

max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj

from

(

select

l_orderkey,

l_linenumber,

YEAR(o_orderdate) as o_year,

case

when n2.n_name = 'BRAZIL' then l_extendedprice * (1 - l_discount)

else 0

end as dividend,

l_extendedprice * (1 - l_discount) as divisor,

rank() over (partition by l_orderkey,l_linenumber

order by YEAR(o_orderdate)) as rankProj,

sum(case when

(p_partkey = l_partkey

and s_suppkey = l_suppkey

and o_custkey = c_custkey

and c_nationkey = n1.n_nationkey

and n1.n_regionkey = r_regionkey

and r_name = 'AMERICA'

and s_nationkey = n2.n_nationkey

and o_orderdate >= '1995-01-01'

and o_orderdate <= '1996-12-31'

and p_type = 'ECONOMY ANODIZED STEEL')

then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as condWhereViol,

case when

(p_partkey = l_partkey

and s_suppkey = l_suppkey

and o_custkey = c_custkey

and c_nationkey = n1.n_nationkey

and n1.n_regionkey = r_regionkey

and r_name = 'AMERICA'

and s_nationkey = n2.n_nationkey

and o_orderdate >= '1995-01-01'

and o_orderdate <= '1996-12-31'

and p_type = 'ECONOMY ANODIZED STEEL')

then 1 else 0 end as condWhereSat

from

lineitem li JOIN orders ON l_orderkey = o_orderkey

LEFT OUTER JOIN supplier ON s_suppkey = l_suppkey

LEFT OUTER JOIN nation n2 ON s_nationkey = n2.n_nationkey

LEFT OUTER JOIN part ON p_partkey = l_partkey

LEFT OUTER JOIN customer ON o_custkey = c_custkey

LEFT OUTER JOIN nation n1 ON c_nationkey = n1.n_nationkey

LEFT OUTER JOIN region ON n1.n_regionkey = r_regionkey

where

exists (select * from candidatesSubQuery sc

where li.l_orderkey=sc.l_orderkey

and li.l_linenumber=sc.l_linenumber)

) q

where

condWhereSat=1

group by

l_orderkey,l_linenumber,o_year,condWhereSat,condWhereViol,rankProj),

contribConsistentSubQuery as (

select

o_year,

low_dividend,

up_dividend,

low_divisor,

up_divisor,

1 as countConsistent

from

contribAllSubQuery Cand

where condWhereViol = 0 and countProj=1),

contribNonConsistentSubQuery as (

select

o_year,

0 as low_dividend,

up_dividend,

0 as low_divisor,

up_divisor,

0 as countConsistent

from

contribAllSubQuery Cand

where condWhereViol >= 1 or countProj > 1)

select o_year,

sum(low_dividend)/sum(up_divisor) as low_mkt_share,

sum(up_dividend)/sum(low_divisor) as up_mkt_share

from

(select * from contribConsistentSubQuery

union all

select * from contribNonConsistentSubQuery) q

group by o_year

having sum(countConsistent)>0

order by o_year;
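
The final select of Rewritten Query 8 bounds a ratio of two sums by dividing the lower bound of the dividend by the upper bound of the divisor, and the upper bound of the dividend by the lower bound of the divisor. The Python sketch below works through this interval arithmetic on hypothetical data. It assumes non-negative values, a strictly positive lower bound for the divisor, and keys whose alternative tuples all satisfy the selection condition, so that the per-key minimum and maximum are the key's contributions. The names per_key_bounds, ratio_bounds and exact_ratios are introduced here for illustration only. The resulting interval contains the ratio obtained in every repair, although it need not be tight.

```python
import itertools
from fractions import Fraction

# Hypothetical data: for each key, the alternative (dividend, divisor) pairs
# found in the inconsistent database. A repair picks one pair per key.
KEYS = [
    [(10, 10), (0, 10)],   # conflicting tuples disagree on the dividend
    [(0, 20)],             # consistent key
    [(5, 5), (30, 30)],    # conflicting tuples disagree on both values
]


def per_key_bounds(alternatives):
    """Return (low_dividend, up_dividend, low_divisor, up_divisor) for a key."""
    dividends = [a[0] for a in alternatives]
    divisors = [a[1] for a in alternatives]
    return min(dividends), max(dividends), min(divisors), max(divisors)


def ratio_bounds(contribs):
    """Interval for sum(dividend)/sum(divisor) from per-key contributions."""
    low_dividend = sum(c[0] for c in contribs)
    up_dividend = sum(c[1] for c in contribs)
    low_divisor = sum(c[2] for c in contribs)
    up_divisor = sum(c[3] for c in contribs)
    if low_divisor <= 0:
        raise ValueError("lower bound of the divisor must be positive")
    return Fraction(low_dividend, up_divisor), Fraction(up_dividend, low_divisor)


def exact_ratios(keys):
    """Ratio obtained in each repair, by brute-force enumeration."""
    return [
        Fraction(sum(d for d, _ in repair), sum(v for _, v in repair))
        for repair in itertools.product(*keys)
    ]


BOUNDS = ratio_bounds([per_key_bounds(k) for k in KEYS])
RATIOS = exact_ratios(KEYS)

if __name__ == "__main__":
    print(BOUNDS, min(RATIOS), max(RATIOS))
```

Exact rational arithmetic (fractions.Fraction) is used so that the containment check is not affected by floating-point rounding.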

TPC-H Query 9

select

n_name,

YEAR(o_orderdate) as o_year,

sum(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) as sum_profit

from

part,

supplier,

lineitem,

partsupp,

orders,

nation

where

s_suppkey = l_suppkey

and ps_suppkey = l_suppkey

and ps_partkey = l_partkey

and p_partkey = l_partkey

and o_orderkey = l_orderkey

and s_nationkey = n_nationkey

and p_name like '%green%'

group by

n_name,

YEAR(o_orderdate)

order by

n_name,

YEAR(o_orderdate) desc;

Rewritten Query 9

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

part,

supplier,

lineitem,

orders,

nation

where

s_suppkey = l_suppkey

and p_partkey = l_partkey

and o_orderkey = l_orderkey

and s_nationkey = n_nationkey

and p_name like '%green%'

),

contribAllSubQuery as (

select

l_orderkey,

l_linenumber,

n_name as nation,

YEAR(o_orderdate) as o_year,

min(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) as min_profit,

max(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) as max_profit,

1 as min_count,

max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj,

cond_viol,

cond_sat

from

(select l_orderkey,

l_linenumber,

n_name,

o_orderdate,

l_extendedprice,

l_discount,

ps_supplycost,

l_quantity,

rank() over (partition by l_orderkey,l_linenumber

order by n_name,YEAR(o_orderdate)) as rankProj,

sum(case

when s_suppkey = l_suppkey

and ps_suppkey = l_suppkey

and ps_partkey = l_partkey

and p_partkey = l_partkey

and s_nationkey = n_nationkey

and p_name like '%green%'

then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as cond_viol,

case when

s_suppkey = l_suppkey

and ps_suppkey = l_suppkey

and ps_partkey = l_partkey

and p_partkey = l_partkey

and s_nationkey = n_nationkey

and p_name like '%green%'

then 1 else 0 end as cond_sat

from

lineitem l JOIN orders o1 ON l_orderkey=o1.o_orderkey

LEFT OUTER JOIN part ON p_partkey=l_partkey

LEFT OUTER JOIN supplier ON s_suppkey=l_suppkey

LEFT OUTER JOIN nation n1 ON n1.n_nationkey=s_nationkey

LEFT OUTER JOIN partsupp ON ps_partkey=l_partkey and ps_suppkey=l_suppkey

where

exists (select * from candidatesSubQuery sc

where l.l_orderkey=sc.l_orderkey

and l.l_linenumber=sc.l_linenumber)

) q

where cond_sat=1

group by

l_orderkey,

l_linenumber,

n_name,

o_orderdate,

cond_viol,cond_sat,rankProj),

contribConsistentSubQuery as (

select l_orderkey,

l_linenumber,

nation,

o_year,

min_profit,

max_profit,

min_count

from

contribAllSubQuery Cand

where

countProj = 1 and cond_viol=0),

contribNonConsistentSubQuery as (

select

l_orderkey,

l_linenumber,

nation,

o_year,

0 as min_profit,

max_profit,

0 as min_count

from

contribAllSubQuery Cand

where

countProj > 1 or cond_viol >= 1)

select

nation,

o_year,

sum(min_profit) as min_sum_profit,

sum(max_profit) as max_sum_profit

from

(select * from contribNonConsistentSubQuery

union all

select * from contribConsistentSubQuery) as q

group by

nation,

o_year

having sum(min_count)>0

order by

nation,

o_year desc;

TPC-H Query 10

select

c_custkey,

c_name,

sum(l_extendedprice * (1 - l_discount)) as revenue,

c_acctbal,

n_name,

c_address,

c_phone,

c_comment

from

customer,

orders,

lineitem,

nation

where

c_custkey = o_custkey

and l_orderkey = o_orderkey

and o_orderdate >= '1993-10-01'

and o_orderdate < date('1993-10-01') + 3 MONTHS

and l_returnflag = 'R'

and c_nationkey = n_nationkey

group by

c_custkey,

c_name,

c_acctbal,

c_phone,

n_name,

c_address,

c_comment

order by

revenue desc

fetch first 20 rows only;

Rewritten Query 10

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

customer,

orders,

lineitem,

nation

where

c_custkey = o_custkey

and l_orderkey = o_orderkey

and o_orderdate >= '1993-10-01'

and o_orderdate < date('1993-10-01') + 3 MONTHS

and l_returnflag = 'R'

and c_nationkey = n_nationkey

),

contribAllSubQuery as (

select

l_orderkey,

l_linenumber,

c_custkey,

c_name,

c_acctbal,

n_name,

c_address,

c_phone,

c_comment,

min(l_extendedprice * (1 - l_discount)) as min_revenue,

max(l_extendedprice * (1 - l_discount)) as max_revenue,

1 as min_count,

max (rankProj) over (partition by l_orderkey,l_linenumber) as countProj,

cond_viol,

cond_sat

from

(select l_orderkey,

l_linenumber,

c_custkey,

c_name,

c_acctbal,

n_name,

c_address,

c_phone,

c_comment,

l_extendedprice,

l_discount,

rank() over (partition by l_orderkey,l_linenumber

order by c_custkey,c_name,

c_acctbal,n_name, c_address, c_phone,

c_comment) as rankProj,

sum(case

when c_custkey = o_custkey

and o_orderdate >= '1993-10-01'

and o_orderdate < date('1993-10-01') + 3 MONTHS

and l_returnflag = 'R'

and c_nationkey = n_nationkey

then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as cond_viol,

case when c_custkey = o_custkey

and o_orderdate >= '1993-10-01'

and o_orderdate < date('1993-10-01') + 3 MONTHS

and l_returnflag = 'R'

and c_nationkey = n_nationkey

then 1 else 0 end as cond_sat

from

lineitem l JOIN orders on l_orderkey=o_orderkey

LEFT OUTER JOIN customer c1 ON c1.c_custkey=o_custkey

LEFT OUTER JOIN nation n1 ON n1.n_nationkey=c1.c_nationkey

where

exists (select * from candidatesSubQuery sc

where l.l_orderkey=sc.l_orderkey

and l.l_linenumber=sc.l_linenumber)

) q

where cond_sat=1

group by

l_orderkey,

l_linenumber,

c_custkey,

c_name,

c_acctbal,

c_phone,

n_name,

c_address,

c_comment,

cond_viol,cond_sat,rankProj),

contribConsistentSubQuery as (

select l_orderkey,

l_linenumber,

c_custkey,

c_name,

c_acctbal,

n_name,

c_address,

c_phone,

c_comment,

min_revenue,

max_revenue,

min_count

from

contribAllSubQuery Cand

where

countProj=1 and cond_viol=0),

contribNonConsistentSubQuery as (

select

l_orderkey,

l_linenumber,

c_custkey,

c_name,

c_acctbal,

n_name,

c_address,

c_phone,

c_comment,

0 as min_revenue,

max_revenue,

0 as min_count

from

contribAllSubQuery Cand

where

countProj > 1 or cond_viol >= 1)

select

c_custkey,

c_name,

sum(min_revenue) as min_sum_revenue,

sum(max_revenue) as max_sum_revenue,

c_acctbal,

n_name,

c_address,

c_phone,

c_comment

from

(select * from contribNonConsistentSubQuery

union all

select * from contribConsistentSubQuery) as q

group by

c_custkey,

c_name,

c_acctbal,

c_phone,

n_name,

c_address,

c_comment

having sum(min_count)>0

order by

min_sum_revenue desc

fetch first 20 rows only;

TPC-H Query 12

select

l_shipmode,

sum(case

when o_orderpriority = '1-URGENT'

or o_orderpriority = '2-HIGH'

then 1

else 0

end) as high_line_count,

sum(case

when o_orderpriority <> '1-URGENT'

and o_orderpriority <> '2-HIGH'

then 1

else 0

end) as low_line_count

from

orders,

lineitem

where

o_orderkey = l_orderkey

and l_shipmode in ('MAIL', 'SHIP')

and l_commitdate < l_receiptdate

and l_shipdate < l_commitdate

and l_receiptdate >= '1994-01-01'

and l_receiptdate < date('1994-01-01') + 1 YEAR

group by

l_shipmode

order by

l_shipmode;

Rewritten Query 12

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

orders,

lineitem

where

o_orderkey = l_orderkey

and l_shipmode in ('MAIL', 'SHIP')

and l_commitdate < l_receiptdate

and l_shipdate < l_commitdate

and l_receiptdate >= '1994-01-01'

and l_receiptdate < date('1994-01-01') + 1 YEAR

),

contribAllSubQuery as (

select l_orderkey,

l_linenumber,

l_shipmode,

min(case

when o_orderpriority = '1-URGENT'

or o_orderpriority = '2-HIGH'

then 1

else 0

end) as min_high_line_count,

max(case

when o_orderpriority = '1-URGENT'

or o_orderpriority = '2-HIGH'

then 1

else 0

end) as max_high_line_count,

min(case

when o_orderpriority <> '1-URGENT'

and o_orderpriority <> '2-HIGH'

then 1

else 0

end) as min_low_line_count,

max(case

when o_orderpriority <> '1-URGENT'

and o_orderpriority <> '2-HIGH'

then 1

else 0

end) as max_low_line_count,

1 as min_count,

max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj,

cond_viol,

cond_sat

from

(select l_orderkey,

l_linenumber,

l_shipmode,

o_orderpriority,

rank() over (partition by l_orderkey,l_linenumber

order by l_shipmode) as rankProj,

sum(case

when l_shipmode in ('MAIL', 'SHIP')

and l_commitdate < l_receiptdate

and l_shipdate < l_commitdate

and l_receiptdate >= '1994-01-01'

and l_receiptdate < date('1994-01-01') + 1 YEAR

then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as cond_viol,

case when l_shipmode in ('MAIL', 'SHIP')

and l_commitdate < l_receiptdate

and l_shipdate < l_commitdate

and l_receiptdate >= '1994-01-01'

and l_receiptdate < date('1994-01-01') + 1 YEAR

then 1 else 0 end as cond_sat

from orders JOIN lineitem l ON l_orderkey = o_orderkey

where

exists (select * from candidatesSubQuery sc

where l.l_orderkey=sc.l_orderkey

and l.l_linenumber=sc.l_linenumber)

) q

where cond_sat=1

group by

l_shipmode,

l_orderkey,

l_linenumber,

cond_viol,cond_sat,rankProj),

contribConsistentSubQuery as (

select l_orderkey,

l_linenumber,

l_shipmode,

min_high_line_count,

max_high_line_count,

min_low_line_count,

max_low_line_count,

min_count

from

contribAllSubQuery Cand

where

countProj=1 and cond_viol=0),

contribNonConsistentSubQuery as (

select

l_orderkey,

l_linenumber,

l_shipmode,

0 as min_high_line_count,

max_high_line_count,

0 as min_low_line_count,

max_low_line_count,

0 as min_count

from

contribAllSubQuery Cand

where

countProj > 1 or cond_viol >= 1)

select

l_shipmode,

sum(min_high_line_count) as sum_min_high_line_count,

sum(max_high_line_count) as sum_max_high_line_count,

sum(min_low_line_count) as sum_min_low_line_count,

sum(max_low_line_count) as sum_max_low_line_count

from

(select * from contribNonConsistentSubQuery

union all

select * from contribConsistentSubQuery) as q

group by

l_shipmode

having sum(min_count)>0

order by

l_shipmode;

TPC-H Query 14

select

100.00 * sum(case

when p_type like 'PROMO%'

then l_extendedprice * (1 - l_discount)

else 0

end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue

from

lineitem,

part

where

l_partkey = p_partkey

and l_shipdate >= '1995-09-01'

and l_shipdate < date('1995-09-01') + 30 DAYS;

Rewritten Query 14

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

lineitem,

part

where

l_partkey = p_partkey

and l_shipdate >= '1995-09-01'

and l_shipdate < date('1995-09-01') + 30 DAYS

),

contribAllSubQuery as (

select

min(dividend) as low_dividend,

max(dividend) as up_dividend,

min(divisor) as low_divisor,

max(divisor) as up_divisor,

condWhereSat,

condWhereViol

from

(

select

l_orderkey,

l_linenumber,

100.00 * case

when p_type like 'PROMO%'

then l_extendedprice * (1 - l_discount)

else 0

end as dividend,

l_extendedprice * (1 - l_discount) as divisor,

sum(case when

(l_partkey = p_partkey

and l_shipdate >= '1995-09-01'

and l_shipdate < date('1995-09-01') + 30 DAYS)

then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as condWhereViol,

case when

(l_partkey = p_partkey

and l_shipdate >= '1995-09-01'

and l_shipdate < date('1995-09-01') + 30 DAYS)

then 1 else 0 end as condWhereSat

from

lineitem li LEFT OUTER JOIN part ON l_partkey = p_partkey

where

exists (select * from candidatesSubQuery sc

where li.l_orderkey=sc.l_orderkey

and li.l_linenumber=sc.l_linenumber)

) q

where

condWhereSat=1

group by l_orderkey,l_linenumber,condWhereSat,condWhereViol),

contribConsistentSubQuery as (

select

low_dividend,

up_dividend,

low_divisor,

up_divisor,

1 as countConsistent

from

contribAllSubQuery Cand

where condWhereViol = 0),

contribNonConsistentSubQuery as (

select

0 as low_dividend,

up_dividend,

0 as low_divisor,

up_divisor,

0 as countConsistent

from

contribAllSubQuery Cand

where condWhereViol >= 1)

select sum(low_dividend)/sum(up_divisor) as low_promo_revenue,

sum(up_dividend)/sum(low_divisor) as up_promo_revenue

from

(select * from contribConsistentSubQuery

union all

select * from contribNonConsistentSubQuery) q

having sum(countConsistent)>0;

TPC-H Query 19

select

sum(l_extendedprice* (1 - l_discount)) as revenue

from

lineitem,

part

where

(

p_partkey = l_partkey

and p_brand = 'Brand#12'

and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')

and l_quantity >= 1 and l_quantity <= 1 + 10

and p_size between 1 and 5

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

or

(

p_partkey = l_partkey

and p_brand = 'Brand#23'

and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')

and l_quantity >= 10 and l_quantity <= 10 + 10

and p_size between 1 and 10

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

or

(

p_partkey = l_partkey

and p_brand = 'Brand#34'

and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')

and l_quantity >= 20 and l_quantity <= 20 + 10

and p_size between 1 and 15

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

);

Rewritten Query 19

with candidatesSubQuery as (

select

l_orderkey,

l_linenumber

from

lineitem,

part

where

(

p_partkey = l_partkey

and p_brand = 'Brand#12'

and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')

and l_quantity >= 1 and l_quantity <= 1 + 10

and p_size between 1 and 5

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

or

(

p_partkey = l_partkey

and p_brand = 'Brand#23'

and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')

and l_quantity >= 10 and l_quantity <= 10 + 10

and p_size between 1 and 10

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

or

(

p_partkey = l_partkey

and p_brand = 'Brand#34'

and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')

and l_quantity >= 20 and l_quantity <= 20 + 10

and p_size between 1 and 15

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

),

contribAllSubQuery as (

select min(revenue) as low_revenue,

max(revenue) as up_revenue,

condWhereViol,

condWhereSat

from

(select

l_orderkey,

l_linenumber,

l_extendedprice* (1 - l_discount) as revenue,

sum (case when (

(

p_partkey = l_partkey

and p_brand = 'Brand#12'

and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')

and l_quantity >= 1 and l_quantity <= 1 + 10

and p_size between 1 and 5

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

or

(

p_partkey = l_partkey

and p_brand = 'Brand#23'

and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')

and l_quantity >= 10 and l_quantity <= 10 + 10

and p_size between 1 and 10

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

or

(

p_partkey = l_partkey

and p_brand = 'Brand#34'

and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')

and l_quantity >= 20 and l_quantity <= 20 + 10

and p_size between 1 and 15

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

) then 0 else 1 end)

over (partition by l_orderkey,l_linenumber) as condWhereViol,

case when (

(

p_partkey = l_partkey

and p_brand = 'Brand#12'

and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')

and l_quantity >= 1 and l_quantity <= 1 + 10

and p_size between 1 and 5

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

or

(

p_partkey = l_partkey

and p_brand = 'Brand#23'

and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')

and l_quantity >= 10 and l_quantity <= 10 + 10

and p_size between 1 and 10

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

or

(

p_partkey = l_partkey

and p_brand = 'Brand#34'

and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')

and l_quantity >= 20 and l_quantity <= 20 + 10

and p_size between 1 and 15

and l_shipmode in ('AIR', 'AIR REG')

and l_shipinstruct = 'DELIVER IN PERSON'

)

) then 1 else 0 end as condWhereSat

from

lineitem li

LEFT OUTER JOIN part ON p_partkey = l_partkey

where

exists (select * from candidatesSubQuery sc

where li.l_orderkey=sc.l_orderkey

and li.l_linenumber=sc.l_linenumber)

) q

where condWhereSat = 1

group by l_orderkey,l_linenumber,condWhereViol,condWhereSat),

contribConsistentSubQuery as (

select low_revenue,

up_revenue,

1 as countConsistent

from

contribAllSubQuery Cand

where condWhereViol = 0),

contribNonConsistentSubQuery as (

select

0 as low_revenue,

up_revenue,

0 as countConsistent

from

contribAllSubQuery Cand

where condWhereViol >= 1)

select sum(low_revenue) as sum_low_revenue,

sum(up_revenue) as sum_up_revenue

from

(select * from contribConsistentSubQuery

union all

select * from contribNonConsistentSubQuery) q

having sum(countConsistent)>0;
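
The way the final blocks of the rewritten queries combine consistent and non-consistent contributions can be illustrated outside SQL. The following Python sketch is not part of ConQuer; it mimics, under simplifying assumptions, the pattern of Rewritten Query 19 for a single sum aggregate. The function name range_sum and the (key, value, satisfies) row layout are hypothetical: key stands for (l_orderkey, l_linenumber), value for the aggregated expression, and satisfies for condWhereSat. As in the listings, a non-consistent key contributes 0 to the lower bound, which presumes non-negative values.

```python
from collections import defaultdict


def range_sum(rows):
    """Return (lower, upper) bounds for a sum, or None if no key is consistent.

    rows is an iterable of (key, value, satisfies) triples, one per joined
    tuple. Hypothetical layout; values are assumed to be non-negative.
    """
    # Group the joined tuples by key, as "partition by l_orderkey,l_linenumber" does.
    clusters = defaultdict(list)
    for key, value, satisfies in rows:
        clusters[key].append((value, satisfies))

    lower = 0.0
    upper = 0.0
    consistent_keys = 0
    for members in clusters.values():
        # Only tuples satisfying the condition are aggregated (condWhereSat = 1).
        sat_values = [v for v, ok in members if ok]
        if not sat_values:
            continue
        # Number of tuples of this key that violate the condition (condWhereViol).
        violations = sum(1 for _, ok in members if not ok)
        upper += max(sat_values)
        if violations == 0:
            # Consistent key: its smallest value is guaranteed in every repair.
            lower += min(sat_values)
            consistent_keys += 1
        # Otherwise the key contributes 0 to the lower bound.

    # Mirrors "having sum(countConsistent) > 0".
    if consistent_keys == 0:
        return None
    return lower, upper


rows = [
    ((1, 1), 10.0, True),   # key (1,1): two satisfying tuples, no violation
    ((1, 1), 12.0, True),
    ((2, 1), 5.0, True),    # key (2,1): one satisfying and one violating tuple
    ((2, 1), 7.0, False),
    ((3, 1), 4.0, False),   # key (3,1): no satisfying tuple, ignored
]
print(range_sum(rows))  # (10.0, 17.0)
```

Key (1,1) contributes 10 to the lower bound and 12 to the upper bound; key (2,1) contributes only to the upper bound because one of its tuples violates the condition.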

Appendix B

Design Advisor Indices

The following are the indices suggested by DB2’s Design Advisor for the inconsistent databases of the scalability experiment.

Inconsistent database size 1 GB, p = 5%, n = 2

-- index[1], 4.118MB

CREATE INDEX "DB2ADMIN"."IDX609161412250000" ON "DB2ADMIN"."CUSTOMER" ("C_MKTSEGMENT" ASC, "C_CUSTKEY" ASC)

ALLOW REVERSE SCANS ;

-- index[2], 2.884MB

CREATE INDEX "DB2ADMIN"."IDX609161413270000" ON "DB2ADMIN"."CUSTOMER" ("C_NATIONKEY" ASC, "C_CUSTKEY" ASC)

ALLOW REVERSE SCANS ;

-- index[3], 28.802MB

CREATE INDEX "DB2ADMIN"."IDX609161413220000" ON "DB2ADMIN"."ORDERS" ("O_CUSTKEY" ASC, "O_ORDERKEY" ASC)

ALLOW REVERSE SCANS ;

-- index[4], 0.196MB

CREATE INDEX "DB2ADMIN"."IDX609161412310000" ON "DB2ADMIN"."SUPPLIER" ("S_NATIONKEY" ASC, "S_SUPPKEY" ASC)

ALLOW REVERSE SCANS ;

-- index[5], 10.751MB

CREATE INDEX "DB2ADMIN"."IDX609161415040000" ON "DB2ADMIN"."PART" ("P_TYPE" ASC, "P_PARTKEY" ASC)

ALLOW REVERSE SCANS ;

-- index[6], 0.013MB

CREATE INDEX "DB2ADMIN"."IDX609161413420000" ON "DB2ADMIN"."REGION" ("R_NAME" ASC, "R_REGIONKEY" ASC)

ALLOW REVERSE SCANS ;

-- index[7], 0.013MB

CREATE INDEX "DB2ADMIN"."IDX609161413510000" ON "DB2ADMIN"."NATION" ("N_REGIONKEY" ASC, "N_NATIONKEY" ASC)

ALLOW REVERSE SCANS ;

-- index[8], 28.282MB

CREATE INDEX "DB2ADMIN"."IDX609161415260000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC, "O_ORDERDATE" ASC)

ALLOW REVERSE SCANS ;

-- index[9], 39.142MB

CREATE INDEX "DB2ADMIN"."IDX609161417270000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,

"O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[10], 63.013MB

CREATE INDEX "DB2ADMIN"."IDX609161421380000" ON "DB2ADMIN"."LINEITEM" ("L_SHIPDATE" ASC, "L_LINENUMBER" ASC,

"L_ORDERKEY" ASC) ALLOW REVERSE SCANS ;

-- index[11], 4.118MB

CREATE INDEX "DB2ADMIN"."IDX609161422320000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC, "C_MKTSEGMENT" ASC)

ALLOW REVERSE SCANS ;

-- index[12], 2.884MB

CREATE INDEX "DB2ADMIN"."IDX609161424190000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC, "C_NATIONKEY" ASC)

ALLOW REVERSE SCANS ;

-- index[13], 28.802MB

CREATE INDEX "DB2ADMIN"."IDX609161424240000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC, "O_CUSTKEY" ASC)

ALLOW REVERSE SCANS ;

-- index[14], 39.142MB

CREATE INDEX "DB2ADMIN"."IDX609161425360000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC, "O_ORDERDATE" ASC,

"O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[15], 131.692MB

CREATE INDEX "DB2ADMIN"."IDX609161431410000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC, "L_QUANTITY" ASC)

ALLOW REVERSE SCANS ;

-- index[16], 9.919MB

CREATE INDEX "DB2ADMIN"."IDX609161432260000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC, "P_SIZE" ASC,

"P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;

Inconsistent database size 2 GB, p = 5%, n = 2

-- index[1], 103.747MB

CREATE INDEX "DB2ADMIN"."IDX609161434540000" ON "DB2ADMIN"."LINEITEM" ("L_DISCOUNT" ASC,

"L_SHIPDATE" ASC, "L_QUANTITY" ASC, "L_EXTENDEDPRICE" ASC) ALLOW REVERSE SCANS ;

-- index[2], 210.294MB

CREATE INDEX "DB2ADMIN"."IDX609161446330000" ON "DB2ADMIN"."LINEITEM" ("L_QUANTITY" ASC, "L_SHIPDATE" ASC,

"L_DISCOUNT" ASC, "L_EXTENDEDPRICE" ASC, "L_LINENUMBER" ASC, "L_ORDERKEY" ASC) ALLOW REVERSE SCANS ;

-- index[3], 396.028MB

CREATE INDEX "DB2ADMIN"."IDX609161447560000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC, "L_SUPPKEY" ASC,

"L_ORDERKEY" ASC, "L_LINENUMBER" ASC) ALLOW REVERSE SCANS ;

-- index[4], 263.388MB

CREATE INDEX "DB2ADMIN"."IDX609161453480000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC, "L_QUANTITY" ASC)

ALLOW REVERSE SCANS ;

Inconsistent database size 3 GB, p = 5%, n = 2

-- index[1], 12.341MB

CREATE INDEX "DB2ADMIN"."IDX609161458010000" ON "DB2ADMIN"."CUSTOMER" ("C_MKTSEGMENT" ASC,

"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[2], 135.380MB

CREATE INDEX "DB2ADMIN"."IDX609161458080000" ON "DB2ADMIN"."LINEITEM" ("L_DISCOUNT" ASC, "L_SHIPDATE" ASC,

"L_QUANTITY" ASC, "L_EXTENDEDPRICE" ASC) ALLOW REVERSE SCANS ;

-- index[3], 8.646MB

CREATE INDEX "DB2ADMIN"."IDX609161459360000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,

"C_NATIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[4], 84.853MB

CREATE INDEX "DB2ADMIN"."IDX609161459280000" ON "DB2ADMIN"."ORDERS" ("O_CUSTKEY" ASC,

"O_ORDERKEY" ASC) ALLOW REVERSE SCANS ;

-- index[5], 0.583MB

CREATE INDEX "DB2ADMIN"."IDX609161458370000" ON "DB2ADMIN"."SUPPLIER" ("S_NATIONKEY" ASC,

"S_SUPPKEY" ASC) ALLOW REVERSE SCANS ;

-- index[6], 0.013MB

CREATE INDEX "DB2ADMIN"."IDX609161459480000" ON "DB2ADMIN"."REGION" ("R_NAME" ASC,

"R_REGIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[7], 0.013MB

CREATE INDEX "DB2ADMIN"."IDX609161459570000" ON "DB2ADMIN"."NATION" ("N_REGIONKEY" ASC,

"N_NATIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[8], 8.646MB

CREATE INDEX "DB2ADMIN"."IDX609161459580000" ON "DB2ADMIN"."CUSTOMER" ("C_NATIONKEY" ASC,

"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[9], 32.243MB

CREATE INDEX "DB2ADMIN"."IDX609161501100000" ON "DB2ADMIN"."PART" ("P_TYPE" ASC,

"P_PARTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[10], 84.853MB

CREATE INDEX "DB2ADMIN"."IDX609161501320000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,

"O_ORDERDATE" ASC) ALLOW REVERSE SCANS ;

-- index[11], 26.794MB

CREATE INDEX "DB2ADMIN"."IDX609161502520000" ON "DB2ADMIN"."PARTSUPP" ("PS_PARTKEY" ASC, "PS_SUPPKEY" ASC,

"PS_SUPPLYCOST" ASC) ALLOW REVERSE SCANS ;

-- index[12], 12.341MB

CREATE INDEX "DB2ADMIN"."IDX609161506500000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,

"C_MKTSEGMENT" ASC) ALLOW REVERSE SCANS ;

-- index[13], 84.853MB

CREATE INDEX "DB2ADMIN"."IDX609161508420000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,

"O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[14], 115.099MB

CREATE INDEX "DB2ADMIN"."IDX609161509520000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,

"O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[15], 395.075MB

CREATE INDEX "DB2ADMIN"."IDX609161515590000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC,

"L_QUANTITY" ASC) ALLOW REVERSE SCANS ;

-- index[16], 29.751MB

CREATE INDEX "DB2ADMIN"."IDX609161516440000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC, "P_SIZE" ASC,

"P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;

Inconsistent database size 5 GB, p = 5%, n = 2

-- mqt[1], 1430.329MB

CREATE SUMMARY TABLE "DB2ADMIN"."MQT609161518000000" AS (SELECT Q4.C0 AS "C0", Q4.C1 AS "C1",

Q4.C2 AS "C2", Q4.C5 AS "C3", Q4.C4 AS "C4", Q4.C3 AS "C5", Q4.C6 AS "C6"

FROM TABLE(SELECT Q3.C0 AS "C0", SUM(Q3.C1) AS "C1", SUM(Q3.C2) AS "C2", Q3.C5 AS "C3", Q3.C4 AS "C4",

Q3.C3 AS "C5", COUNT(*) AS "C6"

FROM TABLE(SELECT Q1.L_SHIPMODE AS "C0",

CASE WHEN ((Q2.O_ORDERPRIORITY = '1-URGENT ') OR (Q2.O_ORDERPRIORITY = '2-HIGH ')) THEN 1 ELSE 0 END AS "C1",

CASE WHEN ((Q2.O_ORDERPRIORITY <> '1-URGENT ') AND (Q2.O_ORDERPRIORITY <> '2-HIGH ')) THEN 1 ELSE 0 END AS "C2",

Q1.L_RECEIPTDATE AS "C3", Q1.L_SHIPDATE AS "C4", Q1.L_COMMITDATE AS "C5"

FROM DB2ADMIN.LINEITEM AS Q1, DB2ADMIN.ORDERS AS Q2

WHERE (Q2.O_ORDERKEY = Q1.L_ORDERKEY)) AS Q3

GROUP BY Q3.C3, Q3.C4, Q3.C5, Q3.C0) AS Q4)

DATA INITIALLY DEFERRED REFRESH IMMEDIATE IN USERSPACE1 ;

-- index[1], 990.099MB

CREATE INDEX "DB2ADMIN"."IDX609161532510000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC,

"L_SUPPKEY" ASC, "L_ORDERKEY" ASC, "L_LINENUMBER" ASC) ALLOW REVERSE SCANS ;

-- index[2], 49.587MB

CREATE INDEX "DB2ADMIN"."IDX609161539360000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC, "P_SIZE" ASC,

"P_CONTAINER" ASC, "P_BRAND" ASC) ALLOW REVERSE SCANS ;

-- index[3], 164.485MB

CREATE INDEX "DB2ADMIN"."IDX609161539380000" ON "DB2ADMIN"."MQT609161518000000"

("C3" ASC, "C0" ASC, "C5" ASC, "C4" ASC) ALLOW REVERSE SCANS ;

Inconsistent database size 10 GB, p = 5%, n = 2

-- index[1], 41.126MB

CREATE INDEX "DB2ADMIN"."IDX609161543010000" ON "DB2ADMIN"."CUSTOMER" ("C_MKTSEGMENT" ASC,

"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[2], 373.618MB

CREATE INDEX "DB2ADMIN"."IDX609161543080000" ON "DB2ADMIN"."LINEITEM" ("L_DISCOUNT" ASC,

"L_SHIPDATE" ASC, "L_QUANTITY" ASC, "L_EXTENDEDPRICE" ASC) ALLOW REVERSE SCANS ;

-- index[3], 1.923MB

CREATE INDEX "DB2ADMIN"."IDX609161543370000" ON "DB2ADMIN"."SUPPLIER" ("S_NATIONKEY" ASC,

"S_SUPPKEY" ASC) ALLOW REVERSE SCANS ;

-- index[4], 282.829MB

CREATE INDEX "DB2ADMIN"."IDX609161544310000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,

"O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[5], 28.802MB

CREATE INDEX "DB2ADMIN"."IDX609161544360000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,

"C_NATIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[6], 0.013MB

CREATE INDEX "DB2ADMIN"."IDX609161544480000" ON "DB2ADMIN"."REGION" ("R_NAME" ASC,

"R_REGIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[7], 0.013MB

CREATE INDEX "DB2ADMIN"."IDX609161544570000" ON "DB2ADMIN"."NATION" ("N_REGIONKEY" ASC,

"N_NATIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[8], 28.802MB

CREATE INDEX "DB2ADMIN"."IDX609161544580000" ON "DB2ADMIN"."CUSTOMER" ("C_NATIONKEY" ASC,

"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[9], 107.474MB

CREATE INDEX "DB2ADMIN"."IDX609161546100000" ON "DB2ADMIN"."PART" ("P_TYPE" ASC,

"P_PARTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[10], 230.290MB

CREATE INDEX "DB2ADMIN"."IDX609161547520000" ON "DB2ADMIN"."PARTSUPP" ("PS_PARTKEY" ASC, "PS_SUPPKEY" ASC,

"PS_SUPPLYCOST" ASC) ALLOW REVERSE SCANS ;

-- index[11], 282.829MB

CREATE INDEX "DB2ADMIN"."IDX609161546320000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,

"O_ORDERDATE" ASC) ALLOW REVERSE SCANS ;

-- index[12], 1565.091MB

CREATE INDEX "DB2ADMIN"."IDX609161550570000" ON "DB2ADMIN"."LINEITEM" ("L_ORDERKEY" ASC,

"L_LINENUMBER" ASC, "L_SHIPDATE" ASC) ALLOW REVERSE SCANS ;

-- index[13], 41.126MB

CREATE INDEX "DB2ADMIN"."IDX609161551500000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,

"C_MKTSEGMENT" ASC) ALLOW REVERSE SCANS ;

-- index[14], 282.829MB

CREATE INDEX "DB2ADMIN"."IDX609161552080000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,

"O_ORDERKEY" DESC) ALLOW REVERSE SCANS ;

-- index[15], 1.923MB

CREATE INDEX "DB2ADMIN"."IDX609161554170000" ON "DB2ADMIN"."SUPPLIER" ("S_SUPPKEY" ASC,

"S_NATIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[16], 383.646MB

CREATE INDEX "DB2ADMIN"."IDX609161554520000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,

"O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[17], 99.169MB

CREATE INDEX "DB2ADMIN"."IDX609161601440000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC,

"P_SIZE" ASC, "P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;

Inconsistent database size 20 GB, p = 5%, n = 2

-- index[1], 990.228MB

CREATE INDEX "DB2ADMIN"."IDX609161604350000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,

"O_SHIPPRIORITY" ASC, "O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[2], 82.243MB

CREATE INDEX "DB2ADMIN"."IDX609161605050000" ON "DB2ADMIN"."CUSTOMER" ("C_MKTSEGMENT" ASC,

"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[3], 575.935MB

CREATE INDEX "DB2ADMIN"."IDX609161606040000" ON "DB2ADMIN"."ORDERS" ("O_CUSTKEY" ASC,

"O_ORDERKEY" ASC) ALLOW REVERSE SCANS ;

-- index[4], 3.841MB

CREATE INDEX "DB2ADMIN"."IDX609161605130000" ON "DB2ADMIN"."SUPPLIER" ("S_NATIONKEY" ASC,

"S_SUPPKEY" ASC) ALLOW REVERSE SCANS ;

-- index[5], 57.599MB

CREATE INDEX "DB2ADMIN"."IDX609161606090000" ON "DB2ADMIN"."CUSTOMER" ("C_NATIONKEY" ASC,

"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[6], 214.939MB

CREATE INDEX "DB2ADMIN"."IDX609161607460000" ON "DB2ADMIN"."PART" ("P_TYPE" ASC,

"P_PARTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[7], 0.013MB

CREATE INDEX "DB2ADMIN"."IDX609161606240000" ON "DB2ADMIN"."REGION" ("R_NAME" ASC,

"R_REGIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[8], 0.013MB

CREATE INDEX "DB2ADMIN"."IDX609161606330000" ON "DB2ADMIN"."NATION" ("N_REGIONKEY" ASC,

"N_NATIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[9], 178.575MB

CREATE INDEX "DB2ADMIN"."IDX609161609310000" ON "DB2ADMIN"."PARTSUPP" ("PS_PARTKEY" ASC,

"PS_SUPPKEY" ASC, "PS_SUPPLYCOST" ASC) ALLOW REVERSE SCANS ;

-- index[10], 575.935MB

CREATE INDEX "DB2ADMIN"."IDX609161608110000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,

"O_ORDERDATE" ASC) ALLOW REVERSE SCANS ;

-- index[11], 782.731MB

CREATE INDEX "DB2ADMIN"."IDX609161610160000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,

"O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;

-- index[12], 919.810MB

CREATE INDEX "DB2ADMIN"."IDX609161614260000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC,

"L_SHIPINSTRUCT" ASC, "L_SHIPMODE" ASC, "L_QUANTITY" ASC) ALLOW REVERSE SCANS ;

-- index[13], 82.243MB

CREATE INDEX "DB2ADMIN"."IDX609161615250000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,

"C_MKTSEGMENT" ASC) ALLOW REVERSE SCANS ;

-- index[14], 1462.771MB

CREATE INDEX "DB2ADMIN"."IDX609161615450000" ON "DB2ADMIN"."LINEITEM" ("L_LINENUMBER" ASC,

"L_RECEIPTDATE" ASC, "L_COMMITDATE" ASC, "L_ORDERKEY" ASC) ALLOW REVERSE SCANS ;

-- index[15], 575.935MB

CREATE INDEX "DB2ADMIN"."IDX609161615430000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,

"O_ORDERKEY" DESC) ALLOW REVERSE SCANS ;

-- index[16], 57.599MB

CREATE INDEX "DB2ADMIN"."IDX609161617140000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,

"C_NATIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[17], 3.841MB

CREATE INDEX "DB2ADMIN"."IDX609161617540000" ON "DB2ADMIN"."SUPPLIER" ("S_SUPPKEY" ASC,

"S_NATIONKEY" ASC) ALLOW REVERSE SCANS ;

-- index[18], 0.013MB

CREATE INDEX "DB2ADMIN"."IDX609161622150000" ON "DB2ADMIN"."NATION" ("N_NATIONKEY" ASC,

"N_NAME" ASC) ALLOW REVERSE SCANS ;

-- index[19], 198.333MB

CREATE INDEX "DB2ADMIN"."IDX609161625390000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC,

"P_SIZE" ASC, "P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;