Efficient Query Processing Over Inconsistent Databases
by
Ariel Damian Fuxman
A thesis submitted in conformity with the requirements for the degree of Ph.D. in Computer Science
Graduate Department of Computer Science
University of Toronto
Copyright © 2007 by Ariel Damian Fuxman
Abstract
Efficient Query Processing Over Inconsistent Databases
Ariel Damian Fuxman
Ph.D. in Computer Science
Graduate Department of Computer Science
University of Toronto
2007
Although integrity constraints have long been used to maintain data consistency, there
are situations in which they may not be enforced or satisfied. In this thesis, we present
ConQuer, a system for efficient and scalable answering of SQL queries on databases
that may violate a set of constraints. ConQuer permits users to postulate a set of key
constraints together with their queries. The system rewrites the queries to retrieve all
(and only) data that is consistent with respect to the constraints. The rewriting is into
SQL, so the rewritten queries can be efficiently optimized and executed by commercial
database systems.
The problem of obtaining consistent answers for primary key constraints and Select-
Project-Join (SPJ) queries is known to be intractable in general. However, we identify
a large and practical class of SPJ queries for which the problem is tractable. For this
class of queries, we provide a query rewriting algorithm that can be executed in linear
time in the size of the query. We consider SPJ queries that may have either set or bag
semantics. For the latter case, the queries may also have grouping and aggregation. We
show the maximality of the class of queries, in the sense that minimal relaxations of its
conditions may lead to intractability. Finally, we study the efficiency and scalability of the
query rewritings on a commercial database system. The study shows that the overhead
of the rewritings is reasonable, when we consider the original (non-rewritten) queries
as a baseline. The experiments use representative queries from TPC-H (the standard
benchmark for decision support systems) and databases of up to 20 GB.
Acknowledgements
First and foremost, I would like to thank my supervisor, Renee J. Miller, for her constant
encouragement and support. During these years, I have benefited tremendously from her
remarkable vision and experience. She has been the greatest mentor, always available for
discussion and guidance. I will always be grateful for the endless hours she devoted to
reading and correcting my drafts, and for the numerous times she stayed at the university
until very late to help me out before conference deadlines.
I am grateful to the members of my committee (John Mylopoulos, Mariano Consens,
and Thodoros Topaloglou) for thoroughly reading my thesis and for their valuable feed-
back. I also thank Leopoldo Bertossi for serving as the external reviewer of the thesis, and
for coming to Canada during his sabbatical in Chile with the sole purpose of attending
my thesis defense.
I am indebted to Alberto Mendelzon, who sadly passed away the year before I com-
pleted my thesis. Alberto was not only an outstanding researcher, but also the warmest
and most generous person. At the beginning of my stay in Canada, I needed a job
offer in order to obtain permanent resident status. Alberto hardly knew me at that time
(I was then not even a member of the Database Group), but as soon as he heard about
my situation, he offered me a position as Research Associate in his group.
In 2004, I had the opportunity of visiting Phokion Kolaitis and Wang-Chiew Tan at
University of California at Santa Cruz. It was a joy to work with both of them. They
were also wonderful hosts, and I thank them for their hospitality. During the summer of
2005, I did an internship with the Clio group at IBM Almaden, working with Mauricio
Hernandez, Lucian Popa, and Howard Ho. I very much enjoyed my time at Almaden,
where I had an opportunity to learn how research is done at an industrial lab. Special
thanks go to Mauricio for his unwavering support during the internship.
For the implementation of the ConQuer system, I received invaluable help from my
brother Diego. I “convinced” him to do his final undergraduate project on the topic of
consistent query answering, and his contribution was fundamental for the demo that we
gave at VLDB in Trondheim. Diego, I am proud of your work! I also thank Jiang Du for
his help in building up the experimental framework used in Chapter 7.
Many people helped to make these years in Toronto a very enjoyable experience. I
especially thank the “Latin American gang” (Sebastian Sardina, Andres Lagar-Cavilla,
Carlos Hurtado, Blas Melissari, Flavio Rizzolo, Pablo Sala, and many others) for their
friendship. I will always remember our long, heated debates at the Graduate Lounge,
which gained us the reputation of being the loudest group of people in the Department.
I am also grateful to Patricia Rodriguez Gianolli for her support during the last year of
my Ph.D.
And last, but definitely not least, I would like to thank my parents, Silvia and Miguel,
and my brothers, Adrian and Diego, for always being there, despite the distance: without
their love and support none of this would have ever been possible.
Contents
1 Introduction
1.1 Motivation
1.2 Consistent Query Answering
1.3 Contributions
1.4 Organization of the Document
2 Formal Framework
2.1 Repairs
2.2 Query Answering Semantics
2.3 Query Rewritings
2.4 Constraints
2.5 Related Work
3 Rewritings for Conjunctive Queries
3.1 A Broad Class of First-Order Rewritable Queries
3.1.1 Notation for Conjunctive Queries
3.1.2 Join Graph
3.1.3 The Class Cforest of First-Order Rewritable Queries
3.2 Query Rewriting Algorithm
3.3 Correctness of the Algorithm
3.3.1 Properties of Repairs
3.3.2 A Structural Property of Cforest
3.3.3 A “Pessimistic” Repair
3.3.4 Correctness of RewriteLocal
3.3.5 Correctness of RewriteTree
3.3.6 Correctness of RewriteForest
3.4 Related Work
4 Rewritings for Queries with Grouping and Aggregation
4.1 Formal Language
4.2 Algorithms
4.2.1 Queries with Bag Semantics
4.2.2 Queries with the sum, min, and max Functions
4.3 Correctness of the Algorithms
4.3.1 Building Upon First-Order Rewritings
4.3.2 An “Optimistic” Repair
4.3.3 Sound Ranges
4.3.4 Tight Ranges
4.3.5 Putting It All Together
4.4 Related Work
5 Complexity-Theoretic Analysis
5.1 Minimal Relaxations of Cforest
5.2 A Dichotomy Result
5.2.1 The Class C∗
5.2.2 Basic Intractable Cases
5.2.3 Generalizing the Basic Cases
5.3 Related Work
6 ConQuer: System Implementation and SQL Rewritings
6.1 System Implementation
6.2 ConQuer Rewritings for Queries without Aggregation
6.2.1 Rewriting Algorithm
6.2.2 Examples
6.3 ConQuer Rewritings for SPJ Queries with Aggregation
6.3.1 Rewriting Algorithm
6.3.2 Examples
6.4 Exploiting Precomputed Annotations
6.5 Related Work
7 Experimental Analysis
7.1 Experimental Framework
7.1.1 System and Database Manager Configuration
7.1.2 Inconsistent Database Instances
7.1.3 Workload
7.2 Experimental Results
7.2.1 Scalability
7.2.2 Effect of Degree of Inconsistency
8 Conclusions and Future Work
Bibliography
A TPC-H Queries and their Rewritings
B Design Advisor Indices
Chapter 1
Introduction
1.1 Motivation
The presence of inconsistent data is known to be a major problem in enterprises. How-
ever, data analysts often make business decisions based on inconsistent data; and their
database systems rarely give any warning or indication about this situation. In fact,
current database management systems are largely unable to give such a warning because
they rely upon the fundamental assumption that the underlying data is consistent. In
this thesis, we tackle this problem by providing a set of tools that enable users to obtain
meaningful answers from databases even if they are partially inconsistent.
Integrity constraints have long been used by database management systems in order
to maintain data consistency. The typical data design process focuses on developing a set
of constraints that ensure that every possible database reflects a valid, consistent state
of the world. However, integrity constraints may not always be enforced or satisfied for
a number of reasons. For example, when data is integrated from multiple sources, each
source may satisfy a constraint (for example, a key constraint), but the merged data may
not (for example, if the same key value exists in multiple sources). More generally, when
data is exchanged between independently designed sources with different constraints, the
exchanged data may not satisfy the constraints of the destination schema. As another
example, in some environments, checking the consistency of constraints may be too ex-
pensive, particularly for workloads with high update rates. Hence, the database may
become inconsistent with respect to the (unenforced) integrity constraints. In addition
to these long-standing problems, the trend toward autonomous computing is making the
need to manage inconsistent data more acute. In autonomous environments, we can no
longer assume that data are married with a single set of constraints that define their
semantics. As constraints are used in an increasing number of roles (from modelling
the query capabilities of a system, to defining mappings between independent sources),
there is an increasing number of applications in which data must be used with a set
of independently designed constraints. In such applications, a static approach where
consistency (with respect to a fixed set of constraints) is enforced on the database may
not be appropriate. Rather, a dynamic approach in which inconsistent data is tolerated,
but consistency is taken into account at query time, permits the constraints to evolve
independently from the data.
One strategy for managing inconsistent databases is data cleaning [DJ03]. Data
cleaning techniques seek to identify and correct errors in the data, and can be used to
restore an inconsistent database to a consistent state. Data cleaning, when applicable,
can be very successful. However, it is necessarily a semiautomatic process, which makes
it infeasible or unaffordable for some applications. Furthermore, committing to a single
cleaning strategy may not always be appropriate. A user may wish to experiment with
different cleaning strategies, or may desire to retain all data, even inconsistent data,
for tasks such as lineage tracing. Finally, data cleaning is only applicable to data that
contains errors. However, the violation of a constraint may also indicate that the data
contains exceptions, that is, clean data which simply does not satisfy a constraint.
In this thesis, we consider inconsistent databases that may violate a set of primary
key constraints. This type of constraint, together with foreign key constraints, is the
most commonly used in commercial database systems. Furthermore, databases that
violate primary key constraints are ubiquitous in enterprises. For example, in the domain
of Customer Relationship Management (CRM), data sources often contain conflicting
information about the same customer. Notably, commercial CRM tools provide limited
support for merging tuples corresponding to the same customer into one tuple in the
integrated database. Although they typically support some form of conflict resolution
rules (e.g., rules that take the average between two conflicting incomes of the same
customer), these rules may be difficult to design. In the absence of conflict resolution
rules, some CRM tools transfer all conflicting tuples to the integrated database. Thus,
even if the sources satisfy the key constraints, the integrated database may not.
1.2 Consistent Query Answering
While it is well known how to answer queries over consistent databases, we must give
a clear and precise semantics to the notion of a “meaningful” answer obtained from an
inconsistent database. In this thesis, we make use of a semantics based upon the notions
of possible worlds and certain answers, concepts that are widely used not only in the
context of database theory and data integration [Lip79, Lip81, AKG87, AD98], but also
in the field of knowledge representation [Lev81, Moo85]. These notions were first adapted
to the context of inconsistent databases by Arenas, Bertossi and Chomicki [ABC99], who
defined the semantics of consistent query answers.
The semantics of consistent query answers relies on the intuition that an inconsistent
database can be cleaned (or “repaired”) by adding or deleting tuples in such a way that
the resulting database satisfies some given integrity constraints. The semantics is agnostic
about which tuples should be added or removed. Therefore, each inconsistent database
may be associated to more than one clean, consistent database. A consistent answer is
then an answer that is obtained from every possible consistent database. Intuitively, this
means that the consistent answers are obtained no matter how the database is cleaned.
The semantics of consistent query answers provides a sound and elegant basis for the
study of the problem of query answering over inconsistent databases. However, despite
considerable work on its theoretical underpinnings [ABC99, CB00, ABC+03b, CLR03a,
CLR03b, BB03a, BB03b, CM05], to the best of our knowledge, little work has been
done on its practical applications. A key contribution of this thesis is to bridge the
gap between theory and practice by providing an efficient and scalable system to obtain
consistent query answers from inconsistent databases. In particular, we report the design
and evaluation of ConQuer, a system for managing inconsistent data.¹ In ConQuer, a
user may postulate a set of integrity constraints, possibly at query time, and the system
automatically retrieves all (and only) the query answers that are consistent with respect
to the constraints. ConQuer also helps users take advantage of the query results in order
to interactively clean the inconsistent database.
¹ ConQuer stands for Consistent Querying. ConQuer’s web page can be found at www.cs.toronto.edu/db/conquer.
      emplKey   salary
t1    John      1000
t2    John      2000
t3    Mary      1000
Figure 1.1: An inconsistent database
The major challenge in consistent query answering is the potentially huge number
of consistent databases that can be associated with a given inconsistent database. In
the case of primary key constraints, which are the focus of this thesis, the number of
consistent databases is exponential in the size of the inconsistent database. This problem
is tackled in ConQuer by implementing a query rewriting approach. Given a query q,
ConQuer rewrites q into another query Q that has the following property: for every incon-
sistent database, the rewritten query Q retrieves the consistent answers for the original
query q. The rewriting is done independently of the data, and works on every inconsistent
database. This approach has two fundamental advantages. First, it avoids constructing
the (potentially huge number of) consistent databases associated with the inconsistent
database. Second, the rewritten query is a SQL query that can be executed using any
commercial relational database management system (in ConQuer, we use IBM’s DB2).
In an extensive set of experiments, reported in Chapter 7, we show that the overhead
in the execution of the rewritten queries is reasonable, when compared to the original
(non-rewritten) ones.
In the next example, we illustrate the semantics of consistent answers and the query
rewriting approach.
Example 1.1. Consider the database of Figure 1.1, which contains information about
employees and their salaries. In particular, the schema of the database has one relation
called employee, with two attributes: emplKey (the name of the employee) and salary.
Assume that a user specifies that the key of the relation should be the attribute
emplKey. Note that the database violates this key constraint, perhaps because its data
has been integrated from many operational sources. In particular, there are two tuples
for employee John, one stating that he makes a salary of 1000, and the other stating that
he makes a salary of 2000. Suppose that we do not know which one of these alternatives is
correct, but we still want to be able to draw meaningful answers from the database. Let
us consider the consistent databases (i.e., databases that satisfy the key constraint) that
can be built from the inconsistent database. We would like these databases to be not
only consistent, but also “as close as possible” to the inconsistent database. This leaves
us with two possible consistent databases (shown in Figure 1.2), obtained by deleting
exactly one tuple for John in each of them.
Consistent database 1          Consistent database 2
      emplKey   salary               emplKey   salary
t1    John      1000           t2    John      2000
t3    Mary      1000           t3    Mary      1000
Figure 1.2: Consistent databases for the inconsistent database of Figure 1.1
Consider a query q1 that retrieves the employees whose salary is less than
or equal to 1000.
q1: select distinct emplKey
    from employee
    where salary <= 1000
If we execute this query directly over the inconsistent database, we obtain {John, Mary}.
Intuitively, this is not a “consistent” answer because it may be the case that John has a
salary over 1000. In fact, if the consistent database turns out to be the database on the
right-hand side of Figure 1.2, then John would not appear in the answer.
One strategy to obtain the “consistent answer” would be to apply query q1 to each
of the consistent databases of Figure 1.2. While this may be feasible in this simple
example, it is clearly impractical when the number of tuples violating the constraint
grows. In particular, even for the schema and single constraint of this example, the
number of consistent databases is exponential in the size of the inconsistent database.
For this reason, in ConQuer, we never build the consistent databases explicitly. Instead,
we follow a query rewriting approach, where we rewrite the original query (q1 in this
case) into another query that can be executed directly on the inconsistent database and
is guaranteed to always return the consistent answers for the original query.
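The naive strategy described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not part of ConQuer: the helper names (`repairs`, `q1`, `consistent_answers`) are hypothetical, and the sketch assumes a single key constraint on the first attribute, so that each consistent database keeps exactly one tuple per key value.

```python
from itertools import product

# Inconsistent instance of Figure 1.1 as (emplKey, salary) tuples.
employee = [("John", 1000), ("John", 2000), ("Mary", 1000)]

def repairs(rows):
    """Yield every consistent database for a key on the first attribute.

    Assumption of this sketch: one tuple is kept per key value.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    # One choice per key group; the number of combinations is the
    # product of the group sizes, which is what makes this approach blow up.
    for choice in product(*groups.values()):
        yield set(choice)

def q1(db):
    """q1: employees whose salary is at most 1000."""
    return {name for name, salary in db if salary <= 1000}

def consistent_answers(query, rows):
    """Keep only the answers returned on every consistent database."""
    results = [query(r) for r in repairs(rows)]
    return set.intersection(*results)

print(q1(employee))                      # evaluated on the inconsistent database
print(consistent_answers(q1, employee))  # evaluated over all consistent databases
```

On the instance of Figure 1.1, the direct evaluation contains both John and Mary, while the intersection over the two consistent databases contains only Mary.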
In this case, it is quite simple to obtain a rewriting of q1. Notice that John appears
associated with two different salaries in the inconsistent database: one satisfying the
query, the other not. This suggests that in the rewriting we should return the employees
that satisfy q1 (i.e., have a salary less than or equal to 1000) in every tuple of the
inconsistent database where they appear. This can be obtained using the following
query:
Q1: select distinct e.emplKey
    from employee e
    where e.salary <= 1000
    and not exists (select *
                    from employee e2
                    where e2.emplKey = e.emplKey
                    and e2.salary > 1000)
Notice the use of a nested subquery introduced by not exists. The purpose of this
subquery is to filter out those key values that satisfy q1 in some tuples, but violate it in
others. In our example, this subquery filters John out of the answer because he appears
in tuple t2 with a salary above 1000.
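To see the effect of the rewriting, the two queries can be run side by side on the instance of Figure 1.1. The sketch below uses Python's built-in sqlite3 module purely for convenience (the thesis itself uses IBM DB2). The table is created without a declared key so that the conflicting tuples can be loaded, and the alias e2 is an arbitrary name chosen for the inner copy of the relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key is declared, so the inconsistent data can be stored as-is.
conn.execute("CREATE TABLE employee (emplKey TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000)],
)

# Original query q1.
q1 = "SELECT DISTINCT emplKey FROM employee WHERE salary <= 1000"

# Rewritten query Q1: discard key values that violate the condition
# in at least one of their tuples.
Q1 = """
SELECT DISTINCT e.emplKey
FROM employee e
WHERE e.salary <= 1000
  AND NOT EXISTS (SELECT *
                  FROM employee e2
                  WHERE e2.emplKey = e.emplKey
                    AND e2.salary > 1000)
"""

original = sorted(row[0] for row in conn.execute(q1))
rewritten = sorted(row[0] for row in conn.execute(Q1))
print(original)
print(rewritten)
```

The original query returns John and Mary, whereas the rewritten query returns only Mary, without any consistent database ever being built.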
Despite the simplicity of the previous example, it has been shown in the literature
[CLR03a, CM05] that there are Select-Project-Join queries for which there is no rewriting
into SQL (under a very likely complexity-theoretic assumption). However, we observe
that the presence of these negative results does not necessarily preclude the existence of
classes of queries for which there is a SQL rewriting. In fact, in Chapter 3, we show a
large and practical class of Select-Project-Join queries for which there is a SQL rewriting.
In Chapter 5, we show that this is a maximal class of queries, in the sense that minimal
relaxations of its conditions lead to queries for which there is no SQL rewriting.
Most of the previous work on consistent query answering (except [ABC+03b]) focuses
on queries with set semantics and no aggregation. However, practical query languages
like SQL have bag semantics (duplicates are not eliminated unless explicitly requested),
and support aggregation functions and grouping of results. In Chapter 2, we present
a generalization of the semantics of consistent answers for queries with bag semantics,
grouping and aggregation. In Chapter 4, we provide query rewritings that work under
this semantics.
In the thesis, we are concerned not only with the correctness of the rewritings (i.e.,
ensuring that they retrieve all and only the consistent answers), but also with their
efficiency when executed using existing database technology. We address efficiency issues
and their empirical validation in Chapters 6 and 7.
1.3 Contributions
The main contributions of this thesis are the following:
• We identify a large and practical class of Select-Project-Join queries for which the
problem of computing consistent answers is tractable. The class consists of queries
that can have two kinds of joins. First, they can have joins between key attributes.
Second, they can have joins from non-key attributes of a relation (possibly a foreign
key) to the primary key of another relation. Arguably, these two types of joins are
the most commonly used in practice (and certainly the most common in industry
standard benchmarks like TPC-H). (Chapter 3)
• For the class of tractable queries that we identify, we provide a query rewriting algo-
rithm that produces a query in first-order logic that returns the consistent answers.
The algorithm runs in linear time in the size of the query. The rewritings
are sound and complete, in the sense that they return all (and only) the consistent
answers. Since first-order queries can be written in SQL, the rewritings in first-
order logic are a first step towards reusing existing commercial database technology.
This work was first published at the International Conference on Database Theory
(ICDT) [FM05], and an extended journal version has been invited to the Journal
of Computer and System Sciences (JCSS) [FM06]. (Chapter 3)
• We consider not only Select-Project-Join queries with set semantics, but also queries
with bag semantics, grouping and aggregation. These extensions are needed to en-
able practical use in decision support applications. For this purpose, we extend
the semantics of consistent answers originally proposed by Arenas, Bertossi and
Chomicki [ABC99, ABC+03b]. We provide sound and complete algorithms under
this semantics for the most common SQL aggregation functions (count, min,
max, sum). This work has been published at the ACM International Conference
on the Management of Data (SIGMOD) [FFM05a]. (Chapters 2 and 4)
• We show a large class of Select-Project-Join queries for which the conditions of
applicability of our rewriting algorithm are not only sufficient but also necessary.
In particular, we show a class in which the problem of computing the consistent
answers is coNP-complete (and, assuming P ≠ NP, inexpressible in first-order logic)
for every query of the class that violates the conditions of the class of queries for
which we give a rewriting algorithm. This type of result is stronger than the com-
plexity results given in the consistent query answering literature [CLR03a, CM05],
which consist of showing intractability of a class by exhibiting at least one query for
which the problem is intractable. As a corollary of our result, we get a dichotomy
for this class of queries: given a query q in our class, either the problem of comput-
ing the consistent answers for q is first-order rewritable (and thus it is in PTIME),
or it is a coNP-complete problem. (Chapter 5)
• We present the implementation of ConQuer, a system for querying inconsistent
databases. We also explain in detail the SQL rewritings produced by the system.
ConQuer has been demonstrated at the International Conference on Very Large
Databases (VLDB) [FFM05b]. (Chapter 6)
• We study the running time of ConQuer’s SQL rewritings on a commercial database
system, in particular IBM DB2. To this end, we present a detailed performance
study using the data and queries of the TPC-H decision support benchmark. The
study focuses on the overhead of the rewritings, using the original (non-rewritten)
queries as a baseline. We study the scalability of the approach (with databases of
up to 172 million tuples), and the effect of the degree of inconsistency (in terms
of the percentage of tuples that are inconsistent and the number of conflicting
tuples per key value). The experiments show that our approach can be applied to
large databases, several orders of magnitude larger than those considered in other
approaches for querying inconsistent databases. (Chapter 7)
1.4 Organization of the Document
The rest of this document is organized as follows. In Chapter 2, we present the formal
framework for querying inconsistent databases that will be used throughout the thesis.
In Chapters 3 and 4, we present query rewritings and focus on proving their correctness.
In Chapter 3, we consider a large and practical class of conjunctive queries (that is,
Select-Project-Join queries) and present rewritings in first-order logic. In Chapter 4, we
consider queries with bag semantics, grouping and aggregation, and present rewritings
in an extension of first-order logic with grouping and aggregation functions. In Chapter
5, we show the maximality of the class of queries that is the input to the rewriting
algorithms.
In Chapter 6, we present ConQuer, a system for efficiently querying inconsistent
databases. We present in detail the SQL query rewritings produced by ConQuer for
queries with and without aggregation. The efficiency of these rewritings is empirically
validated in Chapter 7 with an extensive set of experiments. We present related work in
separate sections at the end of each of the chapters. In Chapter 8, we finish the document
with conclusions and directions for future work.
Chapter 2
Formal Framework
In this chapter, we present the formal framework that will be used throughout the thesis.
In this framework, an inconsistent database is associated with a space of consistent
databases called repairs. In Section 2.1, we formally define the notion of repair. Then, in
Section 2.2, we introduce the semantics for query answering over inconsistent databases.
This semantics involves the exploration of all repairs of an inconsistent database. Since
the number of repairs can be very large, in this thesis we advocate a query rewriting
approach, where queries are rewritten in such a way that their consistent answer can be
obtained by posing another query directly on the inconsistent database, without explicitly
building any repair. In Section 2.3, we formally define the notion of a query rewriting.
Finally, in Section 2.4, we introduce the integrity constraints that are the focus of this
thesis.
2.1 Repairs
A schema R is a finite collection of relation symbols, each of which has an associated
arity. A database instance (or database) I over R is a function that associates each
relation symbol r of R to a relation I(r). A relation I(r) of arity k is a set of k-tuples
whose elements belong to some underlying fixed domain.¹ Whenever it is clear from
context, we will abuse notation and use the same symbol r to denote both a relation
symbol and a relation. Given a tuple t⃗ occurring in relation I(r), we denote by r(t⃗) the
association between t⃗ and r.
¹ Although we will consider both set and bag semantics for queries, we always assume the relations of a database instance (including inconsistent databases) to be sets.
A database instance I is consistent with respect to a set of integrity constraints Σ if
I satisfies Σ in the standard model-theoretic sense, that is, I ⊨ Σ. (As is customary, an
integrity constraint may be any first-order formula [AHV95].) Throughout this thesis,
we will consider databases that may violate a given set of integrity constraints. That is,
given R and a set of integrity constraints Σ over R, a database I may be inconsistent with
respect to Σ, that is, I ⊭ Σ.
Intuitively, we will assume that an inconsistent database can be cleaned (or “re-
paired”) by adding or deleting tuples in such a way that the resulting database satisfies
the given integrity constraints. We will be agnostic about which tuples should be added
or removed. Therefore, each inconsistent database may be associated to more than one
possible clean, consistent database. Furthermore, no matter how the clean databases are
obtained, we would like them to be “as close as possible” to the original, inconsistent
database (that is, to minimize the number of tuples that are added or removed). We will
call each consistent database a repair.
The notion of repair was originally introduced by Arenas, Bertossi and Chomicki
[ABC99]. A repair is a database instance that satisfies the given integrity constraints,
and which has a minimal distance to the inconsistent database. The distance between
two database instances I and I′ is defined as their symmetric difference, i.e., ∆(I, I′) =
(I − I′) ∪ (I′ − I). The formal definition of repair is the following.
Definition 2.1 (Repair [ABC99]). Let I be a database instance, and Σ be a set of
integrity constraints. We say that an instance ℐ is a repair of I with respect to Σ if:2
• ℐ ⊨ Σ, and
• there is no instance I′ such that I′ ⊨ Σ and ∆(I, I′) ⊂ ∆(I, ℐ) (i.e., ∆(I, ℐ) is
minimal under set inclusion in the class of instances that satisfy Σ).
Example 2.1. Let R be a schema with one relation symbol employee. Assume that
employee has two attributes: emplKey (the name of the employee) and salary, and
that the only constraint in Σ is that attribute emplKey is the key of relation employee.
Let I = {employee(John, 1000), employee(John, 2000), employee(Mary, 1000)}. The
database I is inconsistent with respect to Σ because it violates the key constraint stating
that every employee has exactly one salary.
2Whenever Σ is clear from the context, we will just say that ℐ is a repair of I.
There are two repairs of I wrt Σ: I1 = {employee(John, 1000), employee(Mary, 1000)}
and I2 = {employee(John, 2000), employee(Mary, 1000)}. Notice that, according to
Definition 2.1, the databases {employee(John, 2000)} and {employee(Mary, 1000)} are
not repairs because their distance with respect to I is not minimal under set inclusion.
The minimality condition for repairs is crucial in the definition. Otherwise, the empty
set would trivially be a repair of every database that violates a set of key constraints.
Notice that repairs do not need to be unique. For example, if the given set of con-
straints consists of key dependencies, the number of repairs can be exponential in the
size of the inconsistent database.
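To make the exponential growth concrete, the following Python sketch enumerates the repairs of a single relation under one key constraint. It relies on the observation that, for key constraints, a repair keeps exactly one tuple from each group of tuples that share a key value. The function name repairs, the parameter key_len, and the tuple-based encoding of relations are assumptions made for this illustration; they are not part of the thesis.

```python
from itertools import product


def repairs(relation, key_len=1):
    """Enumerate the repairs of a relation (a set of tuples) under a key
    constraint on its first key_len attributes (illustrative encoding)."""
    groups = {}
    for t in relation:
        # Group tuples by their key value.
        groups.setdefault(t[:key_len], []).append(t)
    # A repair keeps exactly one tuple from every group.
    return [frozenset(choice) for choice in product(*groups.values())]


# The instance I of Example 2.1.
employee = {("John", 1000), ("John", 2000), ("Mary", 1000)}
for r in repairs(employee):
    print(sorted(r))

# Ten key values with two conflicting tuples each give 2**10 repairs.
big = {(k, v) for k in range(10) for v in (0, 1)}
print(len(repairs(big)))  # 1024
```

Explicit enumeration of this kind is only feasible for tiny instances, which is why the rest of the chapter avoids materializing repairs.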
2.2 Query Answering Semantics
The notion of repair can be used to give a precise meaning to query answering over
inconsistent databases. Intuitively, each repair corresponds to one particular way of
cleaning the database. Since we are agnostic about how the database should be cleaned,
it makes sense to consider the answers that would be obtained from every repair. This
notion is formalized with the concept of consistent answers, which we define next.
Definition 2.2 (Consistent Answer [ABC99]). Let R be a schema. Let Σ be a set
of integrity constraints. Let I be an instance over R (possibly inconsistent with respect
to Σ). Let q be a query over R. We say that a tuple ~t is a consistent answer for q with
respect to Σ if ~t ∈ q(ℐ) for every repair ℐ of I with respect to Σ. We denote this as
~t ∈ consistentΣ(q, I).
This definition was originally given by Arenas, Bertossi and Chomicki [ABC99]. It is
based on the semantics of certain answers [Lip79, Lip81, AKG87] that has been used in
database theory, and possible worlds, which is well-known in knowledge representation
[Lev81]. In the case of consistent answers, the space of possible worlds corresponds to
the repairs of the inconsistent database.
Example 2.1. (continued) Consider a query that retrieves all the employees from
the database, expressed as q1(e) = ∃s.employee(e, s). Recall that there are two
repairs of I wrt Σ: I1 = {employee(John, 1000), employee(Mary, 1000)} and I2 =
{employee(John, 2000), employee(Mary, 1000)}. The result of applying q1 on both I1
and I2 is {(John), (Mary)}. Thus, the consistent answers for q1 on I are the tuples
(John) and (Mary).
Now, consider a query that retrieves employees together with their salaries, expressed
as q2(e, s) = employee(e, s). Notice that q2 is the identity on the repairs. Thus, the
consistent answers can be obtained as the intersection of I1 and I2. Consequently, the only
consistent answer for q2 on I is (Mary, 1000). Notice that the tuples (John, 1000) and
(John, 2000) are not consistent answers, because neither of them is present
in both repairs. Intuitively, this reflects the fact that John’s salaries are inconsistent data,
and we do not want to retrieve possibly erroneous results.
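Definition 2.2 can be checked by brute force on this small example. The Python sketch below evaluates a query on every repair and intersects the results. The helper names (repairs, consistent_answers, q1, q2) and the encoding of queries as Python functions are assumptions made for illustration; this is the naive semantics, not the rewriting approach developed in the thesis.

```python
from itertools import product


def repairs(relation, key_len=1):
    """Repairs under a key on the first key_len attributes: one tuple per key."""
    groups = {}
    for t in relation:
        groups.setdefault(t[:key_len], []).append(t)
    return [frozenset(choice) for choice in product(*groups.values())]


def consistent_answers(query, relation, key_len=1):
    """Tuples that the query returns on every repair (Definition 2.2)."""
    results = [set(query(r)) for r in repairs(relation, key_len)]
    return set.intersection(*results)


def q1(db):
    # q1(e) = exists s. employee(e, s)
    return {(e,) for (e, s) in db}


def q2(db):
    # q2(e, s) = employee(e, s)
    return set(db)


I = {("John", 1000), ("John", 2000), ("Mary", 1000)}
print(consistent_answers(q1, I))  # John and Mary
print(consistent_answers(q2, I))  # only (Mary, 1000)
```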
For convenience, we will use the following notation for the consistent answers of
Boolean queries.
Definition 2.3. Let R be a schema. Let Σ be a set of integrity constraints. Let
I be a database instance over R. Let q be a Boolean query over R. We say that
consistentΣ(q, I) = true if for every repair ℐ of I with respect to Σ, ℐ ⊨ q. We
say that consistentΣ(q, I) = false if there exists at least one repair ℐ of I with respect
to Σ such that ℐ ⊭ q.
Notice the asymmetry between the case for consistentΣ(q, I) = true and
consistentΣ(q, I) = false. While for the former, every repair must satisfy the query,
for the latter it suffices to have just one non-satisfying repair. This is not intrinsic to
Boolean queries: by Definition 2.2, it is also the case that ~t ∉ consistentΣ(q, I) if there
exists at least one repair ℐ of I such that ~t ∉ q(ℐ).
The definition of consistent answers is independent of the language used to express
the input query q, and it makes perfect sense for queries that, for example, return tuples
from the active domain of the database. However, for queries that compute aggregates
over groups of tuples, it may be useful to relax this definition, as we motivate next.
Example 2.1. (continued) Let q3(s, v) be a SQL query that counts the number of
occurrences of each salary in the database:
select salary as s, count(*) as v
from employee
group by salary
Recall that there are two repairs of I with respect to Σ: I1 = {employee(John, 1000),
employee(Mary, 1000)} and I2 = {employee(John, 2000), employee(Mary, 1000)}. The
result of applying query q3 to the repairs is the following: q3(I1) = {(1000, 2)}, and
q3(I2) = {(1000, 1), (2000, 1)}. Since the intersection of these results is empty, according
to Definition 2.2, the set of consistent answers for q3 is empty. However, notice that the
salary 1000 appears in every query result (but together with a different number for the
count of occurrences). Intuitively, it would be desirable to report this salary in the result.
In the previous example, the value 1000 appears in every query result. However, it
appears with a different count in each of them. How do we report this count?
In the semantics that we define next, we employ tight bounds
for this purpose. In this particular example, we will say that the minimum (greatest
lower bound) is one, since the salary 1000 occurs exactly once in repair I2, so that
(1000, 1) ∈ q3(I2); and that the maximum (lowest upper bound) is two, since the salary
1000 occurs exactly twice in repair I1, so that (1000, 2) ∈ q3(I1).
In the following definition, we formalize this notion. The definition applies to any query
that computes an aggregate over a group (in our example, the aggregate is the count
of occurrences of each salary). We will denote with aggconsistentΣ(q, I) the modified
semantics for consistent answers for a query q on an instance I with respect to a set of
constraints Σ.
Definition 2.4 (Consistent Answer for Queries with Aggregation). Let R be
a schema. Let Σ be a set of integrity constraints. Let I be a database instance over
R. Let q be a query over R with free variables ~z and v, where v is a variable over a
numeric domain (possibly computed by an aggregate function). We say that (~t, glb, lub) ∈
aggconsistentΣ(q, I) if all the following conditions hold:
• for every repair ℐ of I wrt Σ, there is some d such that (~t, d) ∈ q(ℐ) and glb ≤ d ≤ lub; and
• there is some repair ℐ of I wrt Σ such that (~t, glb) ∈ q(ℐ); and
• there is some repair ℐ of I wrt Σ such that (~t, lub) ∈ q(ℐ).
We also say that glb is the greatest lower bound of ~t in q, and that lub is the lowest
upper bound of ~t in q.
This definition is particularly well suited to the case of queries with bag semantics,
grouping and aggregation, which are prevalent in practice. For instance, consider the
query q3(s, v) of Example 2.1:
select salary as s, count(*) as v
from employee
group by salary
In this case, q3 has free variables s and v. The variable s corresponds to the attribute
salary, on which there is a grouping condition; the numerical argument v, for which we
give tight ranges, corresponds to the result of count(*). Essentially, for a query q(~z, v),
aggconsistentΣ(q, I) gives the consistent answers on I with respect to Σ for each value
of ~z (the salary in our example), together with a tight range for the possible associated
numerical values.
Example 2.1. (continued) Let us obtain the aggconsistentΣ answers for q3 on I. Recall
that the result of applying q3 to the repairs of the inconsistent database is: q3(I1) =
{(1000, 2)}, and q3(I2) = {(1000, 1), (2000, 1)}. Then, we have that aggconsistentΣ(q3, I) =
{(1000, 1, 2)}. This means that the salary 1000 appears in every query result, and the
value of count(*) for 1000 has a greatest lower bound of one and a lowest upper bound
of two. Notice that the salary 2000 does not appear in aggconsistentΣ(q3, I). The intuitive
reason is that 2000 is not a consistent answer, since it does not occur in repair I1.
According to the definition of aggconsistentΣ above, 2000 is not in the answer because
it fails to satisfy the first condition of Definition 2.4. This condition is violated because
I1 is a repair such that (2000, d) ∉ q3(I1), for every d.
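The range semantics of Definition 2.4 can also be computed naively by enumerating repairs. In the Python sketch below, a group is reported only if it occurs in the query result on every repair, together with the minimum and maximum aggregate value observed across repairs. The names repairs, q3 and agg_consistent are assumptions of this illustration, and each group is encoded as a single value rather than a tuple for brevity.

```python
from collections import Counter
from itertools import product


def repairs(relation, key_len=1):
    """Repairs under a key on the first key_len attributes: one tuple per key."""
    groups = {}
    for t in relation:
        groups.setdefault(t[:key_len], []).append(t)
    return [frozenset(choice) for choice in product(*groups.values())]


def q3(db):
    # select salary as s, count(*) as v from employee group by salary
    return set(Counter(s for (_, s) in db).items())


def agg_consistent(query, relation, key_len=1):
    """Naive evaluation of Definition 2.4 over all repairs."""
    per_repair = [dict(query(r)) for r in repairs(relation, key_len)]
    # Condition 1: the group must appear in the result on every repair.
    common = set.intersection(*(set(d) for d in per_repair))
    # Conditions 2 and 3: min and max are attained in some repair.
    return {
        (g, min(d[g] for d in per_repair), max(d[g] for d in per_repair))
        for g in common
    }


I = {("John", 1000), ("John", 2000), ("Mary", 1000)}
print(agg_consistent(q3, I))  # {(1000, 1, 2)}
```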
To the best of our knowledge, the problem of computing consistent answers for queries
with aggregation has only been studied before by Arenas et al. [ABC+03b]. In particular,
they were the first to propose a generalization of the semantics of consistent answers,
where ranges rather than exact values are returned. In their work, they consider a class
of SQL queries with no grouping, no selection conditions (i.e., no conditions in the where
clause) and on exactly one relation. In Chapter 4, we will present results for a much
larger class of queries. For the class of queries considered by Arenas et al., our semantics and
theirs coincide. However, we need to extend their semantics in order to be able to
deal with grouping.
2.3 Query Rewritings
The definition of consistent answers introduced in the previous section involves the explo-
ration of a potentially huge number of repairs (in the case of keys, it can be exponential in
the size of the inconsistent database). In this thesis, we approach this problem by design-
ing algorithms that compute consistent answers directly from the inconsistent database,
without explicitly building the repairs. Given a query q, our algorithms will return an-
other query Q such that, for every instance I, the consistent answers for the original
query q can be obtained by just evaluating Q on I. We call Q a query rewriting for the
problem of computing the consistent answers of q.
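As a small illustration of the idea, the Python sketch below uses the standard sqlite3 module to evaluate a hand-written SQL rewriting of the query q2(e, s) = employee(e, s) from Example 2.1 directly on the inconsistent instance. The rewriting keeps a tuple only if no other tuple with the same key carries a different salary. It is written by hand for this one quantifier-free query and should not be read as the output of the rewriting algorithms presented in later chapters; the table layout and variable names are likewise assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The key on emplKey is deliberately NOT declared, so violations can be stored.
conn.execute("CREATE TABLE employee (emplKey TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000)],
)

# Hand-written rewriting of q2: drop tuples whose key has a conflicting salary.
REWRITTEN_Q2 = """
    SELECT DISTINCT e.emplKey, e.salary
    FROM employee e
    WHERE NOT EXISTS (
        SELECT 1 FROM employee e2
        WHERE e2.emplKey = e.emplKey AND e2.salary <> e.salary
    )
"""
rows = conn.execute(REWRITTEN_Q2).fetchall()
print(rows)  # [('Mary', 1000)]
```

The result agrees with the consistent answer obtained earlier by intersecting the two repairs, but no repair is ever constructed.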
In order to give a formal definition of query rewriting, we first define the computa-
tional problems associated to computing consistent answers using the consistentΣ and
aggconsistentΣ operators (the latter for the case in which the query computes numerical
values over a group of tuples).
Definition 2.5. Let R be a schema. Let q be a query over R. Let Σ be a set of integrity
constraints.
The problem CONSISTENT(q, Σ) is the following: given an instance I over R, and
tuple ~t, is it the case that ~t ∈ consistentΣ(q, I)?
The problem AGGCONSISTENT(q, Σ) is the following: given an instance I over R, tuple
~t and real numbers glb and lub, is it the case that (~t, glb, lub) ∈ aggconsistentΣ(q, I)?
We can now define the notion of query rewriting for the problems CONSISTENT(q, Σ)
and AGGCONSISTENT(q, Σ). The definition is given for a fixed (but unspecified) query
language L.
Definition 2.6 (L-query rewriting). Let R be a schema. Let Σ be a set of integrity
constraints. Let q be a query over R. Let Q be a query expressed in a query language L
(possibly different from the language used to express q).
We say that Q is an L-rewriting of CONSISTENT(q, Σ) if for every instance I over R
and tuple ~t, ~t ∈ Q(I) iff ~t ∈ consistentΣ(q, I).
We say that Q is an L-rewriting of AGGCONSISTENT(q, Σ) if for every instance I
over R, tuple ~t and real numbers glb and lub, (~t, glb, lub) ∈ Q(I) iff (~t, glb, lub) ∈
aggconsistentΣ(q, I).
We also define the rewritability of a problem in a language L as follows. We say that
CONSISTENT(q, Σ) is L-rewritable if there exists a query Q expressed in language L such
that Q is an L-rewriting of CONSISTENT(q, Σ). A similar definition can be given for
AGGCONSISTENT(q, Σ).
In Chapter 3, we will consider classes of conjunctive queries, and present query rewritings
in first-order logic. Notice that if CONSISTENT(q, Σ) is first-order rewritable, then
it is tractable. This is because the data complexity of first-order logic is in PTIME (in
fact, in AC0, which is a subset of PTIME). Thus, the query rewriting Q can be executed
on the inconsistent database in polynomial time. Besides this, an approach based on
first-order query rewriting is attractive because first-order queries can be written in SQL. In
Chapter 4, we will focus on classes of conjunctive queries with bag semantics, grouping,
and aggregation. We will give query rewritings for the problem AGGCONSISTENT(q, Σ) in
a language that extends first-order logic with operators for grouping and aggregation. In
Chapter 5, we will study the computational complexity of the problem CONSISTENT(q, Σ).
Finally, in Chapters 6 and 7, we will present SQL query rewritings and show experimen-
tally that they can be run efficiently and scalably on a commercial relational database
system.
2.4 Constraints
The most commonly used types of constraints in database systems are keys and foreign
keys. Of these, keys pose a particular challenge since databases that are inconsistent
with respect to a set of key dependencies admit an exponential number of repairs in the
worst case. This potentially large number of repairs leads to the question of whether it is
possible to compute consistent answers efficiently. The answer to this question is known
to be negative in general [CLR03a, CM05]. However, this does not necessarily preclude
the existence of classes of queries for which the problem is easier to solve. Hence, we
consider the following questions. First, for what queries can consistent answers under
key constraints be computed in polynomial time (in data complexity)? Second, can
rewritings for such queries be executed efficiently in practice? We address the first
question in Chapters 3 and 4, and the second question in Chapter 6.
A key constraint is an integrity constraint of the form
∀~x, ~y, ~z.(r(~x, ~y) ∧ r(~x, ~z)) → ~y = ~z
In the above constraint, we say that ~x is a key of relation r. Notice that a key may
consist of many attributes. Throughout the thesis, we will assume that Σ is a set of key
constraints that includes one key constraint per relation of the schema. This corresponds
to the notion of primary keys in database systems.
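The key constraint above can be checked mechanically: an instance violates it exactly when two tuples agree on the key attributes but differ elsewhere. The following Python sketch reports the offending key values. The function name key_violations and the convention that the first key_len positions form the key are assumptions of this illustration.

```python
from collections import defaultdict


def key_violations(relation, key_len=1):
    """Map each violating key value to the set of nonkey values found with it."""
    nonkeys = defaultdict(set)
    for t in relation:
        # Split each tuple into its key part and its nonkey part.
        nonkeys[t[:key_len]].add(t[key_len:])
    # A key value is violated when it co-occurs with more than one nonkey value.
    return {k: v for k, v in nonkeys.items() if len(v) > 1}


employee = {("John", 1000), ("John", 2000), ("Mary", 1000)}
print(key_violations(employee))
```

An empty result means that the relation satisfies its key constraint and is therefore its own unique repair.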
To facilitate specifying the key constraints each time that we give a query, we will
underline the positions in each literal that correspond to key attributes. Furthermore,
by convention, the key attributes will be given first. For example, the query q =
∃x, y, z.r1(x̲, y) ∧ r2(y̲, z) indicates that the first and second literals correspond to
binary relations whose first attribute is the key. We will use vector notation (e.g., ~x, ~y) to
denote vectors of variables or constants from a query or tuple. In addition, when we give
a tuple, we will underline the values that appear at the position of key attributes. For
instance, for a tuple r(~c, ~d), we will say that ~c is a key value, and ~d is a nonkey value.
Using this notation, the key constraints of Σ that are relevant to the query are denoted
directly in the query expression.
2.5 Related Work
In this section, we survey work on related formal frameworks for managing inconsistent
data. For two excellent surveys of the area of consistent query answering, we refer the
reader to Bertossi and Chomicki [BC03] and Bertossi [Ber06].
Intuitively, a repair is a consistent database that is “as close as possible” to the given
inconsistent database. To formalize this intuition, it is necessary to define a notion of
distance between databases. The notion of distance that we employ in this thesis (and
which was initially proposed by Arenas, Bertossi, and Chomicki [ABC99]) is defined in
terms of the symmetric difference between sets. Other notions of distance have been
explored in the literature, which we review next.
Some proposals adopt a cardinality-based notion of distance between database instances,
instead of a set-theoretic one. For example, Lin and Mendelzon [LM96] propose a
semantics where conflicts are resolved according to a majority criterion. Their frame-
work is presented in the context of belief revision for first-order theories, and is therefore
broader in scope than consistent query answering. However, the complexity of query an-
swering under this semantics has not been studied. Other approaches [FPL+01, BBFL05,
FFP05, BMFR05] consider cost-based notions of distance, where each operation that can
be used to restore consistency is given a cost. Then, repairs are defined as the consistent
databases that can be obtained from the inconsistent database with a minimum cost.
These operations include not only insertion and deletion of tuples, but also modification
of values. While a cost-based notion of distance is attractive from a semantic point of
view, it can be computationally more expensive than the set-theoretic semantics. For
example, in the case of inconsistencies with respect to primary key dependencies, the
problem of obtaining a repair of an inconsistent database is NP-complete [BMFR05],
whereas it can be obtained in linear time under the set-theoretic semantics.
In some of the cost-based approaches mentioned above [FPL+01, BBFL05, FFP05],
tuples can be modified to contain values that are not in the active domain of the incon-
sistent database. Thus, the domain of the attributes that can be modified must have
an intrinsic distance metric. In particular, these approaches consider only numerical at-
tributes (it is not clear how their techniques could be extended to categorical values).
An approach based on tuple modification which allows arbitrary attribute domains is
given by Wijsen [Wij05]. In his work, the repaired databases may contain variables, and
the semantics is given in terms of homomorphisms to the inconsistent database. Instead
of answering queries directly on the inconsistent database (as we do in ConQuer), his
approach requires the offline processing of the inconsistent databases to construct con-
densed representations. The consistent answers to certain classes of queries can then be
obtained by directly executing the original query on the condensed representation.
In contrast to consistent answers, we could also consider possible answers, where
we retrieve answers that appear in at least one repair. This notion has received
less attention than consistent answers, perhaps because it is less challenging from a
computational point of view. In fact, for broad classes of queries and constraints for which
obtaining consistent answers is intractable, the problem of obtaining possible answers
is tractable (and it usually suffices to compute the original query on the inconsistent
database). Although they are easier to obtain, possible answers are as important as
consistent answers in the context of inconsistent databases. While consistent answers are
best suited for decision making, possible answers can be used to understand the reasons
why a database is inconsistent. For example, in ConQuer, we give the option of retrieving
not only the consistent answers but also the possible answers (see Chapter 6). If the user
decides that a possible answer should have been a consistent answer, he or she can request
an explanation from the system in terms of the underlying database. This explanation
often helps the user to detect incorrect data and to (interactively) correct it.
The notions of possible and consistent answers are two opposite ends of a spectrum:
the former being the most aggressive, and the latter the most cautious. In some scenarios,
it is desirable to give preference to (or rank) tuples in the answer according to the
number of repairs where they appear. Furthermore, some repairs may be preferable
to others. To formalize this intuition, it is natural to appeal to a semantics based on
probabilities, where each repair is assigned a probability of being the consistent database
that the user has in mind. There has been considerable research on the topic of
probabilistic databases [CP87, BMP92, LLRS97, FR97, DS04]. Recently, Dalvi and Suciu
[DS04] presented a framework for query rewriting over probabilistic databases. Their
rewriting algorithms rely on the fundamental assumption that each tuple has an inde-
pendent probability of being in the (in our terms) consistent database. In the context
of databases that violate primary key constraints, which is the focus of this thesis, we
cannot assume that all tuples are independent. In fact, tuples that share the same key
value are mutually exclusive. In recent work (which is not covered in this thesis), we
and other authors [AFM06] presented query rewriting algorithms that work under the
probabilistic semantics for databases that may violate primary key constraints. In that
paper, we also considered the important problem of obtaining the probabilities. In par-
ticular, we explored the use of a clustering-based technique that works particularly well
on categorical values [ATMS04]. The non-probabilistic semantics that we consider in this
thesis is a special case of the probabilistic semantics. However, the class of rewritable
queries that we can handle under the probabilistic semantics [AFM06] is considerably
more restricted than the classes considered in Chapters 3 and 4 of this thesis for the
non-probabilistic case.
Databases that are inconsistent with respect to primary key constraints can be
modelled as disjunctive databases [vdM98]. In particular, if Σ is a set of key dependencies, the
set of all repairs of an inconsistent database can be represented as a disjunctive database
D in such a way that each repair corresponds to a minimal model of D. However, to
the best of our knowledge, there are no results in the literature for query rewritings over
disjunctive databases. A relevant special case of disjunctive databases is that of databases with
OR-objects [IvdMV95]. If an inconsistent relation has two attributes (a key and a nonkey
attribute), then it can be modelled with OR-objects. However, this is no longer the case
for relations whose arity is greater than two.
To the best of our knowledge, DeMichiel [DeM89] and Agarwal et al. [AKWS95] are
the first authors to recognize the need to manage inconsistent databases. They propose
semantics analogous to the one for OR-objects. DeMichiel proposes algorithms that are
sound but not necessarily complete with respect to the semantics. Agarwal et al. do not
discuss the implementation of the projection and join operations. As we will see in
Chapter 3, these operations are particularly challenging under the consistent query
answering semantics, and handling them is an important contribution of this thesis.
We conclude this section by pointing out that the problem of dealing with inconsis-
tency arises (and has been studied) in other fields of computer science. For example, our
approach to handling inconsistency is related to the approaches followed by the belief
revision community [GR95] in the field of artificial intelligence. The scenario typically
adopted in belief revision is more general in scope than ours, since (in our terms) they
allow the modification of not only the data but also the integrity constraints. As another
example, the problem of handling inconsistency has been studied in software engineering
[Bal91, NER00]. This body of work focuses not on data or query
answering, but on the reconciliation of inconsistent views of software requirements and
specifications.
Chapter 3
Rewritings for Conjunctive Queries
The problem of computing consistent answers for conjunctive queries over databases that
might violate a set of key constraints is known to be coNP-complete in general [CLR03a,
CM05]. This is the case even for queries with no repeated relation symbols, which is
the focus of this chapter. However, this does not necessarily preclude the existence of
classes of queries for which the problem is easier to solve. In fact, in this chapter we
characterize a large and practical class of conjunctive queries for which the problem of
computing consistent answers under key constraints is indeed tractable. Even more so,
we show that all queries in this class are first-order rewritable, and we give a linear-time
algorithm that computes the first-order rewriting. We introduce the class of queries in
Section 3.1, and we present the query rewriting algorithm in Section 3.2. The proof of
correctness of the algorithm is given in Section 3.3.
3.1 A Broad Class of First-Order Rewritable Queries
3.1.1 Notation for Conjunctive Queries
The results in this chapter concern a class of conjunctive queries. Conjunctive queries
[CM77, AHV95] are first-order formulas that may only have conjunctions of positive
literals and existential quantification. That is, they are formulas of the following form:
q(~z) = ∃~w.R1(~x1, ~y1) ∧ · · · ∧ Rn(~xn, ~yn)
where each variable of ~x1, ~y1, . . . , ~xn, ~yn appears in exactly one of ~z and ~w. We will
say that the variables in ~z are the free variables of q, and that the variables in ~w are the
existentially-quantified variables of q. Even though there are no equality symbols in our
notation for conjunctive queries, their effect can be achieved by having variables appear
more than once in the queries.
Notice that in the formula above, we denote the literals as Ri(~xi, ~yi). Throughout
the thesis, we will use the convention of using capital letters (usually R, S and T ) to
denote literals of a query. Notice that two distinct literals Ri and Rj may be on the same
relation symbol r (although most results in this thesis are for queries without repeated
relation symbols in which each literal corresponds to a distinct relation).
We will adopt the convention of using ~x to denote variables and constants of a literal
that appear at a position corresponding to key attributes of the relation symbol of the
literal, and ~y for variables and constants that appear at the position of nonkey attributes
of the relation symbol of the literal.
We will say that there is a join on a variable w if w appears in two literals Ri(~xi, ~yi)
and Rj(~xj, ~yj) such that i ≠ j. If w occurs in ~yi and ~yj, we say that there is a nonkey-
to-nonkey join on w; if w occurs in ~yi and ~xj, we say that there is a nonkey-to-key join;
and if w occurs in ~xi and ~xj, we say that there is a key-to-key join.
3.1.2 Join Graph
Before introducing the class of queries handled by our algorithm, let us get some insight
from queries that are not considered by our algorithm because (unless P=NP) there is
no first-order rewriting that computes the consistent answer (no matter what rewriting
algorithm is used). In particular, let us consider the following queries:
• q1 = ∃x, x′, y.R1(x̲, y) ∧ R2(x̲′, y)
• q2 = ∃x, y.R1(x̲, y) ∧ R2(y̲, x)
• q3 = ∃x, x′, w, w′, z, z′, m.R1(x̲, w) ∧ R2(m̲, w̲, z) ∧ R3(x̲′, w′) ∧ R4(m̲, w̲′, z′)
We will show in Chapter 5 that the problem of computing consistent answers for the
above queries is intractable. The first query consists of a join between nonkey attributes;
the second one involves a cycle of nonkey-to-key joins; and in the third, there are two
joins from nonkey variables to part of, but not the entire, key of the corresponding relations.
In order to be more precise in specifying such conditions, we need the notion of the join
graph of a query, which has a node for each literal of a query. Notice that the conditions
that we just gave are concerned with joins where at least one nonkey variable is involved.
Therefore, the join graph will be a directed graph, where directionality is determined by
the nonkey variables involved in the join.
Definition 3.1 (join graph). Let q be a conjunctive query. The join graph G of q is a
directed graph such that:
• the vertices of G are the literals of q;
• there is an arc from Ri to Rj if i ≠ j, and there is some variable w such that w is
existentially-quantified in q, w occurs at the position of a nonkey attribute in Ri,
and w occurs in Rj.
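Definition 3.1 translates directly into code. In the Python sketch below, a literal is encoded as a triple (name, key variables, nonkey variables) and the query as a list of such triples plus its set of free variables. This encoding, the function name join_graph, and the restriction to variables (no constants) are assumptions of the illustration.

```python
def join_graph(literals, free_vars=frozenset()):
    """Arcs of the join graph of Definition 3.1, as (source, target) name pairs.

    Each literal is a triple (name, key_vars, nonkey_vars); literal names are
    assumed to be distinct.
    """
    arcs = set()
    for i, (name_i, _, nonkey_i) in enumerate(literals):
        # Only existentially-quantified nonkey variables of Ri can create arcs.
        sources = set(nonkey_i) - set(free_vars)
        for j, (name_j, key_j, nonkey_j) in enumerate(literals):
            if i != j and sources & (set(key_j) | set(nonkey_j)):
                arcs.add((name_i, name_j))
    return arcs


# q1 = exists x, x', y. R1(x, y) and R2(x', y), with keys x and x'
q1 = [("R1", ("x",), ("y",)), ("R2", ("x'",), ("y",))]
# q2 = exists x, y. R1(x, y) and R2(y, x), with keys x and y
q2 = [("R1", ("x",), ("y",)), ("R2", ("y",), ("x",))]

print(join_graph(q1))  # arcs in both directions: a cycle
print(join_graph(q2))  # arcs in both directions: a cycle
```

Making every variable free removes all arcs, and a key-to-key join alone never adds one, in line with the remarks that follow the definition.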
Notice that key-to-key joins do not introduce any arcs to the join graph. Since the
class of first-order rewritable queries that we will present shortly is defined in terms of
the join graph, its queries can have arbitrary key-to-key joins. Further, the free variables
of a query do not introduce arcs to the join graph. As a special case, if all the variables
of a query are free, then its join graph has no arcs. Such queries correspond to the
class of quantifier-free queries, and have already been shown to be first-order rewritable
[ABC99]. If we think in terms of equivalent SQL queries, the fact that all variables are
free means that every attribute of every relation in the from clause must appear in the
select clause.1 This is a strong condition which restricts the practical applicability of
the class. As an empirical observation, none of the queries in the TPC-H specification
[TPC03], the industry standard for decision support systems, satisfies this restriction. For
this reason, we will focus on a class of conjunctive queries that may have existential
quantification (in relational algebra terms, arbitrary projections). Handling queries with
existentially-quantified variables is a major challenge, which we address in this chapter.
In Figure 3.1, we show the join graphs for q1 and q2 (we label the arcs with the variable
involved in the joins for illustration purposes). Observe in the figure that both join graphs
have a cycle. For our rewriting algorithm, we will focus on queries that have an acyclic
join graph. Additionally, when we consider how two literals Ri and Rj are joined, we will
require that if any of the key attributes of Ri are joined with a nonkey attribute of Rj,
then all of the key attributes of Ri join with nonkey attributes of Rj. We will then say
that the query has only full nonkey-to-key joins. For example, in the query q3 above, of
the form ∃x, x′, w, w′, z, z′, m.R1(x̲, w) ∧ R2(m̲, w̲, z) ∧ R3(x̲′, w′) ∧ R4(m̲, w̲′, z′), the joins
between R1 and R2, and between R3 and R4, are not full since they do not involve the
entire key of R2 and R4, respectively.
1The only exceptions are the attributes that are equated in the where clause. In that case, only one of the equated attributes needs to appear in the select clause.
Definition 3.2. Let q be a conjunctive query. Let Ri(~xi, ~yi) and Rj(~xj, ~yj) be a pair of
literals of q. We say that there is a full nonkey-to-key join from Ri to Rj if every variable
of ~xj appears in ~yi.
We observe that if G is an acyclic join graph for a query all of whose nonkey-to-key
joins are full, then G must be a forest. We show this with the following proposition.
Proposition 3.3. Let q be a query all of whose nonkey-to-key joins are full. Let G be
the join graph of q. If G is acyclic, then G is a forest.
Proof. Assume towards a contradiction that G is a directed acyclic graph that is not a
forest. Then, there is a node v in G that receives arcs from two different nodes vi and vj
of G. Let R(~x, ~y), Ri(~xi, ~yi), and Rj(~xj, ~yj) be the literals at nodes v, vi, and vj,
respectively. Since there are arcs from vi and vj to v, there are variables wi and wj in
~yi and ~yj, respectively, that appear in R. Since G is acyclic, wi and wj must appear in
~x. Also, wj cannot appear in a nonkey position of Ri (or, otherwise, there would be a
cycle between the nodes vi and vj). Since there is a nonkey-to-key join from Ri to R on
variable wi, and variable wj does not occur at a nonkey position of Ri, the join is not
full; contradiction.
3.1.3 The Class Cforest of First-Order Rewritable Queries
We will now characterize a broad class of conjunctive queries for which the problem of
computing consistent answers under key constraints is tractable and first-order rewritable.
The characterization is given in terms of the join graph of the queries. In particular, we
will require three conditions. First, all the nonkey-to-key joins of the query must be full.
Second, the join graph must be a forest. As we showed in Proposition 3.3, this includes
all queries with full nonkey-to-key joins with acyclic join graph. Finally, the query should
have no repeated relation symbols. We call this class Cforest since we require the join
graph of its queries to be a forest, and we give the formal definition next.
Definition 3.4. Let q be a conjunctive query without repeated relation symbols and all of
whose nonkey-to-key joins are full. Let G be the join graph of q. We say that q ∈ Cforest
if G is a forest (i.e., every connected component of G is a tree).
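Membership in Cforest can be tested directly from Definition 3.4. The sketch below is illustrative only and rests on several assumptions: literals are triples (name, key terms, nonkey terms); variables are strings that start with a lowercase letter and constants are capitalized; and the join graph has an arc from Ri to Rj whenever an existentially-quantified variable at a nonkey position of Ri occurs in Rj. The function names are hypothetical.

```python
def is_var(term):
    # Assumed convention: variables start with a lowercase letter, constants do not.
    return term[:1].islower()

def join_graph(literals, free_vars):
    """Arcs (i, j): a non-free variable at a nonkey position of literal i occurs in literal j."""
    arcs = set()
    for i, (_, _, nonkey_i) in enumerate(literals):
        joining = {v for v in nonkey_i if is_var(v) and v not in free_vars}
        for j, (_, key_j, nonkey_j) in enumerate(literals):
            if i != j and joining & set(key_j + nonkey_j):
                arcs.add((i, j))
    return arcs

def in_cforest(literals, free_vars):
    names = [name for name, _, _ in literals]
    if len(set(names)) != len(names):
        return False                      # repeated relation symbol
    parent = {}
    for i, j in join_graph(literals, free_vars):
        key_j = [v for v in literals[j][1] if is_var(v)]
        hits = [v in literals[i][2] for v in key_j]
        if any(hits) and not all(hits):
            return False                  # a nonkey-to-key join that is not full
        if j in parent:
            return False                  # two incoming arcs: not a forest
        parent[j] = i
    for start in range(len(literals)):    # follow parents to detect directed cycles
        seen, node = set(), start
        while node in parent:
            if node in seen:
                return False
            seen.add(node)
            node = parent[node]
    return True

# Tree-shaped query: employee(e, d, c), city(c, p), prov(p, Canada), dept(d, Peter).
tree_query = [("employee", ("e",), ("d", "c")), ("city", ("c",), ("p",)),
              ("prov", ("p",), ("Canada",)), ("dept", ("d",), ("Peter",))]
# Nonkey-to-nonkey join: r1(x, y), r2(x2, y).
nonkey_join = [("r1", ("x",), ("y",)), ("r2", ("x2",), ("y",))]
# Joins that are not full: R1(x, w), R2(m, w, z), R3(x2, w2), R4(m, w2, z2).
partial_joins = [("R1", ("x",), ("w",)), ("R2", ("m", "w"), ("z",)),
                 ("R3", ("x2",), ("w2",)), ("R4", ("m", "w2"), ("z2",))]

print(in_cforest(tree_query, {"e"}), in_cforest(nonkey_join, set()),
      in_cforest(partial_joins, set()))
```

Under the assumed arc definition, a nonkey-to-nonkey join produces arcs in both directions and is therefore rejected by the cycle test.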
Figure 3.1: Cyclic join graphs of intractable queries
A fundamental observation about Cforest is that it is a very common, practical class
of queries. Arguably, the most commonly used joins go from a set of nonkey attributes of
one relation (which may be a foreign key)2 to the key of another relation (which may be
a primary key). Furthermore, such joins typically involve the entire primary key of the
relation (and, hence, they are full joins in our terms). Finally, cycles are rarely present
in the queries used in practice. Admittedly, the restriction not to have repeated relation
symbols does rule out some common queries (those in which the same relation appears
twice in the from clause of an SQL query). Still, many queries used in practice do not
have repeated relation symbols.
As an empirical observation, only one out of 22 queries in the TPC-H specification
[TPC03], the industry standard for decision support queries, has a nonkey-to-nonkey
join. All the queries in the standard are acyclic, and all the nonkey-to-key joins of the
queries are full.
3.2 Query Rewriting Algorithm
In this section, we present the query rewriting algorithm RewriteForest that works for
the class of conjunctive queries Cforest introduced in the previous section. We start the
presentation with a number of examples that highlight some of the intuition underlying
the algorithm.
In the next example, we illustrate the rewriting for a query consisting of only one
literal. We also show that even for such a simple query, the query itself is not a rewriting
for the problem of computing its own consistent answers.
2 Notice that we are not dealing with the problem of inconsistency with respect to foreign keys, but only with respect to key dependencies.
Example 3.1. As in Example 2.1, consider a schema R with one relation symbol
employee, which has two attributes: emplKey (the name of the employee) and salary.
Furthermore, consider a set Σ consisting of only one constraint stating that the attribute
emplKey is the key of relation employee.
Let q1 be a query that retrieves all the employees from the database that make
a salary of 1000, expressed as q1(e) = employee(e, 1000). First of all, notice that q1
itself is not a query rewriting of CONSISTENT(q1, Σ). Consider a database instance I1 =
{employee(John, 1000), employee(John, 2000)}. It is easy to see that (John) ∈ q1(I1).
However, (John) ∉ consistentΣ(q1, I1) because the repair ℐ = {employee(John, 2000)}
is such that (John) ∉ q1(ℐ).
Now, consider a database instance I2 = {employee(John, 1000), employee(John, 2000),
employee(Mary, 1000)}. It is easy to see that (Mary) ∈ consistentΣ(q1, I2). This is
because employee Mary appears with a salary of 1000 as its nonkey value, and does not
appear with any other s′ such that s′ ≠ 1000. This can be checked with a formula
Qconsist(e) = ∀s′.(employee(e, s′) → s′ = 1000). In fact, we will show that a query
rewriting Q1 for q1 can be obtained as the conjunction of q1 and Qconsist:
Q1(e) = employee(e, 1000) ∧ ∀s′.(employee(e, s′) → s′ = 1000)
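The claims of Example 3.1 can be verified by brute force on the two instances. The following Python sketch is illustrative only. It assumes that a repair keeps exactly one tuple per key value, and all function names are hypothetical. It compares the answers of q1 on every repair with the answers of the rewriting evaluated once on the inconsistent instance.

```python
from itertools import product

def repairs(employee):
    """All repairs of a relation keyed on its first attribute: one tuple per key value."""
    groups = {}
    for name, salary in employee:
        groups.setdefault(name, []).append((name, salary))
    return [set(choice) for choice in product(*groups.values())]

def q1(employee):
    """q1(e) = employee(e, 1000)."""
    return {name for name, salary in employee if salary == 1000}

def consistent_answers(employee):
    """Answers of q1 that hold in every repair (exponential in general)."""
    return set.intersection(*(q1(r) for r in repairs(employee)))

def Q1(employee):
    """q1 together with the check that no tuple with the same key has another salary."""
    return {e for e in q1(employee)
            if all(s == 1000 for n, s in employee if n == e)}

I1 = {("John", 1000), ("John", 2000)}
I2 = I1 | {("Mary", 1000)}

print(q1(I1), Q1(I1), consistent_answers(I1))
print(Q1(I2), consistent_answers(I2))
```

On both instances the rewriting returns the same set as the repair enumeration, while q1 alone still returns John on I1.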
In the next example, we illustrate the rewriting for a conjunctive query that has a
nonkey-to-key join.
Example 3.2. Let R be a schema with two relation symbols: employee and dept. As-
sume that employee has two attributes: emplKey (employee name), and deptFKey (de-
partment name); and dept has two attributes deptKey (department name) and mgrName
(manager name). Assume that there are two key constraints in Σ, stating that emplKey is
the key of the relation employee, and deptKey is the key of relation dept.
Consider the query q2 that retrieves the names of all employees whose department
appears in the dept relation:
q2(e) = ∃d,m.employee(e, d) ∧ dept(d,m)
As in the previous example, q2 itself is not a query rewriting of CONSISTENT(q2, Σ).
Consider the database instance I1 = {employee(John, Sales), employee(John, Engineering),
dept(Sales, Peter)}. It is easy to see that (John) ∈ q2(I1). However, we have that
(John) ∉ consistentΣ(q2, I1) because the repair ℐ = {employee(John, Engineering),
dept(Sales, Peter)} is such that (John) ∉ q2(ℐ).
Now, consider the following database instance I2 = {employee(John, Sales),
employee(John, Engineering), dept(Sales, Peter), dept(Engineering, Tom)}. It is easy
to see that (John) ∈ consistentΣ(q2, I2). This is because every nonkey value
(department name) that appears together with John in some tuple (in this case, Sales
and Engineering) joins with a tuple of dept. This can be checked with a formula
Qconsist(e) = ∀d.(employee(e, d) → ∃m.dept(d, m)). We will soon show that a query
rewriting Q2 for q2 can be obtained as the conjunction of q2 and Qconsist, as follows:
Q2(e) = ∃d, m.(employee(e, d) ∧ dept(d, m)) ∧ ∀d.(employee(e, d) → ∃m.dept(d, m))
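The same brute-force comparison applies to the join query of Example 3.2. In this hedged Python sketch, a database is a set of triples (relation, key, nonkey value), which is an assumed encoding chosen for illustration. The function names are hypothetical.

```python
from itertools import product

def repairs(db):
    """One tuple per (relation, key) group."""
    groups = {}
    for rel, key, val in db:
        groups.setdefault((rel, key), []).append((rel, key, val))
    return [set(choice) for choice in product(*groups.values())]

def q2(db):
    """q2(e): employees whose department appears in dept."""
    depts = {k for rel, k, _ in db if rel == "dept"}
    return {e for rel, e, d in db if rel == "employee" and d in depts}

def consistent_answers(db):
    return set.intersection(*(q2(r) for r in repairs(db)))

def Q2(db):
    """q2 plus the check that every department listed for e joins with dept."""
    depts = {k for rel, k, _ in db if rel == "dept"}
    emp = [(e, d) for rel, e, d in db if rel == "employee"]
    return {e for e, d in emp
            if d in depts and all(d2 in depts for e2, d2 in emp if e2 == e)}

I1 = {("employee", "John", "Sales"), ("employee", "John", "Engineering"),
      ("dept", "Sales", "Peter")}
I2 = I1 | {("dept", "Engineering", "Tom")}

print(q2(I1), Q2(I1), consistent_answers(I1))
print(Q2(I2), consistent_answers(I2))
```

The universally quantified part of the rewriting is what removes John on I1: one of his two candidate departments has no matching dept tuple.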
We now proceed to present RewriteForest, the query rewriting algorithm for queries
in Cforest (shown in Figures 3.2, 3.3, and 3.4). Given a query q such that q ∈ Cforest
and a set of key constraints Σ (containing one key per relation), RewriteForest(q, Σ)
returns a first-order rewriting Q for the problem of obtaining the consistent answers
for q with respect to Σ. The main procedure of the algorithm is shown in Figure 3.2.
The first-order rewriting Q that it returns is obtained as the conjunction of the input
query q, and a new query called Qconsist. The query Qconsist is used to ensure that q is
satisfied in every repair. It is important to notice that Qconsist will be applied directly to
the inconsistent database (i.e., we will never explicitly generate the repairs). The query
Qconsist is obtained by recursion on the tree structure of each of the components of the
join graph of q (recall that since q is in Cforest, the join graph is a forest). The recursive
procedure is called RewriteTree, and is shown in Figure 3.3.
The first part of RewriteTree produces a rewriting Qlocal for the literal R(~x, ~y) at the
root of the input tree. This rewriting is done independently of the rest of the query, and
it is produced by the procedure RewriteLocal (shown in Figure 3.4). The query Qlocal
deals with the constants that appear in ~y in the same way as we illustrated in Example
3.1. It also deals with the free variables that appear at nonkey positions of the query in
the way that we illustrate in the next example.
Algorithm RewriteForest(q, Σ)
Input: q(~z), a query of the form ∃~w.φ(~w, ~z)
       Σ, a set of key constraints, one per relation used in q
Output: Q, a first-order query that computes consistentΣ(q, I) for every database I
  Let G be the join graph of q
  Let T1, . . . , Tm be the connected components of G
  for i := 1 to m do
    Let Ri(~xi, ~yi) be the literal at the root of Ti
    Let φi be the conjunction of literals of Ti
    Let ~wi = {w : w is a variable that occurs in φi and ~w, and w ∉ ~xi}
    Let ~zi = {z : z is a variable that occurs in φi and ~z, and z ∉ ~xi}
    Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi)
    Let Qi(~xi, ~zi) = RewriteTree(qi, Σ)
  end for
  Let Qconsist(~w, ~z) = ∧i=1...m Qi(~xi, ~zi)
  Let Q(~z) = ∃~w.(φ(~w, ~z) ∧ Qconsist(~w, ~z))
  return Q
Figure 3.2: Query rewriting algorithm for conjunctive queries in Cforest
Example 3.3. Consider the query q3 that retrieves all employees and their salaries from
the database, expressed as q3(e, s) = employee(e, s). Notice that the only difference with
the query q1 of Example 3.1 is that the constant 1000 is replaced by the free variable
s. The algorithm RewriteLocal creates a new, universally-quantified variable s′ for the
free variable s, and equates s′ to s. The resulting query rewriting for q3 is the following:
Q3(e, s) = employee(e, s) ∧ ∀s′.(employee(e, s′) → s′ = s)
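The treatment of constants, of key or free variables, and of repeated variables by RewriteLocal (Figure 3.4) can be sketched as a small formula generator. The Python code below is a simplified illustration, not the implementation used in this thesis. The function rewrite_local, its argument conventions, and the fresh variable names v0, v1, . . . (standing for σ(p)) are assumptions, and the fresh names are assumed not to clash with variables of the query. Each equality between repeated positions is emitted once rather than symmetrically.

```python
def rewrite_local(rel, key, nonkey, free, constants=frozenset()):
    """Return Qlocal as a formula string for the literal rel(key, nonkey)."""
    fresh = tuple(f"v{p}" for p in range(len(nonkey)))   # sigma(p); assumed not to clash
    eqs = []
    for p, term in enumerate(nonkey):
        if term in constants:                  # constant at position p
            eqs.append(f"{fresh[p]} = {term}")
            continue
        if term in key or term in free:        # variable also occurs in ~x or ~z
            eqs.append(f"{fresh[p]} = {term}")
        for p2 in range(p + 1, len(nonkey)):   # repeated variable; each pair emitted once
            if nonkey[p2] == term:
                eqs.append(f"{fresh[p]} = {fresh[p2]}")
    bound = [t for t in dict.fromkeys(nonkey)
             if t not in constants and t not in key and t not in free]
    atom = f"{rel}({', '.join(key + nonkey)})"
    query = f"∃{','.join(bound)}.{atom}" if bound else atom
    if not eqs:
        return query
    starred = f"{rel}({', '.join(key + fresh)})"
    return f"{query} ∧ ∀{','.join(fresh)}.({starred} → {' ∧ '.join(eqs)})"

# q1(e) = employee(e, 1000) from Example 3.1
print(rewrite_local("employee", ("e",), ("1000",), {"e"}, {"1000"}))
# q3(e, s) = employee(e, s) from Example 3.3
print(rewrite_local("employee", ("e",), ("s",), {"e", "s"}))
# q(e) = ∃s.employee(e, s): no equalities are generated
print(rewrite_local("employee", ("e",), ("s",), {"e"}))
```

Up to the renaming of s′ to v0, the first two calls reproduce Q1 and Q3.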
The second part of RewriteTree recursively creates a query Qi for each subtree Ti
of T rooted at R. Let ~y0 be the variables at nonkey positions of R (excluding those
that also appear in ~x). Then, one of the conjuncts of the rewritten query returned by
RewriteTree is of the form ∀~y0.R(~x, ~y) → ∧i=1...m Qi(~xi, ~zi). Notice that the variables of
~y0 (i.e., the variables at nonkey positions of the root literal R) are universally quantified.
The intuition behind this is that, as we illustrated in Example 3.2, the query must
be satisfied by all the nonkey values of a given key (in that example, all the possible
departments for the given employee).
Algorithm RewriteTree(q, Σ)
Input: q(~x, ~z), a query in Cforest of the form ∃~w.φ(~x, ~w, ~z),
       whose join graph T is a tree with root literal R(~x, ~y)
       Σ, a set of key constraints, one per relation
Output: Q, a first-order query that computes consistentΣ(q, I) for every database I
  Let T be the join graph of q
  Let R(~x, ~y) be the literal at the root node of T
  Let qlocal(~x, ~z) = ∃~w.R(~x, ~y)
  Let Qlocal(~x, ~z) = RewriteLocal(qlocal, Σ)
  if φ has exactly one literal then
    Q = Qlocal
  else
    Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the children of R in T
    for i := 1 to m do
      Let Ti be the subtree of T rooted at Ri
      Let φi be the conjunction of literals of Ti
      Let ~wi = {w : w is a variable that occurs in φi and ~w, and w ∉ ~xi}
      Let ~zi = {z : z is a variable that occurs in φi and ~z, and z ∉ ~xi}
      Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi)
      Let Qi(~xi, ~zi) = RewriteTree(qi, Σ)
    end for
    Let ~y0 = {y : y is a variable that occurs in ~y and ~w, and y ∉ ~x}
    Let Q(~x, ~z) = Qlocal(~x, ~z) ∧ ∀~y0.R(~x, ~y) → ∧i=1...m Qi(~xi, ~zi)
  end if
  return Q
Figure 3.3: Recursive algorithm on the tree structure of the join graph
The next example illustrates an application of the algorithm.
Example 3.4. Let R be a schema with four relation symbols: employee, dept, city,
and prov. Assume that employee has three attributes: emplKey (employee name),
deptFKey (department name), and cityFKey (city name); dept has two attributes:
deptKey (department name) and mgrName (manager’s name); city has two attributes:
cityKey and provFKey; and prov has two attributes: provKey (province name) and
countryName (country name). Assume that there are four key constraints in Σ, stating
that emplKey is the key of the relation employee; cityKey is the key of relation city;
deptKey is the key of the relation dept; and provKey is the key of the relation prov.
Algorithm RewriteLocal(q, Σ)
Input: q(~x, ~z), a query of the form ∃~w.R(~x, ~y), where
       none of the variables of ~w appear in ~x
       Σ, a set of key constraints
  Let σ be an injective function mapping natural numbers to variables not present in R
  Initialize Eq as an empty set
  for each position p of ~y do
    Let w be the variable that appears at position p of ~y
    Let z = σ(p)
    if there is a constant d at position p of ~y then
      Add the equality z = d to Eq
    end if
    if w appears in ~x or w appears in ~z then
      Add the equality z = w to Eq
    end if
    for every position p′ of ~y such that p ≠ p′ and w occurs in ~y at position p′ do
      Let z′ = σ(p′)
      Add the equality z = z′ to Eq
    end for
  end for
  if Eq ≠ ∅ then
    Let ~y∗ be a vector of variables of the same arity as ~y, and
      such that if z is at position p of ~y∗, then σ(p) = z
    Let Qeq be the conjunction of the equalities of Eq
    Let Qlocal(~x, ~z) = ∃~w.R(~x, ~y) ∧ ∀~y∗.R(~x, ~y∗) → Qeq
  else
    Let Qlocal(~x, ~z) = ∃~w.R(~x, ~w)
  end if
  return Qlocal
Figure 3.4: Query rewriting for a given literal
Figure 3.5: Join graph of query q4.
Consider a query q4 that retrieves the names of all employees that are located in
Canada and whose manager is Peter:
q4(e) = ∃d, c, m, p. employee(e, d, c) ∧ city(c, p) ∧ prov(p, Canada) ∧ dept(d, Peter)
The join graph of q4 is given in Figure 3.5. Notice that the join graph of q4 is a tree.
Furthermore q4 has full nonkey-to-key joins and no repeated relation symbols. Thus, q4
is in Cforest.
Let q′′ be the query q′′(c) = ∃p.city(c, p) ∧ prov(p, Canada); let q′′′ be the query
q′′′(p) = prov(p, Canada); and let qIV (d) = dept(d, Peter). The first-order query rewrit-
ing Q4 of q4 is obtained by applying the algorithm RewriteForest(q4, Σ) as follows.
Q4(e) = ∃d, c, m, p. employee(e, d, c) ∧ dept(d, Peter) ∧ city(c, p) ∧ prov(p, Canada) ∧ Qconsist(e)
where:
Qconsist(e) = RewriteTree(q4, Σ) =
∃d, c.employee(e, d, c) ∧ ∀d, c.(employee(e, d, c) → (Q′′(c) ∧ QIV(d)))
Q′′(c) = RewriteTree(q′′, Σ) =
∃p.city(c, p) ∧ ∀p.(city(c, p) → Q′′′(p))
Q′′′(p) = RewriteTree(q′′′, Σ) =
prov(p, Canada) ∧ ∀w′.(prov(p, w′) → w′ = Canada)
QIV(d) = RewriteTree(qIV, Σ) =
dept(d, Peter) ∧ ∀u′.(dept(d, u′) → u′ = Peter)
Notice the reuse of variables in the rewritten queries. In particular, each existentially-
quantified variable of q4 that appears at a nonkey position in a literal of q4 is first
existentially quantified, and then universally quantified in the rewriting Q4.
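The recursive structure of Q4 can be exercised on a small database. The Python sketch below is a simplified illustration rather than the algorithm of Figure 3.3: it evaluates the meaning of the rewriting directly instead of producing a formula, it assumes single-attribute keys, and the database DB is hypothetical sample data that does not come from this thesis. With every=True, evaluate follows the universally quantified reading of Qconsist. With every=False, it is the ordinary evaluation of q4, which is used to compute the consistent answers by enumerating repairs.

```python
from itertools import product

# A tree query node is (relation, specs): one spec per nonkey position, either a
# constant (str) or a child node whose single-attribute key joins with that position.
PROV = ("prov", ("Canada",))
CITY = ("city", (PROV,))
DEPT = ("dept", ("Peter",))
Q4_TREE = ("employee", (DEPT, CITY))          # employee(e, d, c)

def evaluate(node, key, db, every):
    """every=True: all tuples with this key must qualify; every=False: some tuple does."""
    rel, specs = node
    rows = [t for r, t in db if r == rel and t[0] == key]
    def row_ok(t):
        return all(evaluate(s, v, db, every) if isinstance(s, tuple) else v == s
                   for s, v in zip(specs, t[1:]))
    return bool(rows) and (all if every else any)(row_ok(t) for t in rows)

def repairs(db):
    groups = {}
    for rel, t in db:
        groups.setdefault((rel, t[0]), []).append((rel, t))
    return [set(choice) for choice in product(*groups.values())]

def names(db):
    return {t[0] for r, t in db if r == "employee"}

def rewriting_answers(db):
    return {e for e in names(db) if evaluate(Q4_TREE, e, db, every=True)}

def brute_force_answers(db):
    return {e for e in names(db)
            if all(evaluate(Q4_TREE, e, rep, every=False) for rep in repairs(db))}

DB = {  # hypothetical data: John, Mary and the Research department violate their keys
    ("employee", ("John", "Sales", "Toronto")), ("employee", ("John", "Sales", "Ottawa")),
    ("employee", ("Mary", "Sales", "Toronto")), ("employee", ("Mary", "Sales", "Boston")),
    ("employee", ("Ann", "Research", "Toronto")),
    ("dept", ("Sales", "Peter")), ("dept", ("Research", "Peter")), ("dept", ("Research", "Tom")),
    ("city", ("Toronto", "Ontario")), ("city", ("Ottawa", "Ontario")),
    ("city", ("Boston", "Massachusetts")),
    ("prov", ("Ontario", "Canada")), ("prov", ("Massachusetts", "USA")),
}

print(rewriting_answers(DB), brute_force_answers(DB))
```

John is the only consistent answer in this sample: both of his candidate cities are in Canada, whereas Mary may be located in Boston and the manager of Ann's department may be Tom.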
Recall that queries with repeated relation symbols are not allowed in the class Cforest.
We now give an example of a query with repeated relation symbols for which our al-
gorithm fails to give the consistent answers. Although not addressed in this work, it
would be interesting to characterize the class of queries with repeated relation symbols
for which our algorithm is indeed correct.
Example 3.5. Let R be a schema with one relation symbol r, which has three attributes:
A,B, C. Assume that A is the key of the relation r. Let q be the Boolean query
q = ∃x, y, z.r(x, y, a) ∧ r(y, z, b), where a and b are constants. If we apply our query
rewriting algorithm, we obtain the following:
Q = ∃x, y, z.r(x, y, a) ∧ r(y, z, b) ∧ ∀y′, z′.(r(x, y′, z′) → z′ = a) ∧
∀y.(r(x, y, a) → ∃z.r(y, z, b) ∧ ∀z′, w′.(r(y, z′, w′) → w′ = b))
Let I be the database instance I = {r(c, d, a), r(d, e, b), r(d, f, a), r(f, g, b)}. In this
case, there are two repairs of I with respect to Σ: I1 = {r(c, d, a), r(d, e, b), r(f, g, b)}
and I2 = {r(c, d, a), r(d, f, a), r(f, g, b)}. Clearly, I1 |= q and I2 |= q. Thus,
consistentΣ(q, I) = true. However, I ⊭ Q.
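The counterexample of Example 3.5 can be checked mechanically. The Python sketch below is illustrative only. It encodes the instance I as triples, enumerates its repairs, and evaluates q and Q by direct translation of the formulas. It assumes that the innermost equality of Q constrains the third attribute of r, and the function names are hypothetical.

```python
from itertools import product

I = {("c", "d", "a"), ("d", "e", "b"), ("d", "f", "a"), ("f", "g", "b")}

def repairs(r):
    """One tuple per value of the key attribute A (first position)."""
    groups = {}
    for t in r:
        groups.setdefault(t[0], []).append(t)
    return [set(choice) for choice in product(*groups.values())]

def q(r):
    """q = ∃x,y,z. r(x, y, a) ∧ r(y, z, b)."""
    return any(t1[2] == "a" and t2[0] == t1[1] and t2[2] == "b"
               for t1 in r for t2 in r)

def Q(r):
    """The rewriting produced for q, evaluated on the given instance."""
    def rows(k):
        return [t for t in r if t[0] == k]
    def child_ok(y):
        # ∃z.r(y, z, b), and every tuple with key y has b in its third position
        return bool(rows(y)) and all(t[2] == "b" for t in rows(y))
    return any(t[2] == "a"
               and any(u[2] == "b" for u in rows(t[1]))
               and all(u[2] == "a" for u in rows(t[0]))
               and all(child_ok(u[1]) for u in rows(t[0]) if u[2] == "a")
               for t in r)

print([q(rep) for rep in repairs(I)], Q(I))
```

Both repairs satisfy q, but through different witnesses for x, which is why a rewriting that fixes one value of x at a time fails on this instance.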
We finish this section by pointing out that the complexity of the query rewriting
algorithm is linear in the number of literals of the input query. To see this, notice that
the algorithm visits each node of the join graph exactly once.
3.3 Correctness of the Algorithm
In this section, we show that the algorithm RewriteForest presented in the previous
section is correct for all queries in the class Cforest. In particular, we prove the following
theorem.
Theorem 3.5. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z) be a conjunctive query over R such that
q ∈ Cforest. Let Q(~z) be the first-order query returned by RewriteForest(q, Σ). Let I be
an instance over R.
Then, ~t ∈ Q(I) iff ~t ∈ consistentΣ(q, I).
Our proof relies on a few simple properties of repairs of inconsistent databases where
the set of integrity constraints contains a single key dependency per relation. We establish
these properties in Section 3.3.1. In Section 3.3.2, we show a structural property of the
queries in Cforest that is important in order to guarantee the correctness of the algorithms
RewriteTree and RewriteForest: the literals from distinct trees of the join graph may
only share variables that appear as key attributes at the root of their trees.
In Section 3.3.3, we introduce the notion of a “pessimistic” repair. The name comes
from the fact that, for a given query q and database I, if a tuple fails to satisfy the query
on some repair, then it also fails to satisfy the query on the pessimistic repair. More
precisely, for any inconsistent database I, there is a repair M such that if M |= q(~c),
then consistentΣ(q(~c), I) = true. This enables the algorithm to independently consider
each instantiation of the variables for the key of the root literal.
We then proceed to prove the correctness of the building blocks of the rewriting
algorithm. First, in Section 3.3.4, we prove the correctness of the module RewriteLocal,
for “atomic” queries, that is queries with a single literal (and hence no joins). In Section
3.3.5, we prove the correctness of the recursive algorithm RewriteTree that works on
queries whose join graph is a tree. Finally, in Section 3.3.6, this is generalized to the case
of queries whose join graph is a forest, which gives the correctness proof for the rewriting
algorithm RewriteForest for conjunctive queries in class Cforest.
3.3.1 Properties of Repairs
We first show a few important properties of repairs when the set of integrity constraints
consists of one key dependency per relation. These properties will be used throughout
the proofs of this and the next chapter.
Proposition 3.6. Let I be a database instance. Let ℐ be a repair of I wrt Σ. Then
ℐ ⊆ I.
Proof. Let I′ be an instance such that I′ |= Σ. Assume that there is a tuple ~t such that
~t ∈ I′ and ~t ∉ I. Let I′′ = I′ − {~t}. It is easy to see that by removing tuples from
an instance, we do not introduce violations with respect to a set of key dependencies.
Hence, I′′ |= Σ. Clearly, ∆(I, I′′) ⊂ ∆(I, I′). Therefore, I′ is not a repair of I wrt Σ.
Proposition 3.7. Let I be an instance. Let ℐ be a repair of I wrt Σ. Let R(~c, ~d) be a
tuple of I. Then, there exists some ~d′ such that R(~c, ~d′) is a tuple of ℐ.
Proof. Let I′ be an instance such that I′ |= Σ and R(~c, ~d′) ∉ I′, for every ~d′. Let
I′′ = I′ ∪ {R(~c, ~d)}. Since R(~c, ~d′) ∉ I′ for every ~d′, I′′ |= Σ. Clearly, ∆(I, I′′) =
∆(I, I′) − {R(~c, ~d)}. Since ∆(I, I′′) ⊂ ∆(I, I′), I′ is not a repair of I wrt Σ.
Proposition 3.8. Let I be an instance. Let R(~c, ~d) be a tuple of I. Then, there exists
some repair ℐ of I such that R(~c, ~d) ∈ ℐ.
Proof. Let I∗ be a repair of I wrt Σ. By Proposition 3.7, there exists ~d′ such that
R(~c, ~d′) ∈ I∗. Let I′ = I∗ − {R(~c, ~d′)} ∪ {R(~c, ~d)}. Since I∗ is a repair, I∗ |= Σ. Since I′
does not introduce any violation to the key dependencies of Σ, I′ |= Σ. Assume that I′
is not a repair of I. Then, there exists a repair I∗∗ of I such that ∆(I, I∗∗) ⊂ ∆(I, I′).
By Proposition 3.6, I∗ ⊆ I, and thus I′ ⊂ I. Furthermore, by Proposition 3.6, I∗∗ ⊆ I.
Thus, I − I∗∗ ⊂ I − I′. Therefore, I′ ⊂ I∗∗. Let I′′ = I∗∗ − {R(~c, ~d)} ∪ {R(~c, ~d′)}.
Clearly, I∗ ⊂ I′′. Thus, I∗ is not a repair; contradiction.
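Propositions 3.7 and 3.8 can be observed on a small instance. The Python sketch below is a hedged illustration: it takes Proposition 3.6 as given, so that repairs are computed as the maximal consistent subsets of the instance, and it encodes tuples as triples (relation, key, nonkey value). The function names and the encoding are assumptions.

```python
from itertools import combinations

def consistent(tuples):
    """No two tuples of the same relation share a key value."""
    keys = [(rel, key) for rel, key, _ in tuples]
    return len(keys) == len(set(keys))

def repairs(inst):
    """Maximal consistent subsets of inst (subsets suffice by Proposition 3.6)."""
    tuples = list(inst)
    subsets = [frozenset(c) for n in range(len(tuples) + 1)
               for c in combinations(tuples, n) if consistent(c)]
    return [s for s in subsets if not any(s < other for other in subsets)]

I = frozenset({("employee", "John", "Sales"),
               ("employee", "John", "Engineering"),
               ("dept", "Sales", "Peter")})
REPAIRS = repairs(I)

# Proposition 3.7: every key value of I survives in every repair.
prop_3_7 = all(any((r, k) == (rel, key) for r, k, _ in rep)
               for rep in REPAIRS for rel, key, _ in I)
# Proposition 3.8: every tuple of I belongs to some repair.
prop_3_8 = all(any(t in rep for rep in REPAIRS) for t in I)

print(len(REPAIRS), prop_3_7, prop_3_8)
```

Enumerating all subsets is exponential and is used here only to make the definitions concrete.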
3.3.2 A Structural Property of Cforest
In the next lemma, we show a structural property of the queries in Cforest that is important
in order to guarantee the correctness of the algorithm. In particular, we show that distinct
trees of the join graph may only share free variables (which do not contribute arcs to the
join graph) or variables that appear as key attributes at the root of their trees.
Lemma 3.9. Let q(~z) be a query such that q ∈ Cforest. Let G be the join graph of q.
Let Ti and Tj be distinct connected components of G. Let Ri(~xi, ~yi) and Rj(~xj, ~yj) be the
literals at the roots of Ti and Tj, respectively. Let w be a variable that occurs in a literal
of both Ti and Tj. Then, either w is free (w ∈ ~z) or w is in the key of the roots of both
trees (w ∈ ~xi ∩ ~xj).
Proof. Let ~wi = {w : w is a variable that occurs in some literal of Ti, w ∉ ~xi and w ∉ ~z}.
Let ~wj = {w : w is a variable that occurs in some literal of Tj, w ∉ ~xj and w ∉ ~z}.
Assume that there is some variable w such that w appears in ~wi and ~wj. Let S1(~u1, ~v1)
and S2(~u2, ~v2) be literals of Ti and Tj, respectively, such that w appears in S1 and S2.
We must now consider two cases. First, suppose that w occurs in ~v1. Then,
by definition of join graph, there is an arc from S1 to S2 in G. But S1 and S2 are in
distinct connected components of G; contradiction. Second, suppose that w occurs in
~u1. By definition of ~wi, S1 is not at the root of Ti (i.e., S1 ≠ Ri). Hence, there must
be a nonkey-to-key join from another literal, S3(~u3, ~v3), in Ti to S1. Since q is in Cforest,
all the nonkey-to-key joins of q are full. Thus, the variable w also appears in a nonkey
position in ~v3. Hence, there must be an arc in the join graph from S3 to S2. But S2 and
S3 are in distinct connected components of G; contradiction.
3.3.3 A “Pessimistic” Repair
In this subsection, we introduce the notion of a “pessimistic” repair. The name comes
from the fact that, for a given query q (in a class that we will define shortly) and database
I, if a tuple fails to satisfy the query on some repair, then it also fails to satisfy the query
on the pessimistic repair. More precisely, for every inconsistent database I, there is a
repair M such that if ~c ∈ q(M), then ~c ∈ consistentΣ(q, I). This is a fundamental
property for the following reason. Consider a Boolean query q = ∃~x, ~w.φ(~x, ~w) and a
query q′(~x) = ∃~w.φ(~x, ~w). That is, q and q′ have the same literals, but some of the
(existentially-quantified) variables of q are free in q′. Suppose that we would like to
check whether consistentΣ(q, I) = true. This holds if, for every repair ℐ of I, ℐ |= q. In
particular, since M is a repair of I, M |= q. Thus, there is some ~c such that ~c ∈ q′(M).
By Lemma 3.10 below, it follows that ~c ∈ consistentΣ(q′, I). This property will be
exploited in the design of our algorithms in order to check the consistency of each tuple
of ~x independently. Notice that the property does not hold in general for conjunctive
queries, as we show in the next example. However, it does hold for the queries that
satisfy the conditions of Lemma 3.10.
Example 3.6. Consider a schema R with two binary relations r1 and r2. Consider a set Σ
that consists of a key dependency for r1 and a key dependency for r2 (the key dependencies
will be obvious from the queries). Let qnk be the Boolean query ∃x, x′, y.r1(x, y)∧r2(x′, y).
Notice that qnk is not in Cforest because it contains a nonkey-to-nonkey join. Let I be an
instance such that I = {r1(a1, b1), r1(a1, b2), r1(a2, b3), r1(a2, b4), r1(a3, b5),
r1(a3, b3), r2(c1, b1), r2(c1, b3), r2(c2, b4), r2(c2, b5), r2(c3, b2), r2(c3, b3)}. It can be checked
that for every repair ℐ of I, ℐ |= qnk.
Now, consider the query q′nk(x) = ∃x′, y.r1(x, y) ∧ r2(x′, y). That is, qnk and q′nk differ
only in the fact that x is existentially-quantified in the former, and free in the latter. Let
I1 be a repair of I such that I1 = {r1(a1, b1), r1(a2, b3), r1(a3, b5), r2(c1, b3), r2(c2, b4), r2(c3, b3)}.
Let I2 be a repair of I such that I2 = {r1(a1, b1), r1(a2, b3), r1(a3, b5), r2(c1, b1), r2(c2, b4),
r2(c3, b2)}. Notice that (a1) ∉ q′nk(I1), (a2) ∉ q′nk(I2), and (a3) ∉ q′nk(I1). Thus, even
though consistentΣ(qnk, I) = true, we have that (a) ∉ consistentΣ(q′nk, I)
for every a. Therefore, it is not possible to check whether consistentΣ(qnk, I) = true
by independently checking each instantiation of the free variables of q′nk.
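Both claims of Example 3.6 can be confirmed by enumerating the 64 repairs of I. The Python sketch below is illustrative only; it assumes that the first attribute is the key of both r1 and r2, and the function names are hypothetical.

```python
from itertools import product

R1 = [("a1", "b1"), ("a1", "b2"), ("a2", "b3"), ("a2", "b4"), ("a3", "b5"), ("a3", "b3")]
R2 = [("c1", "b1"), ("c1", "b3"), ("c2", "b4"), ("c2", "b5"), ("c3", "b2"), ("c3", "b3")]

def choices(rel):
    """All ways of keeping one tuple per key value of a binary relation."""
    groups = {}
    for k, v in rel:
        groups.setdefault(k, []).append((k, v))
    return [set(choice) for choice in product(*groups.values())]

REPAIRS = [(r1, r2) for r1 in choices(R1) for r2 in choices(R2)]

def q_prime(r1, r2):
    """q'_nk(x) = ∃x',y. r1(x, y) ∧ r2(x', y); q_nk holds iff the result is nonempty."""
    ys = {y for _, y in r2}
    return {x for x, y in r1 if y in ys}

# The Boolean query q_nk is true in every repair ...
boolean_consistent = all(q_prime(r1, r2) for r1, r2 in REPAIRS)
# ... yet no single value of x is an answer of q'_nk in every repair.
consistent_x = set.intersection(*(q_prime(r1, r2) for r1, r2 in REPAIRS))

print(len(REPAIRS), boolean_consistent, consistent_x)
```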
The result that we give below assumes an input query q(~x) that is in Cforest, whose
join graph T is a tree, and whose free variables ~x are exactly the variables of the key of T ’s
root. In the algorithm RewriteForest, the input query will be broken into subqueries
that satisfy this condition.
Lemma 3.10. Let q(~x) be a query in Cforest, whose join graph T is a tree and where
R(~x, ~y) is the literal at the root of T. Let I be an instance. Then, there is a repair M
of I such that for all ~c, if ~c ∈ q(M), then ~c ∈ consistentΣ(q, I).
Proof. Let M be the instance built by invoking the procedure
BuildPessimisticRepair(q, I) given in Figure 3.6. Assume that q is of the form
q(~x) = ∃~w.φ(~w, ~x). We will prove the claim by induction on the number of literals of φ.
Base case. Assume that φ consists of exactly one literal R(~x, ~y). Let ~t be the tuple
selected by the algorithm in the iteration for literal R and the vector of values ~c. Assume
towards a contradiction that consistentΣ(∃~w.R(~x, ~y)[~x/~c], I) = false. Then, there is
some repair ℐ of I such that ℐ ⊭ ∃~w.R(~x, ~y)[~x/~c]. Since ~t ∈ I and ℐ is a repair of I,
by Proposition 3.7, there is some tuple ~t′ in ℐ and some ~d′ such that ~t′ = R(~c, ~d′). Since
ℐ ⊭ ∃~w.R(~x, ~y)[~x/~c], we have that {~t′} ⊭ ∃~w.R(~x, ~y)[~x/~c].
Notice that ~t and ~t′ can be added to M only during the iteration for the vector of
values ~c. Since {~t} |= ∃~w.R(~x, ~y)[~x/~c] and {~t′} ⊭ ∃~w.R(~x, ~y)[~x/~c], the algorithm never
selects tuple ~t. But ~t ∈ M; contradiction.
Inductive step. Assume that φ has more than one literal. Let T1, . . . , Tm be the
subtrees of T such that the root of Tj is a child of the root of T , for 1 ≤ j ≤ m. For each
1 ≤ j ≤ m, let Sj(~xj, ~yj) be the literal at the root of Tj. Let φj be the conjunction of
the literals of Tj. Let ~wj = {w : w is a variable of φj, and w ∉ ~xj}. Let qj(~xj) = ∃~wj.φj(~xj, ~wj).
Let Mj = BuildPessimisticRepair(qj, I).
Assume that M |= q(~x)[~x/~c]. Let ~t be the tuple of I selected by the algorithm in
the iteration for literal R and the vector of values ~c. Then, ~t ∈ M, and there is some
~d such that ~t = R(~c, ~d). Since M |= q(~x)[~x/~c], we have that for every j such that
1 ≤ j ≤ m, there is some valuation ν for the variables of ~y, and some ~cj such that
ν(~y) = ~d, ν(~xj) = ~cj, and Mj |= qj(~xj)[~xj/~cj].
Algorithm BuildPessimisticRepair
Input: q(~x), a query in Cforest of the form ∃~w.φ(~w, ~x),
       whose join graph T is a tree with root literal R(~x, ~y)
       Σ, a set of key constraints, one per relation
       I, an instance
Output: M, a repair of I
  Initialize M as an empty instance
  if φ has exactly one literal then
    for each ~c such that there is some R(~c, ~d) in I do
      if there is some ~d such that R(~c, ~d) ∈ I,
         and {R(~c, ~d)} ⊭ ∃~w.R(~x, ~y)[~x/~c] then
        Let ~t = R(~c, ~d)
      else
        Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d
      end if
      Add ~t to M
    end for
  else
    /* φ has more than one literal */
    Let S1, . . . , Sm be the children of R in T
    for j := 1 to m do
      Let Tj be the subtree of T whose root is Sj
      Let φj be the conjunction of literals of Tj
      Let ~wj = {w : w is a variable that occurs in φj and ~w, and w ∉ ~xj}
      Let qj(~xj) = ∃~wj.φj(~xj, ~wj)
      Let Mj = BuildPessimisticRepair(qj, I)
      Add Mj to M
    end for
    for each ~c such that there is some R(~c, ~d) in I do
      if there is some ~d, some j, some valuation ν for the variables of ~y,
         and some ~cj such that R(~c, ~d) ∈ I, ν(~y) = ~d, ν(~xj) = ~cj, and
         Mj ⊭ qj(~xj)[~xj/~cj] then
        Let ~t = R(~c, ~d)
      else
        Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d
      end if
      Add ~t to M
    end for
  end if
Figure 3.6: Algorithm to construct a “pessimistic” repair
Assume towards a contradiction that consistentΣ(q(~x)[~x/~c], I) = false. Then, there
is some repair ℐ of I such that ℐ ⊭ q(~x)[~x/~c]. Since ~t ∈ I and ℐ is a repair of I, by
Proposition 3.7, there is some tuple ~t′ in ℐ and some ~d′ such that ~t′ = R(~c, ~d′). By Lemma
3.9, none of the variables of ~wi appear in ~wj, for every i and j such that i ≠ j, 1 ≤ i ≤ m,
1 ≤ j ≤ m. Thus, there is some j, some valuation ν for the variables of ~y, and some tuple
of values ~c′j such that 1 ≤ j ≤ m, ℐ ⊭ qj(~xj)[~xj/~c′j], ν(~y) = ~d′, and ν(~xj) = ~c′j. Thus,
consistentΣ(qj(~xj)[~xj/~c′j], I) = false. By inductive hypothesis, Mj ⊭ qj(~xj)[~xj/~c′j].
Since Mj |= qj(~xj)[~xj/~cj], the algorithm never selects ~t in the construction of M. But
~t ∈ M; contradiction.
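The base case of BuildPessimisticRepair (Figure 3.6) can be illustrated on the instance I2 of Example 3.1. The Python sketch below is a simplified illustration that covers only queries with a single literal. The predicate earns_1000 stands for the literal employee(e, 1000), and all names are hypothetical. For each key value, the construction prefers a tuple that falsifies the literal, and the sketch then compares the answers on M with the consistent answers obtained by enumerating repairs.

```python
from itertools import product

def groups_by_key(rel):
    groups = {}
    for t in rel:
        groups.setdefault(t[0], []).append(t)
    return groups

def pessimistic_repair(rel, satisfies):
    """Per key value, pick a tuple that falsifies the literal if one exists."""
    return {next((t for t in rows if not satisfies(t)), rows[0])
            for rows in groups_by_key(rel).values()}

def consistent_keys(rel, satisfies):
    """Key values whose tuple satisfies the literal in every repair (brute force)."""
    groups = groups_by_key(rel)
    reps = [set(choice) for choice in product(*groups.values())]
    return {k for k in groups
            if all(any(t[0] == k and satisfies(t) for t in rep) for rep in reps)}

def earns_1000(t):
    return t[1] == 1000            # the literal employee(e, 1000)

I2 = {("John", 1000), ("John", 2000), ("Mary", 1000)}
M = pessimistic_repair(I2, earns_1000)
answers_on_M = {t[0] for t in M if earns_1000(t)}

print(sorted(M), answers_on_M, consistent_keys(I2, earns_1000))
```

Every answer obtained on M is a consistent answer, as stated by Lemma 3.10; in this small case the two sets coincide.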
3.3.4 Correctness of RewriteLocal
We now give a correctness proof of RewriteLocal, the module of the algorithm that
handles “atomic” queries, that is queries with a single literal (and hence no joins). These
atomic queries may have arbitrary selections and projections on any subset of the nonkey
attributes (more precisely, any of the nonkey attributes may be projected out of the
query result). We consider here only equality selections, but it is quite easy to see how to
extend the algorithm and the proof to more general selection conditions (including not
only inequalities, but also arbitrary first-order expressions relating the variables of the
literal).
Lemma 3.11. Let q(~x, ~z) be a query of the form ∃~w.R(~x, ~y). Let I be a database instance.
Let Qlocal(~x, ~z) be the first-order query returned by RewriteLocal(q, Σ).
Then, (~c,~t) ∈ Qlocal(I) iff (~c,~t) ∈ consistentΣ(q, I).
Proof. (⇒) Assume that I |= Qlocal(~x, ~z)[~x/~c][~z/~t]. Then, there is a tuple R(~c, ~d) such
that {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Assume towards a contradiction that
consistentΣ(∃~w.R(~x, ~y)[~x/~c][~z/~t], I) = false. Then, there is some repair I such that
I 6|= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Proposition 3.7, there is a tuple R(~c, ~d′) in I.
Following the construction of Qlocal in RewriteLocal, let σ be an injective function
that maps natural numbers to variables not present in R. Let ~y∗ be a vector of variables
of the same arity as ~y and such that if z is at position p of ~y∗, then σ(p) = z. Let ν and
ν ′ be valuations for the variables of ~x and ~y∗ such that ν(~x) = ~c, ν(~y∗) = ~d, ν ′(~x) = ~c,
and ν ′(~y∗) = ~d′.
Since {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t] and {R(~c, ~d′)} 6|= ∃~w.R(~x, ~y)[~x/~c][~z/~t], there
is some variable z at some position p of ~y∗ such that
Chapter 3. Rewritings for Conjunctive Queries 40
1. ν(z) 6= ν ′(z), and there is a constant at position p in ~y; or
2. ν(z) 6= ν ′(z), and there is some variable w such that w occurs at position p of ~y,
and w occurs in either ~x or ~z; or
3. there are variables w and z′, and a position p′ such that w occurs at position p of
~y, w occurs at position p′ of ~y, p 6= p′, z′ = σ(p′), and ν ′(z) 6= ν ′(z′).
Assume (1) that there is a constant d at position p in ~y. Since
{R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t], ν(z) = d. Since ν(z) 6= ν ′(z), there is a constant d′
such that d 6= d′ and ν ′(z) = d′. Notice in the algorithm RewriteLocal that since I |=Qlocal(~x, ~z)[~x/~c][~z/~t], we have that I |= ∀~y∗.R(~x, ~y∗) → z = d. Since I ⊆ I, R(~c, ~d′) ∈ I.
Thus, {R(~c, ~d′)} |= ∀~y∗.R(~x, ~y∗) → z = d. Therefore, ν ′(z) = d; contradiction.
Assume (2) that there is some variable w such that w occurs at position p of ~y,
and w occurs in either ~x or in ~z. Let c = ν(w). Since {R(~c, ~d)} |= ∃~w.R(~x, ~y∗)[~x/~c][~z/~t],
ν(z) = c. Since ν(z) 6= ν ′(z), ν ′(z) 6= c. Notice in the algorithm RewriteLocal that since
I |= Qlocal(~x, ~z)[~x/~c][~z/~t], we have that I |= ∀~y∗.R(~x, ~y∗) → z = w[w/c]. Since I ⊆ I,
R(~c, ~d′) ∈ I. Thus, {R(~c, ~d′)} |= ∀~y∗.R(~x, ~y∗) → z = w[w/c]. Therefore, ν ′(z) = c;
contradiction.
Assume (3) that there are variables w and z′, and a position p′ such that w occurs
at position p of ~y, w occurs at position p′ of ~y, p ≠ p′, z′ = σ(p′), and ν ′(z) ≠ ν ′(z′).
Notice in the algorithm RewriteLocal that since I |= Qlocal(~x, ~z)[~x/~c][~z/~t], we have that
I |= ∀~y∗.R(~x, ~y∗) → z = z′. Since I ⊆ I, R(~c, ~d′) ∈ I. Thus,
{R(~c, ~d′)} |= ∀~y∗.R(~x, ~y∗) → z = z′. Therefore, ν ′(z) = ν ′(z′); contradiction.
(⇐) Assume that consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = true. Assume towards a contradiction
that I ⊭ Qlocal(~x, ~z)[~x/~c][~z/~t]. Then, at least one of the following conditions
holds:
1. I ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]; or
2. there is a constant d at position p in ~y and a variable z such that z = σ(p) and
I ⊭ ∀~y∗.R(~x, ~y∗) → z = d[~x/~c][~z/~t]; or
3. there is some variable w such that w occurs at position p of ~y, w occurs in either
~x or ~z, and I ⊭ ∀~y∗.R(~x, ~y∗) → z = w[~x/~c][~z/~t]; or
4. there is some variable w that occurs at position p of ~y, and at a position p′ of ~y
such that p ≠ p′, σ(p) = z, σ(p′) = z′ and I ⊭ ∀~y∗.R(~x, ~y∗) → z = z′[~x/~c][~z/~t].
Assume that I ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let I be an arbitrary repair of I. Since I ⊆ I,
I ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]; contradiction.
Suppose that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Furthermore, assume that there is a constant
d at position p in ~y and a variable z such that z = σ(p) and I ⊭ ∀~y∗.R(~x, ~y∗) → z =
d[~x/~c][~z/~t]. Then, there is a tuple R(~c, ~d) in I such that
{R(~c, ~d)} ⊭ ∀~y∗.R(~x, ~y∗) → z = d[~x/~c][~z/~t]. This means that there is some constant e
at position p of ~d such that d ≠ e. Thus, {R(~c, ~d)} ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By
Proposition 3.8, there is a repair I of I such that R(~c, ~d) ∈ I. Assume that
I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let R(~c, ~d′) be a
tuple of I such that {R(~c, ~d′)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Since I is a repair of I, I satisfies
the key constraints of Σ. Thus, ~d = ~d′. Therefore, {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t];
contradiction.
Suppose that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Furthermore, assume that there is some
variable w such that w occurs at position p of ~y, w occurs in either ~x or ~z, and
I ⊭ ∀~y∗.R(~x, ~y∗) → z = w[~x/~c][~z/~t]. Then, there is a tuple R(~c, ~d) in I such that
{R(~c, ~d)} ⊭ ∀~y∗.R(~x, ~y∗) → z = w[~x/~c][~z/~t]. Let ν be a valuation for the variables of ~x and ~z such
that ν(~x) = ~c and ν(~z) = ~t. Let c = ν(w). Then, there is some constant e at position p of
~d such that c ≠ e. Thus, {R(~c, ~d)} ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Proposition 3.8, there is a
repair I of I such that R(~c, ~d) ∈ I. Assume that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let R(~c, ~d′) be
a tuple of I such that {R(~c, ~d′)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Since I is a repair of I, I satisfies
the key constraints of Σ. Thus, ~d = ~d′. Therefore, {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t];
contradiction.
Suppose that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Furthermore, assume that there is some
variable w that occurs at position p of ~y, and at a position p′ of ~y such that p ≠ p′,
σ(p) = z, σ(p′) = z′ and I ⊭ ∀~y∗.R(~x, ~y∗) → z = z′[~x/~c][~z/~t]. Then, there is a
tuple R(~c, ~d) in I such that {R(~c, ~d)} ⊭ ∀~y∗.R(~x, ~y∗) → z = z′[~x/~c][~z/~t]. Let ν
be a valuation for the variables of ~y∗ such that ν(~y∗) = ~d. Then, there are constants
d and e at the respective positions p and p′ of ~d such that d ≠ e. Thus,
{R(~c, ~d)} ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Proposition 3.8, there is a repair I of I such that
R(~c, ~d) ∈ I. Assume that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let R(~c, ~d′) be a tuple of I such that
{R(~c, ~d′)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Since I is a repair of I, I satisfies the key constraints
of Σ. Thus, ~d = ~d′. Therefore, {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]; contradiction.
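The statement of the lemma can be checked by brute force on a small instance. The following Python sketch is only an illustration under stated assumptions: it considers a single binary relation whose first attribute is the key, and a single-literal query R(x, d) with a constant d at the non-key position. The function names repairs, consistent_bruteforce and q_local are hypothetical. The function q_local evaluates conditions 1 and 2 of the proof directly on the inconsistent instance, and the loop compares it with an explicit enumeration of repairs.

```python
from itertools import product

def repairs(instance):
    """Yield every repair of a set of (key, value) tuples under the key constraint."""
    groups = {}
    for key, value in instance:
        groups.setdefault(key, set()).add(value)
    keys = sorted(groups)
    # A repair keeps exactly one value for each key value.
    for choice in product(*(sorted(groups[k]) for k in keys)):
        yield set(zip(keys, choice))

def consistent_bruteforce(instance, c, d):
    """True iff the tuple (c, d) belongs to every repair."""
    return all((c, d) in repair for repair in repairs(instance))

def q_local(instance, c, d):
    """Check the rewritten condition on the inconsistent instance itself:
    some tuple with key c has value d, and no tuple with key c has another value."""
    values = {v for (k, v) in instance if k == c}
    return d in values and all(v == d for v in values)

I = {("John", 1000), ("John", 2000), ("Mary", 1000)}
for c in ("John", "Mary", "Zoe"):
    print(c, q_local(I, c, 1000), consistent_bruteforce(I, c, 1000))
```

On this instance the two functions agree for every key value: only Mary has 1000 as a consistent answer, because John also has a conflicting tuple.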
3.3.5 Correctness of RewriteTree
Consider a Boolean query q = ∃~x, ~w.φ(~x, ~w) and a query q′(~x) = ∃~w.φ(~x, ~w). That is, q
and q′ have the same literals, but some of the (existentially-quantified) variables of q are
free in q′. In Lemma 3.10 above, we showed that if q′ is in a certain class of conjunctive
queries, then there is a “pessimistic” repair M such that for all ~c, if ~c ∈ q′(M), then
~c ∈ consistentΣ(q′, I). We also argued that this fact implies that, in order to check
whether consistentΣ(q, I) = true, it suffices to find some instantiation ~c for the free
variables of q′ such that ~c ∈ consistentΣ(q′, I). The latter condition is fundamental in
the design of our algorithm since it can be checked with a first-order query directly on the
inconsistent database I. In the next lemma, we show that the algorithm RewriteTree,
the main building block of RewriteForest, produces a first-order query that checks
precisely this condition.
Lemma 3.12. Let q(~x, ~z) be a query in Cforest whose join graph T is a tree with root
literal R(~x, ~y). Let I be an instance. Let Q(~x, ~z) be the first-order query returned by
RewriteTree(q, Σ).
Then, (~c,~t) ∈ Q(I) iff (~c,~t) ∈ consistentΣ(q, I).
Proof. The proof is by induction on the number of literals of q.
Base case Assume that q has exactly one literal. Then, q(~x, ~z) = ∃~w.R(~x, ~y),
and Q = RewriteLocal(q, Σ). By Lemma 3.11, we have that I |= Q(~x, ~z)[~x/~c][~z/~t]
iff consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = true.
(⇒) Notice in the algorithm RewriteLocal that, since I |= Qlocal[ν],
I |= ∃y1, . . . , ym.R(~x, ~y)[ν]. Let ~c = ν(~x). Then, there exists some ~d such that R(~c, ~d) ∈ I.
Let I be a repair of I. By Proposition 3.7, there is some ~d′ such that R(~c, ~d′) ∈ I.
Assume that there are no constants in ~y. Since all the variables of ~y are existentially
quantified in qT , {R(~c, ~d′)} |= qT [ν], and we are done.
Assume that there is some constant in ~y. Since all the variables of ~y are existentially
quantified in qT , in order to show that {R(~c, ~d′)} |= qT [ν], it suffices to show that ~d′ and
~y coincide in their constants. By Proposition 3.6, I ⊆ I. Thus, R(~c, ~d′) ∈ I. Since
I |= Qlocal[ν] and R(~c, ~d′) ∈ I, we have that |= Qconst[~y∗/~d′]. Therefore, it holds that if
there is a constant e at position i of ~d′, then |= Ei[wi/e], where wi is the variable created
in RewriteLocal for the i-th position of ~y. By construction of Ei, this means that there
is a constant e at position i of ~y.
(⇐) Let I be a repair of I. Let ~c = ν(~x). Since I |= qT [ν], there exists ~d such that
R(~c, ~d) ∈ I. By Proposition 3.6, I ⊆ I. Therefore, there exists ~d′ such that R(~c, ~d′) ∈ I.
Thus, I |= ∃y1, . . . , ym.R(~x, ~y)[ν].
Assume that there is some constant in ~y. Let νy be a valuation for the variables
of ~y∗, where ~y∗ is the vector of variables created in RewriteLocal. Let ~d be such that
~d = νy(~y∗). If R(~c, ~d) ∉ I, then I |= R(~x, ~y∗) → Qconst[ν][νy] because the left-hand side
of the implication is not satisfied. Assume R(~c, ~d) ∈ I. By Proposition 3.8, there exists
a repair I of I such that R(~c, ~d) ∈ I. Since I |= Σ, if R(~c, ~d′) ∈ I, then ~d′ = ~d. Since
I |= qT [ν], {R(~c, ~d)} |= qT [ν]. Therefore, if d is a constant that appears at position i in
~y, then d occurs at position i in ~d. Thus, I |= Qconst[ν][νy].
Inductive step Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the children of R in T . Assume
that q is of the form ∃~w.φ(~w, ~z), where φ is a conjunction of literals. For each 1 ≤ i ≤ m,
let Ti be the tree whose root is Ri. Let φi be the conjunction of the literals of Ti. Let
~wi = {w : w is a variable that occurs in φi and ~w, and w ∉ ~xi}. Let ~zi = {z : z
is a variable that occurs in φi and ~z, and z ∉ ~xi}. Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi).
Let Qi(~xi, ~zi) = RewriteTree(qi, Σ). Let qlocal(~x, ~z) = ∃~w.R(~x, ~y). Let Qlocal(~x, ~z) =
RewriteLocal(qlocal, Σ).
(⇒) Assume that I |= Q(~x, ~z)[~x/~c][~z/~t]. Then, there is a valuation ν for the variables
of φ such that:
1. ν(~x) = ~c, and
2. ν(~z) = ~t, and
3. I |= Qlocal(~x, ~z)[ν], and
4. for every i such that 1 ≤ i ≤ m, there are ~ci and ~ti such that ν(~xi) = ~ci, ν(~zi) = ~ti,
and I |= Qi(~xi, ~zi)[~xi/~ci][~zi/~ti]
Let I be a repair of I. Assume towards a contradiction that I ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t].
Then, consistentΣ(∃~w.R(~x, ~y)[~x/~c][~z/~t], I) = false. By Lemma 3.11, we have that
I ⊭ Qlocal(~x, ~z)[ν]; contradiction.
Assume that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Lemma 3.9, none of the variables of ~wi
appear in ~wj, for every i and j such that i ≠ j, 1 ≤ i ≤ m, 1 ≤ j ≤ m. Then, I ⊭ qi(~ci,~ti)
for some i such that 1 ≤ i ≤ m. Thus, consistentΣ(qi(~xi, ~zi)[~xi/~ci][~zi/~ti], I) = false.
By inductive hypothesis, I ⊭ Qi(~xi, ~zi)[~xi/~ci][~zi/~ti]; contradiction.
(⇐) Assume that consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = true. Assume towards a contradiction
that I ⊭ Q(~x, ~z)[~x/~c][~z/~t]. Let ν be a valuation for the variables of φ such
that ν(~x) = ~c and ν(~z) = ~t. By Lemma 3.9, none of the variables of ~wi appear in ~wj, for
every i and j such that i ≠ j, 1 ≤ i ≤ m, 1 ≤ j ≤ m. Then, either (1) I ⊭ Qlocal(~x, ~z)[ν];
or (2) there is some i such that I ⊭ Qi(~xi, ~zi)[ν].
Assume that I ⊭ Qlocal(~x, ~z)[ν]. By Lemma 3.11, consistentΣ(∃~w.R(~x, ~y)[~x/~c][~z/~t], I) =
false. Thus, it is the case that consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = false; contradiction.
Assume that there is some i such that I ⊭ Qi(~xi, ~zi)[ν]. By inductive hypothesis,
consistentΣ(qi(~xi, ~zi)[~xi/~ci][~zi/~ti], I) = false. Thus, it is the case that
consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = false; contradiction.
3.3.6 Correctness of RewriteForest
We are now ready to give the correctness proof of our rewriting algorithm, for all queries
in class Cforest. The intuition of the proof is the following. Assume that we are given
a query q in Cforest. Then, each of the connected components of the join graph of q
is a tree. Recall that RewriteTree, the algorithm for which we proved correctness in
the above lemma, requires that the input query satisfies the following conditions. First,
the join graph of the query must be a tree. Second, the free variables of the query
must include all the variables at key positions of the literal at the root of this tree.
In order to be able to use RewriteTree, RewriteForest produces a subquery for each
tree of the join graph such that the variables at the key of the corresponding tree’s
root are free. In this way, a first-order rewriting can be produced for each subquery by
invoking the algorithm RewriteTree. For each i, let Qi(~xi, ~zi) be the rewriting obtained
by invoking RewriteTree(qi, Σ). The query returned by RewriteForest has the form
Q(~z) = ∃~w.(φ(~w, ~z) ∧ ⋀i=1...m Qi(~xi, ~zi)), where φ(~w, ~z) is the conjunction of literals of
the original query q, and the variables of each ~xi are in ~w. The correctness of this formula
relies on the structural property of Section 3.3.2 and the notion of a “pessimistic” repair of
Section 3.3.3. First, by Lemma 3.10, it suffices to find one instantiation for the variables
of each ~xi. Thus, the variables of ~xi can be free in Qi. Second, the subqueries do not
share existentially-quantified variables. This is ensured by the structural property proved
in Lemma 3.9.
Theorem 3.5. Let R be a schema. Let Σ be a set of integrity constraints, consisting
of one key dependency per relation of R. Let q(~z) be a conjunctive query over R such
that q ∈ Cforest. Let Q(~z) be the first-order query returned by RewriteForest(q, Σ). Let
I be an instance over R.
Then, ~t ∈ Q(I) iff ~t ∈ consistentΣ(q, I).
Proof. Let G be the join graph of q. Since q ∈ Cforest, G is a forest. Let T1, . . . , Tm be
the connected components (trees) of G. Assume that q is of the form ∃~w.φ(~w, ~z), where
φ is a conjunction of literals. For each 1 ≤ i ≤ m, let Ri(~xi, ~yi) be the literal at the root
of Ti. Let φi be the conjunction of the literals of Ti. Let ~wi = {w : w is a variable that
occurs in φi and ~w, and w ∉ ~xi}. Let ~zi = {z : z is a variable that occurs in φi and ~z,
and z ∉ ~xi}. Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi). Let Qi(~xi, ~zi) = RewriteTree(qi, Σ).
(⇒) Assume that I |= Q(~z)[~z/~t]. Then, there is a valuation ν for the variables of φ
such that:
1. ν(~z) = ~t, and
2. I |= φ(~w, ~z)[ν], and
3. for every i such that 1 ≤ i ≤ m, there are ~ci and ~ti such that ν(~xi) = ~ci, ν(~zi) = ~ti,
and I |= Qi(~xi, ~zi)[~xi/~ci][~zi/~ti]
Let I be a repair of I. Assume towards a contradiction that I ⊭ q[~z/~t]. Thus,
I ⊭ q[ν]. By Lemma 3.9, none of the variables of ~wi appear in ~wj, for every i and j
such that i ≠ j, 1 ≤ i ≤ m, 1 ≤ j ≤ m. Then, I ⊭ qi(~xi, ~zi)[~xi/~ci][~zi/~ti] for some i such
that 1 ≤ i ≤ m. Thus, consistentΣ(qi(~xi, ~zi)[~xi/~ci][~zi/~ti], I) = false. By Lemma 3.12,
I ⊭ Qi(~xi, ~zi)[~xi/~ci][~zi/~ti]; contradiction.
(⇐) Assume that ~t ∈ consistentΣ(q, I). Assume towards a contradiction that
I ⊭ Q(~z)[~z/~t]. Let ν be a valuation for the variables of φ such that ν(~z) = ~t. Then, either
(1) I ⊭ q(~z)[ν]; or (2) there is some i such that I ⊭ Qi(~xi, ~zi)[ν].
We will build a repair M of I as follows. For each i, let Ii be the projection of
I on the relation symbols of φi. By Lemma 3.10, there is a repair Mi such that if
Mi |= qi(~xi)[~xi/~ci], then consistentΣ(qi(~xi)[~xi/~ci], Ii) = true. We add all the tuples of
Mi to M.
We now show that M ⊭ q(~z)[ν]. Assume that I ⊭ q(~z)[ν]. Since M ⊆ I,
M ⊭ q(~z)[ν]. Now, assume that there is some i such that 1 ≤ i ≤ m and I ⊭ Qi(~xi, ~zi)[ν]. By
Lemma 3.12, consistentΣ(qi(~xi, ~zi)[ν], I) = false. By Lemma 3.10, Mi ⊭ qi(~xi, ~zi)[ν].
Thus, M ⊭ q(~z)[ν].
So, for every valuation ν such that ν(~z) = ~t, we have that M ⊭ q(~z)[ν]. Thus,
~t ∉ consistentΣ(q, I); contradiction.
3.4 Related Work
In their seminal paper on consistent query answering, Arenas, Bertossi and Chomicki
[ABC99] propose a first-order rewriting algorithm. The algorithm applies to a broad
class of constraints but a restricted class of queries, called quantifier-free conjunctive
queries. In these queries, all variables are free (i.e., there is no existential quantification).
If we think in terms of equivalent SQL queries, the fact that all variables are free means
that every attribute of every relation in the from clause must appear in the select
clause. This is a strong restriction that rules out many practical queries. As an empirical
observation, none of the queries in the TPC-H specification [TPC03], the industry stan-
dard for decision support systems, satisfy this restriction. Chomicki and Marcinkowski
[CM05] propose a rewriting for another restricted class, where no variables are shared
between literals (and therefore, there are no joins). In this chapter, we focused on a class
of conjunctive queries that may have existential quantification, and we argued that the
class captures many queries that arise in practice.
Except for the aforementioned work [ABC99, CM05], to the best of our knowledge,
none of the work in the consistent query answering literature has focused on first-order
rewritings. Instead, they typically produce rewritings into disjunctive logic programs
[ABC00, CB00, GZ00, FPL+01, GGZ01, LLR02, ABC03a, BB03a, BB03b, EFGL03,
CB05]. Their focus is on obtaining correct disjunctive logic programs for (usually large)
classes of queries and constraints. However, given the high complexity of disjunctive
logic programming, none of these approaches focus on tractability issues. Tractability
results have been given in the context of databases with OR-objects [IvdMV95]. As
we mentioned in Section 2.5, OR-objects can be used in some (though not all) cases
to represent databases inconsistent with respect to key constraints. To the best of our
knowledge, query rewriting has not been studied in the context of OR-objects.
Our work on first-order query rewriting has been subsequently extended by other
authors. Grieco, Lembo, Rosati and Ruzzi [GLRR05] show a query rewriting algorithm
for our class Cforest under exclusion constraints (that is, constraints that restrict values
to appear in exactly one of two relations). In a recent paper [LRR06], Lembo, Rosati,
and Ruzzi extend the class Cforest to consider queries that may have the union operation.
Chapter 4
Rewritings for Queries with
Grouping and Aggregation
In the previous chapter, we presented query rewritings for queries with set semantics and
no aggregation. However, practical query languages like SQL have bag semantics (dupli-
cates are not eliminated unless explicitly requested), and support aggregation functions
and grouping of results. For this reason, in this chapter we present rewritings for queries
with bag semantics, grouping, and aggregation.
4.1 Formal Language
Despite extensive research on queries with bag semantics and aggregation [CV93, IR95,
LW97, GM96, GRT99, CNS99, HLNW01, CNS03], there is no commonly agreed formal
language for this kind of queries, with different researchers proposing different (but of-
ten equivalent) languages. For this reason, in this section, we introduce languages for
first-order aggregate queries and conjunctive aggregate queries that are influenced by the
previous proposals. The former language will be used to express our query rewritings,
whereas the latter will be used for the input queries (i.e., the queries for which we compute
consistent answers). The language of first-order aggregate queries extends the language
of first-order logic with operators for grouping and aggregation. Aggregate conjunctive
queries are a subset of first-order aggregate queries.
Our language for first-order aggregate queries is based on the one given by Cohen,
Nutt and Sagiv [CNS03], except for the fact that we use a “SQL-like” syntax to specify
grouping and aggregation. The language can be shown to be a subset of the aggregate
logic Laggr introduced by Hella, Libkin, Nurmonen, and Wong [HLNW01]. We do not
explicitly provide the bag manipulation operators (such as additive union, maximum
union, etc.) that are given in bag algebras [GM96, LW97].
Bags and aggregation functions. A bag (or multiset) is a collection of elements,
each of which occurs one or more times in the collection. We will denote the multiplicity
(number of occurrences) of each element x of a bag B as |x|B. If S is a domain, we
denote by B(S) the set of finite bags over S. A k -ary aggregation function is a function
F : B(Ck) → R that maps bags of k-tuples of constants from some underlying domain C
to real numbers. In particular, we will consider the functions sum, min, and max, which
return the sum, minimum, and maximum of a bag of tuples, and the function count(*),
which returns the cardinality of a bag of tuples.
We will consider a bag-set query semantics [CV93], where relations (and their re-
pairs) are assumed to be sets, but the aggregate queries manipulate bags. For example,
consider a database I = {employee(John, 1000), employee(Mary, 1000)} and a query
q that retrieves the salaries (the second attribute of relation employee), expressed as
q(s) = ∃e. employee(e, s). Under bag-set semantics, the result of q(I) is {{1000, 1000}}
(that is, 1000 has multiplicity two in the result).
Language syntax. A first-order aggregate query q may be either:
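The difference between the two semantics on this example can be seen in a few lines of Python. This is a minimal sketch, assuming the relation is represented as a Python set of (name, salary) pairs and the bag as a collections.Counter; the variable names are illustrative only.

```python
from collections import Counter

# The relation is a set: no duplicate tuples.
I = {("John", 1000), ("Mary", 1000)}

# Set semantics: projecting on the salary collapses duplicates.
set_result = {s for (_e, s) in I}

# Bag-set semantics: the input is a set, but the projection keeps multiplicities.
bag_result = Counter(s for (_e, s) in I)

print(set_result)        # a set with the single element 1000
print(bag_result[1000])  # multiplicity of 1000 in the bag
```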
1. a first-order formula; or
2. a formula of the form
select ~z, F1(~v1), . . . , Fm(~vm)
from q∗(~w, ~z)
group by ~z
where q∗ is a first-order aggregate query, ~w and ~z do not share variables, ~v1, . . . , ~vm
are vectors of variables from ~w, and F1, . . . , Fm are aggregation functions with
arities |~v1|, . . . , |~vm|. We will say that ~z are the grouping variables of the query, and
~v1, . . . , ~vm are the aggregation variables.
Language semantics. We now define how to obtain a set of tuples by applying a
first-order aggregate query q to a database I. (Even though aggregate functions take
bags as input, the final result of a query is always a set because it has one tuple for each
“group”).
If the aggregate query is just a first-order formula (Case 1 above), its semantics
corresponds to the semantics of first-order queries. If the query is of the form of Case
2 above, the aggregate query q is evaluated as follows. First, we retrieve “groups” that
satisfy q∗ (i.e., all the satisfying assignments for the grouping variables ~z). Second, for
each group ~a (i.e., for each instantiation of the grouping variables ~z), we obtain the bag
of tuples σ~a that satisfy q∗ and whose projection on ~z is ~a (the tuples of σ~a are on both
the grouping variables ~z and the other free variables ~w of q∗). Third, for each group ~a
and aggregation function Fi, we create a bag Bi,~a by taking each tuple (~c,~a) of σ~a and
projecting on the aggregation variables ~vi. Finally, we apply every aggregation function
Fi to the corresponding bag Bi,~a.
More formally, for every database instance I, tuple ~a and real numbers b1, . . . , bm, we
say that (~a, b1, . . . , bm) ∈ q(I) if there is a set σ~a such that:
• I |= ∃~w.q∗(~w,~a), and
• σ~a = {(~c,~a) : (~c,~a) ∈ q∗(I)}, and
• for every i such that 1 ≤ i ≤ m, bi = Fi(Bi,~a), where Bi,~a is the bag obtained by
taking each tuple (~c,~a) of σ~a and projecting on the aggregation variables ~vi.
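The evaluation steps above can be mirrored by a short Python sketch. It is an illustration under assumptions, not part of the formal definition: the answers to q∗ are given as a set of tuples, variables are identified by tuple positions, and the name evaluate_aggregate and the helper functions count_star and sum_first are hypothetical.

```python
def evaluate_aggregate(q_star_answers, group_positions, aggregates):
    """Evaluate 'select z, F1(v1), ..., Fm(vm) from q* group by z'.

    q_star_answers  -- set of tuples satisfying q*
    group_positions -- positions of the grouping variables z
    aggregates      -- list of (function, positions of the aggregation variables)
    """
    def project(t, positions):
        return tuple(t[p] for p in positions)

    # Step 1: the groups are the satisfying assignments for z.
    groups = {project(t, group_positions) for t in q_star_answers}
    result = set()
    for a in groups:
        # Step 2: sigma_a holds the tuples of q* whose projection on z is a.
        sigma_a = [t for t in q_star_answers if project(t, group_positions) == a]
        row = a
        for func, positions in aggregates:
            # Step 3: the bag B_{i,a} is a projection that keeps duplicates.
            bag = [project(t, positions) for t in sigma_a]
            # Step 4: apply the aggregation function to the bag.
            row += (func(bag),)
        result.add(row)
    return result

def count_star(bag):
    return len(bag)

def sum_first(bag):
    return sum(v[0] for v in bag)

# q*(e, s) = employee(e, s): position 0 is e, position 1 is s.
I = {("John", 1000), ("Mary", 1000)}
by_salary = evaluate_aggregate(I, [1], [(count_star, [])])
total = evaluate_aggregate(I, [], [(sum_first, [1])])
print(by_salary, total)
```

The second call groups by the empty vector, so it produces a single group whose bag of salaries contains 1000 twice.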
We now define the language of conjunctive aggregate queries as a subset of first-order
aggregate queries. A conjunctive aggregate query is a formula of the form
select ~z, F1(~v1), . . . , Fm(~vm)
from q∗(~w, ~z)
group by ~z
where q∗(~w, ~z) is a conjunctive query, ~v1, . . . , ~vm are vectors of variables from ~w, and
F1, . . . , Fm are aggregation functions of the arities of ~v1, . . . , ~vm. We will say that ~z are
the grouping variables, and ~v1, . . . , ~vm are the aggregation variables. The semantics is the
same as for first-order aggregate queries.
As with first-order aggregate queries, the language of conjunctive aggregate queries is
influenced by previous proposals. In particular, it corresponds closely to the language pre-
sented by Cohen, Nutt and Serebrenik [CNS99], except that we use a “SQL-like” syntax
instead of a Datalog syntax. It is also related to the language of real conjunctive queries
(conjunctive queries with bag semantics) introduced by Chaudhuri and Vardi [CV93],
and the class of conjunctive queries with label systems representing multisets presented
by Ioannidis and Ramakrishnan [IR95]. In the latter two cases, tuples are returned to-
gether with their multiplicity. This can be obtained in our conjunctive aggregate queries
by using the aggregation function count(∗).
4.2 Algorithms
In this section, we present query rewriting algorithms under the aggconsistentΣ se-
mantics for a class of queries that extends the class Cforest of the previous chapter with
operators for grouping and aggregation. In Section 4.2.1, we present the rewriting algo-
rithm for queries with bag semantics (i.e., the count(*) operator), and in Section 4.2.2
we present the algorithm for queries with the unary aggregation functions sum, min, and
max.
4.2.1 Queries with Bag Semantics
In this subsection, we give a query rewriting algorithm for conjunctive queries with bag
semantics (i.e., the count(*) operator). We start with an example, and then give the
general algorithm. The example illustrates how we can build upon the results for query
rewriting of conjunctive queries under set-theoretic semantics from the previous chapter.
Example 4.1. Let R be a schema with one relation symbol employee. Assume that employee
has two attributes: emplKey (the name of the employee) and salary. Let Σ be a set that
consists of only one constraint stating that emplKey is the key of relation employee.
Consider the following query q1, which counts the number of occurrences of each
salary (it corresponds to query q3 of Example 2.1).
q1(s, v): select s, count(*) as v
from employee(e, s)
group by s
Let I be a database instance such that I = {employee(John, 1000), employee(John, 2000),
employee(Mary, 1000), employee(Ali, 1000)}. There are two repairs of I with respect to
Σ: I1 = {employee(John, 1000), employee(Mary, 1000), employee(Ali, 1000)} and I2 =
{employee(John, 2000),employee(Mary, 1000), employee(Ali, 1000)}. Furthermore, q1(I1) =
{(1000, 3)} and q1(I2) = {(1000, 2), (2000, 1)}. By Definition 2.4, aggconsistentΣ(q1, I) =
{(1000, 2, 3)}. That is, the salary 1000 is an answer that appears at least twice and at
most three times in the result of applying q1 on the repairs.
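The bounds of this example can be reproduced by enumerating the repairs explicitly. The Python sketch below is only a brute-force illustration and does not scale. It assumes, as in the discussion above, that a group is reported only if it appears in the result of q1 on every repair, and that the bounds are the minimum and maximum count over the repairs. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

def repairs(instance):
    """Yield every repair of employee(emplKey, salary) under the key constraint."""
    groups = {}
    for key, value in instance:
        groups.setdefault(key, set()).add(value)
    keys = sorted(groups)
    for choice in product(*(sorted(groups[k]) for k in keys)):
        yield set(zip(keys, choice))

def q1(db):
    """select s, count(*) from employee(e, s) group by s"""
    return Counter(s for (_e, s) in db)

def agg_consistent_bruteforce(instance):
    results = [q1(r) for r in repairs(instance)]
    # Keep only the salaries that occur in the answer on every repair.
    common = set.intersection(*(set(r) for r in results))
    return {(s, min(r[s] for r in results), max(r[s] for r in results))
            for s in common}

I = {("John", 1000), ("John", 2000), ("Mary", 1000), ("Ali", 1000)}
print(agg_consistent_bruteforce(I))
```

The salary 2000 is dropped because it is missing from the result on the repair that keeps employee(John, 1000).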
Let us focus on obtaining the greatest lower bound for q1. From the previous chapter,
we know how to obtain consistent answers for conjunctive queries without aggregation
under set-theoretic semantics. We would like to reuse such results here. An obvious
strategy (shown to be incorrect shortly) is to first remove grouping and aggregation
from q1, obtain the consistent answers under set-theoretic semantics, and finally apply
grouping and aggregation to the intermediate result. That is, first compute the consistent
answers for the following query q′1(s):
select s
from employee(e, s)
We can express q′1 in conjunctive query notation as follows: q′1(s) = ∃e. employee(e, s).
Let QConsistent′(s) be the first-order query obtained by applying RewriteForest(q′1, Σ),
the algorithm introduced in the previous chapter. Suppose that we now apply the operator
count(*) to the result of QConsistent′(s) as follows:
select s, count(*)
from QConsistent′(s)
group by s
It is easy to see that this strategy leads to a wrong result. Since the result of the
consistent answers to q′1 (consistentΣ(q′1, I)) is {(1000)}, we would incorrectly conclude
that the greatest lower bound for 1000 is one, when in fact it is two. Clearly, the cause
for the incorrect result is that cardinalities are lost in the set-theoretic consistent answers
that we computed as an intermediate step. But, is there any way of obtaining the correct
bounds for the aggregate query, and yet be able to reuse the notion of set-theoretic
consistent answers as an intermediate step? The answer is positive: we can use a “root
key value at a time” principle. In this case, this corresponds to making the variable e
(for employee name) free because it is at the key position of employee(e, s), the literal
at the root (and only node) of q′1. We will obtain the consistent answer one employee
at a time in the intermediate result, and then project out the employees (since they
are not retrieved by q1). The intermediate result will be guaranteed to have the correct
cardinalities despite the fact that it is obtained using set semantics. The intuitive reason
is that repairs are sets of tuples that satisfy the key constraints, and hence every employee
name appears exactly once in each repair.
Following the previous discussion, let q′′1 be the query q′1, where the variable e is made
free. That is, let q′′1(e, s) = employee(e, s). The set-theoretic consistent answers for q′′1 are
consistentΣ(q′′1 , I) = {(Mary, 1000), (Ali, 1000)}. We can now project out the employee
names and count the number of occurrences of salary 1000, arriving at the correct lower
bound for count(*) in q1.
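The contrast between the naive strategy and the "root key value at a time" strategy can be sketched in Python on the instance of this example. The sketch assumes a single relation with key emplKey, for which a tuple (e, s) is a consistent answer to q′′1 exactly when no other tuple with key e carries a different salary. The function name consistent_tuples is hypothetical.

```python
from collections import Counter

I = {("John", 1000), ("John", 2000), ("Mary", 1000), ("Ali", 1000)}

def consistent_tuples(instance):
    """Consistent answers to q''(e, s) = employee(e, s): keys with a single salary."""
    salaries = {}
    for e, s in instance:
        salaries.setdefault(e, set()).add(s)
    return {(e, s) for (e, s) in instance if len(salaries[e]) == 1}

# Naive strategy: project out e first (set semantics), then count.
naive = Counter({s for (_e, s) in consistent_tuples(I)})

# Root key value at a time: keep e free, then count the tuples for each salary.
glb = Counter(s for (_e, s) in consistent_tuples(I))

print(naive[1000], glb[1000])
```

The naive count for salary 1000 is one because the projection collapses Mary and Ali into a single value, whereas keeping the key free preserves both tuples.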
Let us now turn our attention to the computation of the lowest upper bound of q1.
Since aggconsistentΣ(q1, I) = {(1000, 2, 3)}, the salary 1000 is an answer that appears
at most three times in the results of applying q1 to the repairs. We can use q′′1(e, s) =
employee(e, s) to obtain the lowest upper bound of salary 1000 as follows:
select s, count(*) as lub
from q′′1(e, s)
group by s
However, this query also retrieves the tuple (2000, 1) which should not be in the result
of aggconsistentΣ(q1, I) because the salary 2000 does not appear in q1(I1). This means
that we must make sure that the values for the grouping variables are in the consistent
answers for q′′1 . We can do this by employing the first-order rewriting QConsistent(e, s)
of query q′′1 , which can be obtained by invoking the algorithm RewriteForest. Now, we
can rule out 2000 from the final result because there is no tuple for salary 2000 in the
result of QConsistent(e, s). This can be achieved with the following query:
select s, count(*) as lub
from employee(e, s) ∧ ∃e′.QConsistent(e′, s)
group by s
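The effect of the additional conjunct can be checked on the instance of this example with a short Python sketch. As an assumption made for illustration, QConsistent is computed directly as the set of salaries carried by some employee whose key has no conflicting tuple; the function name consistent_salaries is hypothetical.

```python
from collections import Counter

I = {("John", 1000), ("John", 2000), ("Mary", 1000), ("Ali", 1000)}

def consistent_salaries(instance):
    """Salaries s for which some e' satisfies QConsistent(e', s)."""
    salaries = {}
    for e, s in instance:
        salaries.setdefault(e, set()).add(s)
    return {next(iter(v)) for v in salaries.values() if len(v) == 1}

# Without the extra conjunct, the inconsistent group 2000 is also returned.
unfiltered = Counter(s for (_e, s) in I)

# With the conjunct, only groups that have a consistent answer are counted.
ok = consistent_salaries(I)
lub = Counter(s for (_e, s) in I if s in ok)

print(dict(unfiltered), dict(lub))
```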
Query Rewriting Algorithm
In Figure 4.1, we give the rewriting algorithm for aggregate conjunctive queries with
the count(∗) aggregation function. The algorithm works for queries q of the form
select ~z, count(*)
from q∗(~z)
group by ~z
where q∗ is a conjunctive query in Cforest. The reason for requiring q∗ to be in Cforest is
that, as we motivated in the previous example, we would like to build upon the results for
first-order rewriting of conjunctive queries under set-theoretic semantics. In the previous
chapter, we showed how to obtain such rewritings for the conjunctive queries in class
Cforest.
By definition, the join graph of all queries in Cforest is a forest. We can then instantiate
the values for the key attributes at each root literal of the join graph of q∗, using the
“root key value at a time” strategy that we illustrated in the previous example. More
precisely, let G be the join graph of q∗. We will construct a conjunctive query q′ that
has the same literals as q∗, but all the variables that are at the key of some root of G are
free in q′.
Following the algorithm, let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots
of all trees in G. Let ~x = ⋃i=1...m ~xi, let ~z′ = ~z − ~x. Let φ(~w, ~z) be the conjunction
of literals of q∗, and let ~w′ = ~w − ~x. We define q′ as q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′). The
advantage of query q′ is that since the variables at the key of all root literals are free,
each tuple appears exactly once in the answer to q′ in the repairs (we will show this
formally in Lemma 4.4). Thus, set and bag-set semantics coincide in the answer to q′.
We can exploit this fact by computing the set-theoretic consistent answers for q′ as an
intermediate result towards producing the consistent answers to the aggregate query q.
The first-order query rewriting QConsistent for q′ is obtained by invoking the algorithm
RewriteForest given in Figure 3.2 of Chapter 3.
The greatest lower bound is computed with the following query, which counts the
number of occurrences of tuples for ~z (the grouping variables) in the consistent answer
to q′.
QGlb(~z, low) = select ~z, count(*)
from QConsistent(~x, ~z′)
group by ~z
Notice that the free variables of QConsistent, ~x and ~z′, contain the variables of ~z, but
may have additional variables. In the final result, we are projecting out these additional
variables, since they are not in the select clause of the query q.
The lowest upper bound is obtained by counting the number of tuples that satisfy
q′(~x, ~z′) and checking that some instantiation of the grouping variables of ~z appears in the
RewriteCount(q, Σ)
Input: A query q of the form
select ~z, count(*)
from q∗(~z)
group by ~z
where q∗ is a conjunctive query in Cforest
Σ, a set of key constraints (one per relation)
Output: Q, an aggregate first-order query that computes aggconsistentΣ(q, I)
for every database I
Let G be the join graph of q
Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots of all trees of G
Let ~x = ⋃i=1...m ~xi
Let ~z′ = ~z − ~x
Let φ(~w, ~z) be the conjunction of literals of q∗
Let ~w′ = ~w − ~x
Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′)
Let QConsistent(~x, ~z′) be the query obtained by invoking RewriteForest(q′, Σ)
Let QGlb(~z, low) = select ~z, count(*)
from QConsistent(~x, ~z′)
group by ~z
Let ~x′ = ~x− ~z
Let QLub(~z, up) = select ~z, count(*)
from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~z
Let Q(~z, low, up) = QGlb(~z, low) ∧ QLub(~z, up)
return Q
Figure 4.1: Query rewriting algorithm for queries with count(*).
consistent answers of q′. This is obtained with the query ∃~x′.QConsistent(~x, ~z′), where
~x′ are the variables of ~x that are not free variables of q.
QLub(~z, up) = select ~z, count(*)
from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~z
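As an illustration, the two count(*) bounds can be sketched for the simplest case: a single relation whose key is one attribute. The table acct, its columns, and the sample rows are hypothetical, and the NOT EXISTS view is a hand-written stand-in for what RewriteForest would return on a one-literal query, not its actual output.

```python
import sqlite3

# Hypothetical relation acct(id, city) with key id.
# The key value 'a2' violates the key constraint.
rows = [("a1", "Toronto"), ("a2", "Toronto"), ("a2", "Ottawa"),
        ("a3", "Ottawa"), ("a4", "Toronto")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acct (id TEXT, city TEXT)")
conn.executemany("INSERT INTO acct VALUES (?, ?)", rows)

# Assumed stand-in for QConsistent(id, city): pairs that hold in every
# repair, i.e. the key value is not associated with any other city.
conn.execute("""
    CREATE VIEW q_consistent AS
    SELECT DISTINCT a.id, a.city
    FROM acct a
    WHERE NOT EXISTS (SELECT 1 FROM acct b
                      WHERE b.id = a.id AND b.city <> a.city)
""")

# QGlb: count the consistent (id, city) pairs in each group.
glb = dict(conn.execute(
    "SELECT city, COUNT(*) FROM q_consistent GROUP BY city").fetchall())

# QLub: count all (id, city) pairs of q', keeping only the groups that
# have at least one consistent pair (the conjunct ∃x'.QConsistent).
lub = dict(conn.execute("""
    SELECT p.city, COUNT(*)
    FROM (SELECT DISTINCT id, city FROM acct) AS p
    WHERE p.city IN (SELECT city FROM q_consistent)
    GROUP BY p.city
""").fetchall())

# Q(z, low, up): combine the two bounds on the grouping attribute.
ranges = {city: (glb[city], lub[city]) for city in glb}
```

On this instance the repair that keeps (a2, Ottawa) has 2 Toronto accounts and the repair that keeps (a2, Toronto) has 3, which agrees with the range computed for Toronto.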
4.2.2 Queries with the sum, min, and max Functions
In Figure 4.2, we present the query rewriting algorithm for queries with the sum, min,
and max aggregation functions. The main difference with the rewritings produced by
RewriteCount is that aggregation is performed here in two levels. At the inner level of
the rewriting, we aggregate the values for u (the value that is aggregated in the original
query), and we group by the key-root attributes (vector ~x in the figure). We then project
out the key-root attributes that are not in the select clause of the input query, and
apply the aggregation function of the input query.
For example, the greatest lower bound of the max function is computed as follows:
QGlb(~z, low) =
select ~z, max(bottom)
from
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
group by ~z
Notice that, as in RewriteCount, the lower bound is obtained by selecting tuples from
QConsistent(~x, ~z′). In addition, we now have a conjunct q′′(~x, ~z′, u), which retrieves the
values for the aggregate attribute u. The inner level of aggregation consists in this case
of the computation of the bottom attribute, as the minimum for the values retrieved for
u. The outer level applies the max function (i.e., the function of the original query) to
the values of the bottom attribute.
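The two levels of aggregation can be sketched on a single relation. The table emp(id, dept, salary) with key id, the sample rows, and the q_consistent view are all assumptions made for this illustration; the view is a hand-written stand-in for QConsistent on a one-literal query.

```python
import sqlite3

# Hypothetical relation emp(id, dept, salary) with key id.
# 'e1' has two salaries and 'e3' has two departments.
rows = [("e1", "db", 50), ("e1", "db", 70), ("e2", "db", 60),
        ("e3", "db", 90), ("e3", "ai", 90), ("e4", "ai", 40)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", rows)

# Assumed stand-in for QConsistent(id, dept): the employee is in this
# department in every repair (the salary may still vary).
conn.execute("""
    CREATE VIEW q_consistent AS
    SELECT DISTINCT a.id, a.dept
    FROM emp a
    WHERE NOT EXISTS (SELECT 1 FROM emp b
                      WHERE b.id = a.id AND b.dept <> a.dept)
""")

# Inner level: min(salary) per key-root value (id) and group (dept).
# Outer level: the aggregate of the original query (max) per group.
glb_max = dict(conn.execute("""
    SELECT dept, MAX(bottom)
    FROM (SELECT e.id AS id, e.dept AS dept, MIN(e.salary) AS bottom
          FROM emp e
          JOIN q_consistent c ON c.id = e.id AND c.dept = e.dept
          GROUP BY e.id, e.dept) AS inner_level
    GROUP BY dept
""").fetchall())
```

Here e3 contributes nothing to the lower bound because some repair places it in the other department, and e1 contributes only its smallest salary.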
RewriteAgg(q, Σ)
Input: A query q of the form
select ~z, [max(u)|min(u)|sum(u)]
from q∗(~z, u)
group by ~z
where q∗ is a conjunctive query in Cforest
Σ, a set of key constraints (one per relation)
Output: Q, an aggregate first-order query that computes aggconsistentΣ(q, I)
for every database I
Let G be the join graph of q
Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots of all trees of G
Let ~x = ⋃_{i=1...m} ~xi
Let ~z′ = ~z − ~x
Let φ(~w, ~z, u) be the conjunction of literals of q∗
Let ~w′ = ~w − ~x
Let q′(~x, ~z′) = ∃ ~w′, u.φ(~x, ~w′, ~z′, u)
Let QConsistent(~x, ~z′) be the query obtained by invoking RewriteForest(q′,Σ)
Let q′′(~x, ~z′, u) = ∃ ~w′.φ(~x, ~w′, ~z′, u)
Let ~x′ = ~x− ~z − u
if the aggregate function is max then
QGlb(~z, low) =
select ~z, max(bottom)
from
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
group by ~z
QLub(~z, up) =
select ~z, max(top)
from
select ~x, ~z′, max(u) as top
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
group by ~z
endif
if the aggregate function is sum then
QGlb(~z, low) =
select ~z, sum(bottom)
from
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
having bottom ≥ 0
∨select ~x, ~z′, min(u) as bottom
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
having bottom < 0
group by ~z
QLub(~z, up) =
select ~z, sum(top)
from
select ~x, ~z′, max(u) as top
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
having top > 0
∨select ~x, ~z′, max(u) as top
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
having top ≤ 0
group by ~z
endif
if the aggregate function is min then
QGlb(~z, low) =
select ~z, min(bottom)
from
select ~x, ~z′, min(u) as bottom
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
group by ~z
QLub(~z, up) =
select ~z, min(top)
from
select ~x, ~z′, max(u) as top
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
group by ~z
endif
Let Q(~z, low, up) = QGlb(~z, low) ∧ QLub(~z, up)
return Q
Figure 4.2: Query rewriting algorithm for queries with aggregation
4.3 Correctness of the Algorithms
In this section, we prove the correctness of the query rewriting algorithms of this chapter.
We consider the following class of queries, which we call Caggforest.
Definition 4.1. Let q be an aggregate conjunctive query. We say that q is in class
Caggforest if q is of the form
select ~z, [count(*)| F(u)]
from q∗(~z, u)
group by ~z
where q∗ is a conjunctive query in Cforest, and F is one of the aggregation functions
min, max or sum.
The main result of this section is the following theorem:
Theorem 4.2. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z, v) be a query in Caggforest. Let Q(~z, l, u)
be the first-order aggregate query returned by RewriteCount(q, Σ) or RewriteAgg(q, Σ)
(depending on the aggregate function of the query).
Let I be an instance over R. If q has the aggregate function sum, assume that the
aggregated attribute ranges over positive numbers on I.
Then, for every tuple ~t, and pair of real numbers low and up, we have that
(~t, low, up) ∈ aggconsistentΣ(q, I) iff (~t, low, up) ∈ Q(I).
Notice that for the sum operator we have an additional requirement: the aggregated
variable must take only positive numbers. The rewriting for sum, however, does produce
sound bounds for arbitrary numbers (positive or negative), as we prove in Section 4.3.3.
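The soundness claim for sum with mixed signs can be checked by brute force on a toy instance. The sketch below specializes the sum branch of RewriteAgg by hand to a hypothetical single relation R(id, grp, val) with key id; the function names and data are assumptions, and the check covers only this one instance.

```python
from collections import defaultdict
from itertools import product

# Hypothetical relation R(id, grp, val) with key id; the query is
#   select grp, sum(val) from R group by grp
rows = [("k1", "g", 5), ("k1", "g", -3), ("k2", "g", 4), ("k2", "h", 4),
        ("k3", "g", -2), ("k3", "h", 7), ("k4", "h", 1)]

def repairs(rows):
    """Yield every repair: exactly one tuple per key value."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[0]].append(row)
    for choice in product(*by_key.values()):
        yield list(choice)

def sum_bounds(rows):
    """Hand specialization of the sum rewriting to one relation."""
    groups_of = defaultdict(set)   # key value -> groups it occurs with
    vals = defaultdict(list)       # (key value, group) -> values of val
    for k, g, v in rows:
        groups_of[k].add(g)
        vals[(k, g)].append(v)
    # Stand-in for QConsistent: the key is in this group in every repair.
    consistent = {(k, g) for (k, g) in vals if groups_of[k] == {g}}
    ok_groups = {g for (_, g) in consistent}   # ∃x'.QConsistent
    low, up = defaultdict(int), defaultdict(int)
    for (k, g), vs in vals.items():
        if g not in ok_groups:
            continue
        bottom, top = min(vs), max(vs)
        if (k, g) in consistent:
            if bottom >= 0:
                low[g] += bottom   # certain, non-negative contribution
            if top <= 0:
                up[g] += top       # certain, non-positive contribution
        if bottom < 0:
            low[g] += bottom       # a negative value may be chosen
        if top > 0:
            up[g] += top           # a positive value may be chosen
    return {g: (low[g], up[g]) for g in ok_groups}

def sums_in_repair(repair):
    totals = defaultdict(int)
    for _, g, v in repair:
        totals[g] += v
    return totals

bounds = sum_bounds(rows)
sound = all(bounds[g][0] <= sums_in_repair(r)[g] <= bounds[g][1]
            for r in repairs(rows) for g in bounds)
```

The branches with `bottom < 0` and `top > 0` correspond to the disjuncts of Figure 4.2 that draw values from q′′ rather than from QConsistent, which is what keeps the range sound when negative values are present.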
The algorithms use the first-order query rewritings of the previous chapter as a building
block. The semantics of those rewritings is set-theoretic, whereas the aggregate
functions we consider in this chapter take bags as input. In Section 4.3.1, we show that
for a subclass of the conjunctive queries in Cforest, the cardinality of the query results on
every repair is exactly one. Thus, for this subclass, it is not necessary to keep track of
tuple multiplicities in the intermediate results. Recall that in Chapter 3, we showed that
for every query q in a subclass of Cforest, there is a “pessimistic” repair M such that q(M)
retrieves all the consistent answers to q. We will use the notion of pessimistic repair to
prove that the bounds produced by the rewritings are tight. We will also need the dual
notion of an “optimistic” repair, which we introduce in Section 4.3.2. In Section 4.3.3, we
show that the ranges produced by the query rewritings are sound, in the sense that the
value of the aggregation function falls within the range on every repair. In Section 4.3.4,
we show that the ranges produced by the query rewritings are tight, in the sense that
they are satisfied in at least one repair. Finally, in Section 4.3.5 we put it all together,
and give the proof of correctness of the rewritings.
4.3.1 Building Upon First-Order Rewritings
The semantics of first-order rewritings is set-theoretic, whereas aggregate functions take
bags as input. In this subsection, we show that for a class of conjunctive queries that
is relevant in the query rewriting algorithms, the cardinality of the tuples in the result
of applying a query to the repairs is always one. As a consequence, for such queries, it
suffices to obtain a set-theoretic first-order rewriting. The result of applying the first-
order rewriting to the inconsistent database can be used as an intermediate step towards
obtaining the consistent answers for conjunctive queries with aggregation.
The queries with the aforementioned property are the conjunctive queries in Cforest,
where all the variables at key positions of some root of the join graph are free. The
proof is given in Lemma 4.4. The lemma makes use of an auxiliary result, that we give
next, which focuses on queries in Cforest that satisfy the additional condition that the
join graph must be a tree (instead of a forest). Intuitively, we show that in each repair
I, each tuple ~t in the query result is obtained “due to” the same set of tuples in I. More
formally, we show that if S and S ′ are sets that contain exactly one tuple per relation of
I and such that ~t ∈ q(S) and ~t ∈ q(S ′), then S ′ = S.
Lemma 4.3. Let q(~z) be a query in Cforest. Assume that the join graph T of q is a
tree, and that all the variables at key positions of the literal at the root of T are free in q
(that is, there is a literal R(~x, ~y) at the root of T such that ~x ⊆ ~z). Let I be a database
instance over the schema of q, and Σ be a set consisting of at most one key dependency
per relation of q. Let I be a repair of I wrt Σ. Let S and S ′ be sets that contain exactly
one tuple per relation of I and such that ~t ∈ q(S), and ~t ∈ q(S ′). Then, S ′ = S.
Proof. The proof is by induction on the number of literals of q.
Base case. Assume that q has exactly one literal. Assume towards a contradiction
that S ≠ S′. Then, there are distinct tuples ~t0 and ~t′0 in I such that ~t ∈ q({~t0}) and
~t ∈ q({~t′0}). Let R(~x, ~y) be the only literal of q. Since all the variables at key positions of
the root literal of T are free, and ~z are the free variables of q, we have that ~x ⊆ ~z. Thus,
there are vectors of values ~c, ~d and ~d′ such that ~d ≠ ~d′, ~t0 = R(~c, ~d), and ~t′0 = R(~c, ~d′).
Thus, I ⊭ Σ. But I is a repair of I wrt Σ; contradiction.
Inductive step. Assume that q has more than one literal. Let R be a literal of q
that appears at a leaf of T (recall that T is a tree). Let ~t0 and ~t′0 be tuples of S and S ′,
respectively, such that ~t0 = R(~c, ~d) and ~t′0 = R(~c′, ~d′).
Let M be a set that consists of all the tuples of S, except the one for literal R.
Let M ′ be a set that consists of all the tuples of S ′, except the one for literal R. By
inductive hypothesis, M = M ′. Notice that M and M ′ are the only subsets of S and S ′,
respectively, that satisfy these conditions since S and S ′ contain exactly one tuple per
relation of I.
Let R′(~x′, ~y′) be the parent of R in T . Then, there is a tuple ~t1 in R′ and valuations
ν and ν′ such that ~t1 ∈ S, ~t1 ∈ S′, {~t0, ~t1} |= R′(~x′, ~y′) ∧ R(~x, ~y)[~z/~t][ν], and
{~t′0, ~t1} |= R′(~x′, ~y′) ∧ R(~x, ~y)[~z/~t][ν′]. Notice that ν(~y′) = ν′(~y′). Since q ∈ Cforest, there is a full
nonkey-to-key join from R′ to R. Thus, all the variables of ~y′ appear in ~x. Therefore,
ν(~x) = ν′(~x); and ~c = ~c′. Assume towards a contradiction that ~t0 ≠ ~t′0. Then, there are
tuples R(~c, ~d) and R(~c′, ~d′) in I such that ~c = ~c′ and ~d ≠ ~d′. This means that I ⊭ Σ.
But I is a repair of I wrt Σ; contradiction.
In the next lemma, we show that for queries in Cforest such that the variables at key
positions of all root literals are free, the cardinality of each tuple in the query result is
exactly one.
Lemma 4.4. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z) be a conjunctive query over R such that
q ∈ Cforest. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals
at the root of each connected component (tree) of G. Assume that ~x1, . . . , ~xm are free
variables in q (i.e., they occur in ~z).
Let I be an instance over R. Let I be a repair of I wrt Σ. Let B be a bag such that
B = q(I) under bag semantics. Let ~t be such that ~t ∈ q(I). Then, |~t|B = 1.
Proof. Assume towards a contradiction that |~t|B > 1. Then, there are distinct sets S and
S ′ that contain exactly one tuple per literal of q and such that ~t ∈ q(S), and ~t ∈ q(S ′).
Since q ∈ Cforest, G is a forest. For each 1 ≤ i ≤ m, let Ti be the tree whose root is Ri.
Let φi(~w, ~z) be the conjunction of the literals of Ti. Let qi(~z) = ∃~w.φi(~w, ~z). Recall that
~xi (the variables at the key of the root literal of Ti) are free, and therefore occur in ~z.
Thus, qi satisfies the conditions of Lemma 4.3.
Since S ≠ S′, ~t ∈ q(S), and ~t ∈ q(S′), there must be some i and some sets M and M′
such that M ≠ M′, M ⊆ S, M′ ⊆ S′, M and M′ have one tuple for each relation symbol
in φi, ~t ∈ qi(M), and ~t ∈ qi(M′). But this contradicts Lemma 4.3 above.
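The effect stated in Lemma 4.4 can be observed on a small consistent instance. The relations, their keys, and the helper answer_bag below are hypothetical; the sketch evaluates a nonkey-to-key join under bag semantics, once with the key of the root literal kept free and once with it projected out.

```python
from collections import Counter

# A hypothetical consistent instance (i.e. a repair).
# orders(order_id, cust_id) has key order_id and is the root literal;
# customers(cust_id, city) has key cust_id.
orders = [("o1", "c1"), ("o2", "c1"), ("o3", "c2")]
customers = [("c1", "Toronto"), ("c2", "Ottawa")]

def answer_bag(project):
    """Bag of answers of the join orders.cust_id = customers.cust_id."""
    return Counter(project(o, c, city)
                   for (o, c) in orders
                   for (c2, city) in customers
                   if c == c2)

# Key of the root literal (order_id) is free: each answer occurs once.
with_root_key = answer_bag(lambda o, c, city: (o, city))

# Key of the root literal is projected out: multiplicities can exceed one.
without_root_key = answer_bag(lambda o, c, city: (city,))
```

With the root key free, set and bag semantics agree on this instance, so a set-theoretic rewriting loses no multiplicity information.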
4.3.2 An “Optimistic” Repair
Recall that in Chapter 3 we showed that for every query q in a subclass of Cforest, there is
a “pessimistic” repair M such that q(M) retrieves all the consistent answers to q. In Section
4.3.4, we will use M to prove the tightness of the query rewritings. For example, if
we apply an aggregate query on M, the value that we get for the count(*) aggregate
function corresponds to the greatest lower bound computed by the rewriting produced
by RewriteCount(q, Σ).
For the lowest upper bound, we will need the notion of an “optimistic” repair N . The
name “optimistic” comes from the fact that in this repair, if a tuple ~t can be obtained
from some repair of the inconsistent database, then the tuple is also in q(N ). In Lemma
4.6, we show the existence of such a repair.
Before proving the existence of the optimistic repair, we formally define the notion
of possible answers. This notion can be considered as dual to the notion of consistent
answers. While a consistent answer is one that holds in the query results obtained from
all the repairs, a possible answer is one that holds in the query result from at least one
repair.
Definition 4.5 (Possible Answers). Let R be a schema. Let Σ be a set of integrity
constraints. Let I be an instance over R (possibly inconsistent with respect to Σ). Let
q be a query over R. We say that a tuple ~t is a possible answer for q with respect to Σ
if there exists a repair I of I with respect to Σ such that ~t ∈ q(I). We denote this as
~t ∈ possibleΣ(q, I).
For a Boolean query q over R, we say that possibleΣ(q, I) = true if there exists a
repair I of I with respect to Σ such that I |= q. We say that possibleΣ(q, I) = false if
for every repair I of I with respect to Σ, I ⊭ q.
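The contrast between consistent and possible answers can be made concrete by enumerating repairs directly. This is a brute-force sketch of the definitions only, on a hypothetical relation acct(id, city) with key id; it is exponential in the number of conflicting keys and is not how the rewritings compute answers.

```python
from collections import defaultdict
from itertools import product

# Hypothetical relation acct(id, city) with key id.
rows = [("a1", "Toronto"), ("a2", "Toronto"), ("a2", "Ottawa"),
        ("a3", "Montreal"), ("a3", "Ottawa")]

def repairs(rows):
    """Yield every repair: exactly one tuple per key value."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[0]].append(row)
    for choice in product(*by_key.values()):
        yield list(choice)

def q(instance):
    """q(city) = exists id such that acct(id, city)."""
    return {city for (_id, city) in instance}

results = [q(r) for r in repairs(rows)]
consistent = set.intersection(*results)   # holds in every repair
possible = set.union(*results)            # holds in at least one repair
```

Ottawa is a possible answer but not a consistent one: the repair that keeps (a2, Toronto) and (a3, Montreal) contains no Ottawa account.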
Lemma 4.6. Let q(~x) be a query in Cforest, whose join graph T is a tree and where
R(~x, ~y) is the literal at the root of T . Let I be an instance. Then, there is a repair N
such that for all ~c, if ~c ∈ possibleΣ(q, I), then ~c ∈ q(N ).
Proof. Let N be the instance built by BuildOptimisticRepair(q, I) (the algorithm
given in Figure 4.3). We will prove the claim by induction on the number of
literals of q.
Base case. Assume that q consists of exactly one literal R(~x, ~y). Let ~t be the
tuple selected by the algorithm in the iteration for literal R and the vector of values ~c.
Assume towards a contradiction that N ⊭ ∃~w.R(~c, ~y). Then, {~t} ⊭ ∃~w.R(~c, ~y). Since
possibleΣ(∃~w.R(~c, ~y), I) = true, there is some repair I of I such that I |= ∃~w.R(~c, ~y).
Thus, there is a tuple ~t′ such that {~t′} |= ∃~w.R(~c, ~y). Notice that ~t and ~t′ can be added
to N only during the iteration for the vector of values ~c. Since {~t} ⊭ ∃~w.R(~c, ~y) and
{~t′} |= ∃~w.R(~c, ~y), the algorithm never selects tuple ~t. But ~t ∈ N ; contradiction.
Inductive step. Assume that q has more than one literal. Let φ(~w, ~x) be the
conjunction of literals of q. Let T1, . . . , Tm be the subtrees of T such that the root of
Tj is a child of the root of T , for 1 ≤ j ≤ m. For each 1 ≤ j ≤ m, let Sj(~xj, ~yj)
be the literal at the root of Tj. Let φj be the conjunction of the literals of Tj. Let
~wj = {w : w is a variable of φj, and w ∉ ~xj}. Let qj(~xj) = ∃~wj.φj(~xj, ~wj). Let
Nj = BuildOptimisticRepair(qj, I).
Assume towards a contradiction that ~c ∉ q(N ). Let ~t be the tuple of I selected by the
algorithm in the iteration for literal R and the vector of values ~c. Then, ~t ∈ N , and there
is some ~d such that ~t = R(~c, ~d). Since ~c ∉ q(N ), there must be some j, some valuation
ν for the variables of ~y, and some ~cj such that 1 ≤ j ≤ m, ν(~y) = ~d, ν(~xj) = ~cj, and
~cj ∉ qj(Nj).
Since possibleΣ(q(~c), I) = true, there is some repair I of I such that ~c ∈ q(I).
Thus, there is some tuple ~t′ in I, some ~d′, and some valuation ν for the variables of ~y
such that ~t′ = R(~c, ~d′), ν(~y) = ~d′, and the following condition holds: for every j and
tuple of values ~c′j such that 1 ≤ j ≤ m and ν(~xj) = ~c′j, we have that ~c′j ∈ qj(I). Thus,
possibleΣ(qj(~cj), I) = true. By inductive hypothesis ~c′j ∈ qj(Nj). Thus, the algorithm
selects ~t′ in the construction of N , rather than ~t. But ~t ∈ N ; contradiction.
Algorithm BuildOptimisticRepair
Input: q(~x), a query in Cforest of the form ∃~w.φ(~w, ~x),
whose join graph T is a tree with root literal R(~x, ~y)
Σ, a set of key constraints, one per relation
I, a database instance
Initialize N as an empty instance
if φ has exactly one literal then
  for each ~c such that there is some R(~c, ~d) in I do
    if there is some ~d such that R(~c, ~d) ∈ I,
    and {R(~c, ~d)} |= ∃~w.R(~c, ~y) then
      Let ~t = R(~c, ~d)
    else
      Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d
    end if
    Add ~t to N
  end for
else
  /* φ has more than one literal */
  Let S1, . . . , Sm be the children of R in T
  for j := 1 to m do
    Let Tj be the subtree of T whose root is Sj
    Let φj be the conjunction of literals of Tj
    Let ~wj = {w : w is a variable that occurs in φj and ~w, and w ∉ ~xj}
    Let qj(~xj) = ∃~wj.φj(~xj, ~wj)
    Let Nj = BuildOptimisticRepair(qj, I)
    Add Nj to N
  end for
  for each ~c such that there is some R(~c, ~d) in I do
    if there is some ~d and some valuation ν for the variables of ~y such that R(~c, ~d) ∈ I,
    ν(~y) = ~d, and there is no j and ~cj such that ν(~xj) = ~cj and ~cj ∉ qj(Nj) then
      Let ~t = R(~c, ~d)
    else
      Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d
    end if
    Add ~t to N
  end for
end if
Figure 4.3: Algorithm to build the “optimistic” repair
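The one-literal branch of the algorithm can be sketched as follows. The function name, the satisfies predicate (which stands for the test {R(~c, ~d)} |= ∃~w.R(~c, ~y)), and the sample relation are assumptions made for illustration; the recursive branch for queries with several literals is not covered.

```python
from collections import defaultdict

def build_optimistic_repair_single_literal(rows, satisfies):
    """For each key value, keep a tuple that satisfies the literal if one
    exists; otherwise keep an arbitrary tuple with that key value."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[0]].append(row)     # assumes the key is the first column
    repair = []
    for candidates in by_key.values():
        witness = next((r for r in candidates if satisfies(r)), candidates[0])
        repair.append(witness)
    return repair

# Hypothetical relation user(id, status) with key id, and the query
#   q(id) = user(id, 'active')
rows = [("u1", "active"), ("u1", "closed"), ("u2", "closed"),
        ("u3", "closed"), ("u3", "active")]

repair = build_optimistic_repair_single_literal(
    rows, satisfies=lambda r: r[1] == "active")
answers = {uid for (uid, status) in repair if status == "active"}
```

Every id that is active in some repair (u1 and u3) is active in the repair that is built, which is the property stated in Lemma 4.6 for this instance.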
4.3.3 Sound Ranges
In this subsection, we show that the ranges produced by the query rewritings are sound,
in the sense that the value of the aggregation function falls within the returned range on
every repair.
The next lemma shows that the rewritings produced by RewriteCount compute sound
ranges.
Lemma 4.7. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z, v) be a query of the following form:
select ~z, count(*)
from q∗(~z)
group by ~z
where q∗(~z) is a conjunctive query in Cforest.
Let Q be the first-order aggregate query returned by RewriteCount(q, Σ). Let I be a
database instance over R. Let I be a repair of I wrt Σ. Let ~t be a tuple, and low and up
be a pair of real numbers such that (~t, low, up) ∈ Q(I) and ~t ∈ consistentΣ(q∗, I). Let d
be such that (~t, d) ∈ q(I). Then, low ≤ d ≤ up.
Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the
roots of all trees of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi,
let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′). Let ~x′ = ~x − ~z. Let
QConsistent(~x, ~z′) be the query obtained by invoking RewriteForest(q′, Σ).
Lower Bound. Since (~t, low, up) ∈ Q(I), the lower bound low of ~t is computed with
the following query:
QGlb(~z, glb) = select ~z, count(*)
from QConsistent(~x, ~z′)
group by ~z
Assume towards a contradiction that d < low. Then, there is a tuple (~c, ~t′) such
that (~c, ~t′) ∈ QConsistent(I) and (~c, ~t′) ∉ q′(I). Then, (~c, ~t′) ∉ consistentΣ(q′, I). By
Theorem 3.5, we conclude that (~c, ~t′) ∉ QConsistent(I); contradiction.
Upper Bound. Since (~t, low, up) ∈ Q(I), the upper bound up of ~t is computed with
the following query:
QLub(~z, lub) = select ~z, count(*)
from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~z
Assume towards a contradiction that d > up. Then, there is a valuation ν and a tuple
(~c, ~t′) such that ν(~x) = ~c, ν(~z′) = ~t′, ν(~z) = ~t, (~c, ~t′) ∈ q′(I), and either (1) (~c, ~t′) ∉ q′(I);
or (2) I ⊭ (∃~x′.QConsistent(~x, ~z′)).
Assume that (1) (~c, ~t′) ∉ q′(I). Since I is a repair of I, by Proposition 3.6, I ⊆ I.
Thus, (~c, ~t′) ∉ q′(I); contradiction. Assume that (2) I ⊭ (∃~x′.QConsistent(~x, ~z′)).
Recall that ~x′ = ~x − ~z. By Theorem 3.5, (~c′, ~t′) ∉ consistentΣ(q′, I), for every ~c′. In
particular, (~c, ~t′) ∉ consistentΣ(q′, I). Recall that there is a valuation ν for the variables
of ~x and ~z′ such that ν(~x) = ~c, ν(~z′) = ~t′ and ν(~z) = ~t. Thus, ~t ∉ consistentΣ(q∗, I);
contradiction.
The next lemma shows that the rewritings for queries with the sum operator compute
sound ranges.
Lemma 4.8. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z, v) be a query of the following form:
select ~z, sum(u)
from q∗(~z, u)
group by ~z
where q∗(~z, u) is a conjunctive query in Cforest.
Let Q be the first-order aggregate query returned by RewriteAgg(q, Σ). Let I be a
database instance over R. Let I be a repair of I wrt Σ. Let ~t be a tuple, and low and up
be a pair of real numbers such that (~t, low, up) ∈ Q(I) and ~t ∈ consistentΣ(q∗, I). Let d
be such that (~t, d) ∈ q(I). Then, low ≤ d ≤ up.
Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at
the roots of all trees of G. Let φ(~w, ~z, u) be the conjunction of literals of q∗. Let
~x = ⋃_{i=1...m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let ~x′ = ~x − ~z − u. Let
q′(~x, ~z′) = ∃~w′, u.φ(~x, ~w′, ~z′, u). Let QConsistent(~x, ~z′) be the query obtained by
invoking RewriteForest(q′, Σ). Let q′′ be the query q′′(~x, ~z′, u) = ∃~w′.φ(~x, ~w′, ~z′, u).
Lower Bound. Since (~t, low, up) ∈ Q(I), the lower bound low of ~t is computed with
the following query:
QGlb(~z, glb) = select ~z, sum(v)
from QContribConsistent(~x, ~z′, v) ∨ QContribNonConsistent(~x, ~z′, v)
group by ~z
where QContribConsistent is the following query:
QContribConsistent(~x, ~z′, bottom) =
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
having bottom ≥ 0
and QContribNonConsistent is the following query:
QContribNonConsistent(~x, ~z′, bottom) =
select ~x, ~z′, min(u) as bottom
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
having bottom < 0
Assume towards a contradiction that d < low. Since (~t, d) ∈ q(I), we must consider
the following cases.
First, assume that there is a valuation ν for the variables in ~z, ~x such that ν(~z) = ~t,
ν(~x) = ~c, ν(~z′) = ~t′, and
• (~c, ~t′) ∉ q′(I); and
• there is some e such that e > 0; and
• (~c, ~t′, e) ∈ QContribConsistent ∨ QContribNonConsistent(I).
Since e > 0, (~c, ~t′, e) ∈ QContribConsistent(I). Since (~c, ~t′) ∉ q′(I), (~c, ~t′) ∉
consistentΣ(q′, I). By Theorem 3.5, we conclude that (~c, ~t′) ∉ QConsistent(I). Therefore,
(~c, ~t′, e) ∉ QContribConsistent(I); contradiction.
Second, assume that there is a valuation ν for the variables in ~z, ~x such that ν(~z) = ~t,
ν(~x) = ~c , ν(~z′) = ~t′, and
• there is some e′ such that e′ < 0; and
• (~c, ~t′, e′) ∈ q′′(I); and
• for every e such that e < 0, we have that (~c, ~t′, e) ∉ QContribConsistent ∨ QContribNonConsistent(I).
Since I ⊆ I and (~c, ~t′, e′) ∈ q′′(I), we have that (~c, ~t′, e′) ∈ q′′(I). Since by
hypothesis, ~t ∈ consistentΣ(q∗, I), (~c′, ~t′) ∈ consistentΣ(q′, I) for some ~c′. By Theorem
3.5, (~c′, ~t′) ∈ QConsistent(I). Thus, I |= ∃~x′.QConsistent(~x, ~z′)[~z/~t]. Since e′ < 0,
(~c, ~t′, e′) ∈ q′′(I) and I |= ∃~x′.QConsistent(~x, ~z′)[~z/~t], we conclude that
(~c, ~t′, e′) ∈ QContribNonConsistent(I); contradiction.
Third, assume that there is a valuation ν for the variables in ~z, ~x such that ν(~z) = ~t,
ν(~x) = ~c , ν(~z′) = ~t′, and
• there is some e such that (~c, ~t′, e) ∈ QContribConsistent ∨ QContribNonConsistent(I); and
• there is some e′ such that e′ < e; and
• (~c, ~t′, e′) ∈ q′′(I).
Assume that (~c, ~t′, e) ∈ QContribConsistent(I). Then, (~c, ~t′) ∈ QConsistent(I),
and (~c, ~t′, e) ∈ q′′(I). Since I ⊆ I, and (~c, ~t′, e′) ∈ q′′(I), we have that (~c, ~t′, e′) ∈
q′′(I). Notice that e and e′ correspond to the attribute bottom of QContribConsistent.
This attribute is computed as min(u), that is, the minimum of the values of u for the
tuples of (~c, ~t′). Since (~c, ~t′, e) and (~c, ~t′, e′) satisfy the conditions of the from clause of
QContribConsistent, e ≤ e′; contradiction.
Now, assume that (~c, ~t′, e) ∈ QContribNonConsistent(I). Since I ⊆ I, (~c, ~t′, e′) ∈
q′′(I). Since e corresponds to the attribute bottom of QContribNonConsistent, e ≤ e′;
contradiction.
Upper Bound. The proof for the lowest upper bound is analogous to the proof for
the greatest lower bound.
The next lemma shows that the rewritings for queries with the min and max aggregation
functions compute sound ranges.
Lemma 4.9. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z, v) be a query of the following form:
select ~z, [min(u)| max(u)]
from q∗(~z, u)
group by ~z
where q∗(~z, u) is a conjunctive query in Cforest.
Let Q be the first-order aggregate query returned by RewriteAgg(q, Σ). Let I be a
database instance over R. Let I be a repair of I wrt Σ. Let ~t be a tuple, and low and up
be a pair of real numbers such that (~t, low, up) ∈ Q(I) and ~t ∈ consistentΣ(q∗, I). Let d
be such that (~t, d) ∈ q(I). Then, low ≤ d ≤ up.
Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at
the roots of all trees of G. Let φ(~w, ~z, u) be the conjunction of literals of q∗. Let
~x = ⋃_{i=1...m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let ~x′ = ~x − ~z − u. Let
q′(~x, ~z′) = ∃~w′, u.φ(~x, ~w′, ~z′, u). Let QConsistent(~x, ~z′) be the query obtained by
invoking RewriteForest(q′, Σ). Let q′′ be the query q′′(~x, ~z′, u) = ∃~w′.φ(~x, ~w′, ~z′, u).
Lower Bound. Suppose that the aggregate function of q is max. Since (~t, low, up) ∈
Q(I), the lower bound low of ~t is computed with the following query:
QGlb(~z, glb) = select ~z, max(u)
from QContribConsistent(~x, ~z′, u)
group by ~z
where QContribConsistent is the following query:
QContribConsistent(~x, ~z′, bottom) =
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
Assume towards a contradiction that d < low. Then, there is a valuation ν for the
variables in ~z, ~x such that ν(~z) = ~t, ν(~x) = ~c , ν(~z′) = ~t′, and
• there is some e such that (~c, ~t′, e) ∈ QContribConsistent(I); and
• there is some e′ such that e′ < e; and
• (~c, ~t′, e′) ∈ q′′(I).
We arrive at a contradiction by the same arguments used in Lemma 4.8 for queries
with the sum operator.
Now, suppose that the aggregate function of q is min. Since (~t, low, up) ∈ Q(I), the
lower bound low of ~t is computed with the following query:
QGlb(~z, glb) =
select ~z, min(bottom)
from QContribNonConsistent(~x, ~z′, u)
group by ~z
where QContribNonConsistent is the following query:
select ~x, ~z′, min(u) as bottom
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
Assume towards a contradiction that d < low. Then, there is a valuation ν for the
variables in ~z, ~x such that ν(~z) = ~t, ν(~x) = ~c , ν(~z′) = ~t′, and
• there is some e such that (~c, ~t′, e) ∈ QContribNonConsistent(I); and
• there is some e′ such that e′ < e; and
• (~c, ~t′, e′) ∈ q′′(I).
We arrive at a contradiction by the same arguments used in Lemma 4.8 for queries
with the sum operator.
Upper Bound. For the max operator, we can give an argument analogous to the
argument given for the lower bound of the min operator. For the min operator, we
can give an argument analogous to the argument given for the lower bound of the max
operator.
4.3.4 Tight Ranges
In this section, we show that the ranges produced by the query rewritings are tight. For
this, we must exhibit two repairs, where the result of the aggregation function corresponds
to the greatest lower bound in one repair, and to the lowest upper bound in the other. For
example, if the query has the count(*) operator, the repair that we need for the greatest
lower bound turns out to be the “pessimistic” repair M used in the correctness proof of
the first-order rewritings of Section 3.3.3. For the lowest upper bound, the needed repair
is the “optimistic” repair N that we introduced in Section 4.3.2.
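Tightness can be illustrated by exhibiting, for a toy instance, one repair that attains each bound. The sketch below finds the two witnesses by exhaustive search over repairs of a hypothetical relation acct(id, city) with key id; it does not construct M or N with the algorithms of the thesis, and only shows what the two repairs must achieve.

```python
from collections import defaultdict
from itertools import product

# Hypothetical relation acct(id, city) with key id; the query is
#   select city, count(*) from acct group by city
rows = [("a1", "Toronto"), ("a2", "Toronto"), ("a2", "Ottawa"),
        ("a3", "Ottawa"), ("a4", "Toronto")]

def repairs(rows):
    """Yield every repair: exactly one tuple per key value."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[0]].append(row)
    for choice in product(*by_key.values()):
        yield list(choice)

def count_in(repair, city):
    return sum(1 for (_id, c) in repair if c == city)

all_repairs = list(repairs(rows))

# Witness repairs for the group 'Toronto': one attains the greatest lower
# bound and the other attains the lowest upper bound.
lowest = min(all_repairs, key=lambda r: count_in(r, "Toronto"))
highest = max(all_repairs, key=lambda r: count_in(r, "Toronto"))
toronto_range = (count_in(lowest, "Toronto"), count_in(highest, "Toronto"))
```

The witness for the lower bound resolves the conflict on a2 away from Toronto, and the witness for the upper bound resolves it towards Toronto, mirroring the roles of the pessimistic and optimistic repairs.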
We start by showing that the rewritings produced by RewriteCount give tight bounds.
In the next lemma, we show that the greatest lower bound of count(*) can be obtained
by executing the query on the pessimistic repair M. We also show that the query
rewriting that we obtain correctly returns such bound.
Lemma 4.10. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z) be a query of the following form:
select ~z, count(*)
from q∗(~z)
group by ~z
where q∗(~z) is a query in Cforest.
Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots
of each tree of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi,
let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′). Let Q(~z, l, u) be the
first-order aggregate query returned by RewriteCount(q, Σ). Let I be an instance over
R. Let ~t be a tuple and low and up be a pair of real numbers.
Then, there is a repair M of I wrt Σ and a bag B such that B = q(M), and the
following conditions hold:
1. for every valuation ν such that ν(~x) = ~c, ν(~z′) = ~t′, and ν(~z) = ~t, if (~c, ~t′) ∈ q′(M),
then ~c ∈ consistentΣ(q′[~z/~t], I), and
2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = low, and
3. if (~t, low, up) ∈ Q(I), then |~t|B = low.
Proof. Let M be the pessimistic repair obtained by invoking the algorithm
BuildPessimisticRepair(q, Σ, I). Condition (1) holds by Lemma 3.10. We must now prove
Conditions (2) and (3).
In order to prove Condition 2, let ~t be a tuple, and low, and up be a pair of real
numbers such that (~t, low, up) ∈ aggconsistentΣ(q, I). Then, there is a repair I of I
wrt Σ and a bag B′ such that B′ = q(I) and |~t|B′ = low. Furthermore, by Lemma
4.7, since M is a repair of I wrt Σ, |~t|B ≥ low. Assume towards a contradiction that
|~t|B > low. Then, there is a valuation ν for the variables of ~x and ~z such that ν(~x) = ~c,
ν(~z) = ~t and ν(~z′) = ~t′, and one of the following conditions holds:
• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or
• (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I).
Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now, assume
that (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I). Then, (~c, ~t′) ∉ consistentΣ(q′[~z/~t], I).
By Condition 1, we have that (~c, ~t′) ∉ q′[~z/~t](M); contradiction.
In order to prove Condition 3, let ~t, low, and up be such that (~t, low, up) ∈ Q(I).
Since M is a repair of I, by Lemma 4.7, |~t|B ≥ low. Let QConsistent(~x, ~z′) be the query
obtained by invoking RewriteForest(q′, Σ). Then, the lower bound low of ~t is computed
with the following query:
QGlb(~z, low) = select ~z, count(*)
from QConsistent(~x, ~z′)
group by ~z
Assume towards a contradiction that |~t|B > low. Then, there is a valuation ν for the
variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and one of the following
conditions holds:
• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or
• (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I).
Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now,
assume that (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I). Since (~c, ~t′) ∉ QConsistent(I),
by Theorem 3.5, (~c, ~t′) ∉ consistentΣ(q′, I). Then, by Condition 1, we have that
(~c, ~t′) ∉ q′(M); contradiction.
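The role of the greatest lower bound can be illustrated on a toy instance by enumerating all repairs explicitly. The following is only a sketch under our own assumptions: a single relation R(x, z) with key x, the query select z, count(*) from R group by z, and hypothetical helper names (repairs, count_by_group, range_consistent_counts) that do not appear elsewhere in this thesis. It keeps only the groups that occur in every repair, which is our reading of the range semantics used in this chapter, and it is exponential in the number of violated keys, so it is a checking aid rather than a substitute for the rewriting.

```python
from itertools import product
from collections import Counter, defaultdict


def repairs(instance, key_len=1):
    """Yield every repair of a single-relation instance: one tuple per key value."""
    groups = defaultdict(list)
    for tup in sorted(set(instance)):
        groups[tup[:key_len]].append(tup)
    for choice in product(*groups.values()):
        yield list(choice)


def count_by_group(repair, group_pos=1):
    """Evaluate `select z, count(*) from R group by z` on one repair."""
    return Counter(tup[group_pos] for tup in repair)


def range_consistent_counts(instance):
    """Return {z: (glb, lub)} for every group z that occurs in all repairs."""
    per_repair = [count_by_group(r) for r in repairs(instance)]
    always_present = set.intersection(*[set(c) for c in per_repair])
    return {
        z: (min(c[z] for c in per_repair), max(c[z] for c in per_repair))
        for z in sorted(always_present)
    }


# Hypothetical instance of R(x, z) with key x; keys 2 and 4 violate the key.
R = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "b"), (4, "c")]
bounds = range_consistent_counts(R)
print(bounds)
```

On this instance the lower bound of each group is attained by a repair that avoids the group whenever a key offers an alternative, which is the behaviour that the pessimistic repair M is built to exhibit.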
In the next lemma, we show that the lowest upper bound of count(*) can be obtained
by executing q on the optimistic repair N . We also show that the query rewriting of q
correctly returns such bound.
Lemma 4.11. Let R be a schema. Let Σ be a set of integrity constraints, consisting
of one key dependency per relation of R. Let q(~z) be a query in Cforest of the following
form:
select ~z, count(*)
from q∗(~z)
group by ~z
where q∗(~z) is a query in Cforest.
Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots
of each tree of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi,
let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′). Let Q(~z, l, u) be the
first-order aggregate query returned by RewriteCount(q, Σ). Let I be an instance over
R. Let ~t be a tuple and low and up be a pair of real numbers.
Then, there is a repair N of I wrt Σ and a bag B such that B = q(N ), and the
following conditions hold:
1. for every valuation ν such that ν(~x) = ~c and ν(~z) = ~t, if ~c ∈ possibleΣ(q′[~z/~t], I),
then ~c ∈ q′[~z/~t](N ), and
2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = up, and
3. if (~t, low, up) ∈ Q(I), then |~t|B = up.
Proof. Let N be the optimistic repair obtained by invoking the algorithm BuildOpti-
misticRepair(q, Σ, I). Condition (1) holds by Lemma 4.6. We must now prove Condi-
tions (2) and (3).
In order to prove Condition 2, let ~t be a tuple, and low and up be real numbers such
that (~t, low, up) ∈ aggconsistentΣ(q, I). Then, there is a repair I of I wrt Σ and a bag
B′ such that B′ = q(I) and |~t|B′ = up. Furthermore, since N is a repair of I wrt Σ, by
Lemma 4.7, |~t|B ≤ up. Assume towards a contradiction that |~t|B < up. Then, there is a
valuation ν for the variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and
one of the following conditions holds:
• (~c, ~t′) ∈ q′(I) and |(~c, ~t′)|B′ > 1; or
• (~c, ~t′) ∉ q′(N ) and (~c, ~t′) ∈ q′(I).
Assume that (~c, ~t′) ∈ q′(I) and |(~c, ~t′)|B′ > 1. This contradicts Lemma 4.4. Now,
assume that (~c, ~t′) ∉ q′(N ) and (~c, ~t′) ∈ q′(I). Then, ~c ∈ possibleΣ(q′[~z/~t], I). By
Condition 1, we have that ~c ∈ q′[~z/~t](N ); contradiction.
In order to prove Condition 3, let ~t, low, and up be such that (~t, low, up) ∈ Q(I).
Since N is a repair of I, by Lemma 4.7, |~t|B ≤ up. Let ~x′ = ~x−~z. Let QConsistent(~x, ~z′)
be the query obtained by invoking RewriteForest(q′, Σ). Since (~t, low, up) ∈ Q(I), the
upper bound up of ~t is computed with the following query:
QLub(~z, up) = select ~z, count(*)
from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~z
Assume towards a contradiction that |~t|B < up. Then, there is a valuation ν for the
variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and either:
• (~c, ~t′) is accounted for more than once in the from clause of QLub; or
• (~c, ~t′) ∉ q′(N ), (~c, ~t′) ∈ q′(I), and I |= (∃~x′.QConsistent[~z/~t]).
Assume that (~c, ~t′) is accounted for more than once in the from clause of QLub. This
is a contradiction since by definition the from clause of a first-order aggregate query is
computed using set semantics. Now, assume that (~c, ~t′) ∉ q′(N ), (~c, ~t′) ∈ q′(I), and
I |= (∃~x′.QConsistent[~z/~t]). Since (~c, ~t′) ∈ q′(I), we have that ~c ∈ possibleΣ(q′[~z/~t], I).
Thus, by Condition 1, ~c ∈ q′[~z/~t](N ); contradiction.
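To make the shape of QGlb and QLub concrete, the sketch below instantiates them for the simplest case we could think of: a single relation R(x, z) with key x and the query select z, count(*) from R group by z. The SQL text is our own hand-written instantiation for this special case (not the output of RewriteCount), the table contents are hypothetical, and the queries are run with Python's sqlite3 module only for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (x INTEGER, z TEXT)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "b"), (4, "c")],
)

# QConsistent(x, z): key values x whose tuples all agree on z.
Q_CONSISTENT = """
    SELECT DISTINCT r.x AS x, r.z AS z
    FROM R r
    WHERE NOT EXISTS (SELECT 1 FROM R r2 WHERE r2.x = r.x AND r2.z <> r.z)
"""

# QGlb: count only the tuples that survive in every repair.
Q_GLB = f"SELECT z, COUNT(*) FROM ({Q_CONSISTENT}) GROUP BY z ORDER BY z"

# QLub: count every possible (x, z) pair, restricted to groups z that have
# at least one consistent tuple. DISTINCT mimics the set semantics of the
# from clause of a first-order aggregate query.
Q_LUB = f"""
    SELECT z, COUNT(*)
    FROM (SELECT DISTINCT x, z FROM R)
    WHERE z IN (SELECT z FROM ({Q_CONSISTENT}))
    GROUP BY z ORDER BY z
"""

glb = dict(conn.execute(Q_GLB).fetchall())
lub = dict(conn.execute(Q_LUB).fetchall())
print(glb, lub)
```

Group c is absent from both results because no key value determines it consistently; this corresponds to the conjunct ∃~x′.QConsistent in the from clause of QLub.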
For the unary operators, the proof of tightness proceeds in an analogous way, except
that the optimistic and pessimistic repairs have to be modified to ensure every tuple has
the minimum (or maximum, depending on the case) for attribute u. We next show how
to obtain a pessimistic repair for queries with the sum operator.
Algorithm BuildPessimisticRepairForSum (q, I,M∗)
Input: A query q of the form
select ~z, sum(u)
from q∗(~z)
group by ~z
where q∗ is a conjunctive query in Cforest
I, an instance
M∗, a pessimistic repair
Output: M, a pessimistic repair
Initialize M as M∗
Let R(~x, ~y) be the literal of q where u appears
for each tuple R(~c, ~d) of M do
    Let ν be a valuation for the variables of R such that ν(~x) = ~c and ν(~y) = ~d
    for every valuation ν′ for the variables of R such that ν′(~x) = ~c′, ν′(~y) = ~d′,
            R(~c′, ~d′) ∈ I, and ν(z) = ν′(z) for every z such that z ≠ u do
        if ν′(u) < ν(u) then
            Replace R(~c, ~d) with R(~c′, ~d′) in M
        end if
    end for
end for
Notice in the algorithm that a tuple R(~c, ~d) is replaced only if there is another tuple
with the same values, except for the attribute u, and the other tuple has a smaller value
on u (condition ν ′(u) < ν(u) in the algorithm). In the rewriting for the lower bound of
the sum operator, this corresponds to the fact that for positive values we aggregate over
the minimum value of u for all tuples in the intermediate result. In contrast, for the upper
bound, we aggregate over the maximum value of u. Thus, for the upper bound, a similar
algorithm can be used, where we replace tuples for which the condition ν ′(u) > ν(u)
is satisfied. Since we choose the conditions that correspond to positive numbers in the
rewriting given in RewriteAgg, the tightness results for the sum operator need to restrict
the domain of the aggregated value to range over positive numbers (for min and max we
do not have this restriction). In Figure 4.4, we summarize the repairs that must be
modified in order to obtain the tight bounds of each aggregation function, and which
condition must be checked.
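The replacement step of BuildPessimisticRepairForSum can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: tuples of the literal R are plain Python tuples, u_pos is the (hypothetical) position of the aggregated attribute u, and the function name is ours. For the upper bound, the same loop would be used with the comparison reversed.

```python
def build_pessimistic_repair_for_sum(instance, m_star, u_pos):
    """Replace each tuple of m_star by a tuple of `instance` that agrees with
    it on every position except u_pos and carries the smallest value there."""
    repair = []
    for tup in m_star:
        best = tup
        for cand in instance:
            same_elsewhere = all(
                a == b
                for i, (a, b) in enumerate(zip(best, cand))
                if i != u_pos
            )
            # Condition nu'(u) < nu(u) of the algorithm.
            if same_elsewhere and cand[u_pos] < best[u_pos]:
                best = cand
        repair.append(best)
    return repair


# Hypothetical data: R(x, z, u) with key x and aggregated attribute u.
I = [(1, "a", 5), (1, "a", 3), (1, "c", 1), (2, "b", 7)]
M_star = [(1, "a", 5), (2, "b", 7)]
M = build_pessimistic_repair_for_sum(I, M_star, u_pos=2)
print(M)
```

Note that the tuple (1, "c", 1) is never chosen even though its value of u is the smallest: it differs from the tuple of M∗ on an attribute other than u, so the condition of the inner loop does not apply to it.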
The following lemma shows that the greatest lower bound computed for the sum
operator can be obtained from the pessimistic repair computed with the procedure given
above. We also show that our query rewriting correctly returns such bound.
Function   Bound   Repair        Condition
max        glb     pessimistic   ν′(u) < ν(u)
max        lub     optimistic    ν′(u) > ν(u)
sum        glb     pessimistic   ν′(u) < ν(u)
sum        lub     optimistic    ν′(u) > ν(u)
min        glb     optimistic    ν′(u) < ν(u)
min        lub     pessimistic   ν′(u) > ν(u)
Figure 4.4: Repairs that must be used to obtain the tight bounds of unary operators
Lemma 4.12. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z) be a query of the following form:
select ~z, sum(u)
from q∗(~z, u)
group by ~z
where q∗(~z, u) is a conjunctive query in Cforest and u ranges over the positive numbers.
Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots
of each tree of G. Let φ(~w, ~z, u) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi,
let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃ ~w′, u.φ(~x, ~w′, ~z′, u). Let Q(~z, l, u)
be the first-order aggregate query returned by RewriteAgg(q, Σ). Let I be an instance
over R. Let ~t be a tuple and low and up be a pair of real numbers. Let q′′(~x, ~z′, u) =
∃ ~w′.φ(~x, ~w′, ~z′, u).
Then, there is a repair M of I wrt Σ and some value d such that (~t, d) ∈ q(M), and
the following conditions hold:
1. for every valuation ν such that ν(~x) = ~c, ν(~z′) = ~t′, and ν(~z) = ~t, if (~c, ~t′) ∈ q′(M),
then ~c ∈ consistentΣ(q′[~z/~t], I), and
2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then d = low, and
3. if (~t, low, up) ∈ Q(I), then d = low.
Proof. Let M∗ be the repair obtained by invoking the algorithm BuildPessimistic-
Repair(q, Σ, I). Let M be the repair obtained by invoking the algorithm BuildPess-
imisticRepairForSum(q, I,M∗). Condition (1) holds by Lemma 3.10. We must now
prove Conditions (2) and (3).
In order to prove Condition 2, let ~t be a tuple, and let low and up be a pair of real
numbers such that (~t, low, up) ∈ aggconsistentΣ(q, I). Then, there is a repair I of I
wrt Σ such that (~t, low) ∈ q(I). Furthermore, by Lemma 4.8, since M is a repair of I wrt
Σ, d ≥ low. Assume towards a contradiction that d > low. Let B = q′(M). Then, there
is a valuation ν for the variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′,
and one of the following conditions holds:
• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or
• there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and (~c, ~t′, e′) ∈ q′(I); or
• (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I).
Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now,
assume that there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and (~c, ~t′, e′) ∈ q′(I).
Let ν′ and ν′′ be valuations such that for every w ≠ u, ν(w) = ν′(w) and
ν(w) = ν′′(w); ν′(u) = e; and ν′′(u) = e′. Since M is constructed using the algorithm
BuildPessimisticRepairForSum and I ⊆ I, ν′(u) < ν′′(u). Thus, e < e′;
contradiction. Finally, assume that (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I). Then,
(~c, ~t′) ∉ consistentΣ(q′[~z/~t], I). By Condition 1, we have that (~c, ~t′) ∉ q′[~z/~t](M);
contradiction.
In order to prove Condition 3, let ~t, low, and up be such that (~t, low, up) ∈ Q(I).
Since M is a repair of I, by Lemma 4.8, d ≥ low. Let QConsistent(~x, ~z′) be the query
obtained by invoking RewriteForest(q′, Σ). Since u ranges only over positive numbers,
the lower bound low of ~t is computed with the following query:
QGlb(~z, glb) = select ~z, sum(v)
from QContribConsistent(~x, ~z′, v)
group by ~z
where QContribConsistent is the following query:
QContribConsistent(~x, ~z′, bottom) =
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
having bottom ≥ 0
Assume towards a contradiction that d > low. Then, there is a valuation ν for the
variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and one of the following
conditions holds:
• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or
• there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and
(~c, ~t′, e′) ∈ QContribConsistent(I); or
• (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I).
Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now,
assume that there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and
(~c, ~t′, e′) ∈ QContribConsistent(I). Since e′ is computed as min(u) in QContribConsistent,
and M ⊆ I, e′ < e; contradiction. Finally, assume that (~c, ~t′) ∈ q′(M) and
(~c, ~t′) ∉ QConsistent(I). Since (~c, ~t′) ∉ QConsistent(I), by Theorem 3.5, we have that
(~c, ~t′) ∉ consistentΣ(q′, I). Then, by Condition 1, we have that (~c, ~t′) ∉ q′(M); contradiction.
Notice that the proof above is similar to the one for Lemma 4.10, except that we need
to account for the fact that each tuple may contribute a value greater than one. A proof
similar to Lemma 4.11 can be given for the lowest upper bound.
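The lower-bound query for sum used in the proof above can be tried out on a small example. The sketch below is again our own instantiation for a single relation R(x, z, u) with key x and positive values of u; the SQL is hand-written for this special case rather than produced by RewriteAgg, and the data are hypothetical. The result is cross-checked against a brute-force enumeration of the repairs.

```python
import sqlite3
from collections import defaultdict
from itertools import product

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (x INTEGER, z TEXT, u INTEGER)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [(1, "a", 5), (1, "a", 3), (2, "a", 4), (2, "b", 9), (3, "b", 2)],
)

# Inner query: QContribConsistent, the minimum contribution of each key value
# whose tuples all agree on z. Outer query: QGlb, the sum of these minima.
Q_GLB_SUM = """
    SELECT z, SUM(bottom) FROM (
        SELECT r.x AS x, r.z AS z, MIN(r.u) AS bottom
        FROM R r
        WHERE NOT EXISTS (SELECT 1 FROM R r2 WHERE r2.x = r.x AND r2.z <> r.z)
        GROUP BY r.x, r.z
        HAVING MIN(r.u) >= 0
    )
    GROUP BY z ORDER BY z
"""
glb_sum = dict(conn.execute(Q_GLB_SUM).fetchall())

# Brute force: enumerate all repairs (one tuple per key value) and take the
# minimum of sum(u) per group.
by_key = defaultdict(list)
for row in conn.execute("SELECT x, z, u FROM R"):
    by_key[row[0]].append(row)


def sum_for(repair, z):
    return sum(u for _, zz, u in repair if zz == z)


brute = {
    z: min(sum_for(rep, z) for rep in product(*by_key.values()))
    for z in glb_sum
}
print(glb_sum, brute)
```

Key value 1 is consistent on z but not on u, so it contributes min(u) = 3; key value 2 is inconsistent on z and therefore contributes nothing to either lower bound.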
4.3.5 Putting It All Together
The next lemma states the correctness of the algorithm RewriteCount. The correctness
for the unary operators can be obtained analogously by employing the optimistic and
pessimistic repairs as shown in Figure 4.4.
Lemma 4.13. Let R be a schema. Let Σ be a set of integrity constraints, consisting
of one key dependency per relation of R. Let q(~z) be a query in Cforest of the following
form:
select ~z, count(*)
from q∗(~w, ~z)
group by ~z
Let Q(~z, l, u) be the first-order aggregate query returned by RewriteCount(q, Σ). Let
I be an instance over R. Then, for every tuple ~t, and pair of real numbers low and up,
we have that (~t, low, up) ∈ aggconsistentΣ(q, I) iff (~t, low, up) ∈ Q(I).
Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at
the roots of all trees of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Following the
algorithm RewriteCount, let ~x = ⋃_{i=1...m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let
~x′ = ~x−~z. Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′). Let QConsistent(~x, ~z′) be the query obtained
by invoking RewriteForest(q′, Σ).
(⇒) Let ~t be a tuple and low and up be real numbers such that
(~t, low, up) ∈ aggconsistentΣ(q, I). By Lemma 4.10, there is a “pessimistic” repair M of I wrt Σ
and a bag B such that B = q(M), and the following conditions hold:
1. for every valuation ν such that ν(~x) = ~c, ν(~z′) = ~t′, and ν(~z) = ~t, if (~c, ~t′) ∈ q′(M),
then ~c ∈ consistentΣ(q′[~z/~t], I), and
2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = low.
Since (~t, low, up) ∈ aggconsistentΣ(q, I), by item (2) above, |~t|B = low. Assume
towards a contradiction that (~t, low, up) ∉ Q(I). Let low′ be a value computed as follows:
QGlb(~z, low′) = select ~z, count(*)
from QConsistent(~x, ~z′)
group by ~z
Assume that low′ < low. Then, there is a valuation ν for the variables of ~x and ~z
such that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, and one of the following conditions holds:
• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or
• (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I).
Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now,
assume that (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I). By Theorem 3.5,
(~c, ~t′) ∉ consistentΣ(q′, I). By Condition 1 above, (~c, ~t′) ∉ q′(M); contradiction.
Assume towards a contradiction that low′ > low. Then, there is a valuation ν for
the variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, (~c, ~t′) ∉ q′(M) and
(~c, ~t′) ∈ QConsistent(I). Since (~c, ~t′) ∈ QConsistent(I), by Theorem 3.5,
(~c, ~t′) ∈ consistentΣ(q′, I). Then, since M is a repair of I wrt Σ, we have that (~c, ~t′) ∈ q′(M);
contradiction.
By Lemma 4.11, there is an “optimistic” repair N of I wrt Σ and a bag B such that
B = q(N ), and the following conditions hold:
1. for every valuation ν such that ν(~x) = ~c and ν(~z) = ~t, if ~c ∈ possibleΣ(q′[~z/~t], I),
then ~c ∈ q′[~z/~t](N ), and
2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = up.
Since (~t, low, up) ∈ aggconsistentΣ(q, I), by item (2) above, |~t|B = up. Assume
towards a contradiction that (~t, low, up) ∉ Q(I). Let up′ be a value computed as follows:
QLub(~z, up′) = select ~z, count(*)
from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~z
Assume that up′ > up. Then, there is a valuation ν for the variables of ~x and
~z such that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, (~c, ~t′) ∉ q′(N ), (~c, ~t′) ∈ q′(I), and
I |= ∃~x′.QConsistent(~x, ~t′). Since (~c, ~t′) ∈ q′(I), (~c, ~t′) ∈ possibleΣ(q′, I). Thus, by Lemma
4.6, (~c, ~t′) ∈ q′(N ); contradiction.
Assume that up′ < up. Then, there is a valuation ν for the variables of ~x and ~z such
that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, and one of the following two cases holds. First,
(~c, ~t′) ∈ q′(N ) and |(~c, ~t′)|B > 1. But this contradicts Lemma 4.4. Second, (~c, ~t′) ∈ q′(N )
and either (1) (~c, ~t′) ∉ q′(I), or (2) I ⊭ ∃~x′.QConsistent(~x, ~t′). Assume that (1)
(~c, ~t′) ∉ q′(I). Since N is a repair of I wrt Σ, N ⊆ I. Thus, (~c, ~t′) ∉ q′(N ); contradiction.
Assume that (2) I ⊭ ∃~x′.QConsistent(~x, ~t′). Recall that ~x′ = ~x − ~z. By Theorem 3.5,
(~c′, ~t′) ∉ consistentΣ(q′, I), for every ~c′. In particular, (~c, ~t′) ∉ consistentΣ(q′, I). Thus,
(~c, ~t′) ∉ q′(N ); contradiction.
(⇐) Let ~t be a tuple and low and up be real numbers such that (~t, low, up) ∈ Q(I). In
order to prove that (~t, low, up) ∈ aggconsistentΣ(q, I), we must show that:
1. For every repair I of I wrt Σ, if B = q(I), then low ≤ |~t|B ≤ up.
2. There is a repair I of I wrt Σ, and a bag B such that B = q(I) and |~t|B = low.
3. There is a repair I of I wrt Σ, and a bag B such that B = q(I) and |~t|B = up.
Claim 1 follows by Lemma 4.7. Claim 2 follows by Lemma 4.10. Claim 3 follows by
Lemma 4.11.
4.4 Related Work
Our work on aggregation is inspired by Arenas et al. [ABC+03b], who were the first to
propose the use of ranges in a semantics for consistent query answering. The work of
Arenas et al. is restricted to queries of the following form:
select F (A)
from r
where F is an aggregation function, r is a single relation, and A is an attribute from
r. Notice that such queries have no grouping and no selection or join conditions (i.e., no
where clause). In this chapter, we consider a much richer class of queries. For the class
of queries considered by Arenas et al., the semantics proposed in their paper and our
semantics for aggregate queries coincide. However, we need to extend their semantics in
order to be able to deal with queries that perform grouping.
In their paper, Arenas et al. [ABC+03b] consider functional dependencies. If there
is exactly one functional dependency on the (only) relation of the query, they show that
the problem of obtaining the lowest upper and greatest lower bounds is tractable for the
count(*), min, max, sum, and avg functions. Except for avg, we considered all these
functions in our class Caggforest. Arenas et al. also show the intractability of queries with
the count(distinct) operator and exactly one functional dependency. If the relation
of the query has more than one functional dependency, they show that the problem
of obtaining tight bounds is intractable for all the aggregate functions they consider
(count(*), min, max, sum, avg, and count(distinct)). This gives further evidence of
the maximality of the class considered in this chapter: going from one to two functional
dependencies may lead to intractability even for queries on just one relation and with no
grouping.
Chapter 5
Complexity-Theoretic Analysis
In the previous chapters, we presented query rewriting algorithms that work on a broad
class of queries. In this chapter, we show the maximality of this class based on complexity-
theoretic arguments. In Section 5.1, we show that minimal relaxations of the conditions of
the class lead to intractability. Then, in Section 5.2, we embark on a more ambitious goal:
for a large class of conjunctive queries, we show that the conditions of the class Cforest
presented in Chapter 3 are not only sufficient, but they are also necessary conditions for
a query to be first-order rewritable.
5.1 Minimal Relaxations of Cforest
In this section, we show that minimal relaxations of the conditions of Cforest lead to
intractability. In particular, we show the intractability of the problem of computing
consistent answers for: (1) a conjunctive query whose join graph is a cycle of length
two; and (2) a conjunctive query whose join graph is a forest, but the query has some
nonkey-to-key joins that are not full.
Chomicki and Marcinkowski [CM05] proved that the problem of computing consistent
answers for a query with a single nonkey-to-nonkey join is coNP-complete. Their result
used a query with repeated relation symbols (specifically, a query with only two literals
both for a single relation R). We can use their insight to show that the problem of
computing consistent answers for the following query without repeated relation symbols,
but with a single nonkey-to-nonkey join is also coNP-complete.
qnk = ∃x, x′, y.S1(x, y) ∧ S2(x′, y)
Notice that qnk has a cycle of length two (actually, a nonkey-to-nonkey join), and
no nonkey-to-key joins. Our proof of hardness is a simple modification to the re-
sults of Chomicki and Marcinkowski [CM05] and uses a reduction from the problem
MONOTONE-3SAT, which is well known to be NP-complete. The only difference between
the MONOTONE-3SAT and 3SAT problems is that the former assumes that the input 3CNF
propositional formula is monotone. That is, each clause Φi contains either positive or
negative atoms, but not both. We shall say that a clause that contains only positive
(negative) atoms is a positive (negative) clause.
Lemma 5.1. Let q be the query ∃x, x′, y.S1(x, y)∧ S2(x′, y). Then, CONSISTENT(q, Σ) is
coNP-hard.
Proof. We will prove hardness by reduction from MONOTONE-3SAT. Let Φ = Φ1∧ · · · ∧Φm
be a 3CNF formula such that each clause Φi contains either positive or negative atoms,
but not both. We shall build an instance I as follows:
• For each positive clause Φi and each atom z that occurs in Φi, we add a tuple
S1(i, z) to I.
• For each negative clause Φi and each atom z that occurs in Φi, we add a tuple
S2(i, z) to I.
We now show that consistentΣ(q, I) = false iff Φ is satisfiable.
(⇒) Since consistentΣ(q, I) = false, there exists a repair I of I such that I ⊭ q.
We now build a valuation v for the variables of Φ as follows. For each variable z, we let
v(z) = true if there is some i such that S1(i, z) ∈ I; and we let v(z) = false if there is
some i such that S2(i, z) ∈ I. It is easy to see that v is a truth valuation that satisfies
Φ.
(⇐) Assume that Φ is satisfiable. Let v be a truth assignment for the variables of Φ that satisfies Φ.
We shall build a repair I as follows. For each positive clause Φi, select a variable z that
appears in Φi and such that v(z) = true. Let S1(i, z) ∈ I. For each negative clause Φi,
select a variable z that appears in Φi and such that v(z) = false. Let S2(i, z) ∈ I. It is
easy to see that I ⊭ q.
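The reduction in the proof of Lemma 5.1 can be checked mechanically on small formulas. The sketch below assumes a hypothetical encoding of our own: a monotone formula is a list of (sign, atoms) pairs, where sign is True for a positive clause, and the first attribute of S1 and S2 (the clause index) is the key. Both the enumeration of repairs and the satisfiability test are brute force, so this is only meant for sanity checks.

```python
from itertools import product


def build_instance(clauses):
    """Return the extensions of S1 and S2 as sets of (clause index, atom)."""
    s1 = {(i, z) for i, (sign, atoms) in enumerate(clauses) if sign for z in atoms}
    s2 = {(i, z) for i, (sign, atoms) in enumerate(clauses) if not sign for z in atoms}
    return s1, s2


def repairs_of(relation):
    """Yield, for every repair, the set of second-attribute values it keeps."""
    groups = {}
    for key, val in sorted(relation):
        groups.setdefault(key, []).append(val)
    for choice in product(*groups.values()):
        yield set(choice)


def some_repair_falsifies_q(clauses):
    """True iff some repair has no value y with S1(x, y) and S2(x', y)."""
    s1, s2 = build_instance(clauses)
    return any(not (a & b) for a in repairs_of(s1) for b in repairs_of(s2))


def satisfiable(clauses):
    """Brute-force satisfiability of a monotone formula."""
    atoms = sorted({z for _, clause_atoms in clauses for z in clause_atoms})
    for bits in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, bits))
        if all(any(v[z] == sign for z in clause_atoms) for sign, clause_atoms in clauses):
            return True
    return False


# (p or q or r) and (not p or not q or not r): satisfiable.
phi_sat = [(True, ("p", "q", "r")), (False, ("p", "q", "r"))]
# (p) and (not p), padded to three atoms per clause: unsatisfiable.
phi_unsat = [(True, ("p", "p", "p")), (False, ("p", "p", "p"))]
print(some_repair_falsifies_q(phi_sat), some_repair_falsifies_q(phi_unsat))
```

A repair picks one atom per clause; the repair falsifies q exactly when no atom is picked by both a positive and a negative clause, which is the condition under which the valuation v of the proof is well defined.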
Now, we show the intractability of the problem for a conjunctive query whose join
graph is a forest, but the query has nonkey-to-key joins that are not full. In particular,
we focus on the following query:
∃x, x′, w, w′, z, z′, m. R1(x, w) ∧ R2(m, w, z) ∧ R3(x′, w′) ∧ R4(m, w′, z′)
We prove hardness by showing a reduction from the problem of computing the con-
sistent answers for the query qnk shown to be coNP-hard in Lemma 5.1.
Lemma 5.2. Let q be the query ∃x, x′, w, w′, z, z′, m. R1(x, w) ∧ R2(m, w, z) ∧ R3(x′, w′) ∧
R4(m, w′, z′). Let q′ be the query ∃x, x′, y. S1(x, y) ∧ S2(x′, y). Then, there is a polynomial
time reduction from the problem CONSISTENT(q′, Σ′) to the problem CONSISTENT(q, Σ).
Proof. Let I ′ be an instance over the schema of q′. We shall build an instance I over the
schema of q as follows:
Initialize I as the empty instance
for each tuple S1(c1, d1) ∈ I ′ do
Add R1(c1, d1) to I
end for
for each tuple S2(c2, d2) ∈ I ′ do
Add R3(c2, d2) to I
end for
Let cz, cz′ be some constants
for each valuation νq′ such that I ′ |= S1(x, y) ∧ S2(x′, y)[νq′ ] do
Let νq(x) = νq′(x)
Let νq(x′) = νq′(x′)
Let νq(w) = νq′(y)
Let νq(w′) = νq′(y)
Let cm be a newly-created constant
Let νq(m) = cm
Let νq(z) = cz
Let νq(z′) = cz′
Add tuple R2(m,w, z)[νq] to I
Add tuple R4(m,w′, z′)[νq] to I
end for
We claim that consistentΣ(q′, I ′) = true iff consistentΣ(q, I) = true.
(⇒) Let I be a repair of I. We shall build an instance I ′ as follows:
for each tuple R1(c1, d1) of I do
Add a tuple S1(c1, d1) to I ′
end for
for each tuple R3(c2, d2) of I do
Add a tuple S2(c2, d2) to I ′
end for
Notice that R1 and S1 (and, similarly, R3 and S2) have the same extensions in I and I ′,
respectively. Thus, since I is a repair of I, I ′ is a repair of I ′. Since consistentΣ(q′, I ′) =
true, I ′ |= q′. Thus, there is a valuation νq′ such that I ′ |= S1(x, y) ∧ S2(x′, y)[νq′ ]. Let
c1 = νq′(x), c2 = νq′(x′), d = νq′(y). Let cz and cz′ be the constants used in the algorithm
that constructs I. Let cm be the constant created in the algorithm for the iteration
corresponding to νq′ . Let νq be a valuation for the variables of q such that:
• νq(x) = c1
• νq(x′) = c2
• νq(w) = d
• νq(w′) = d
• νq(m) = cm
• νq(z) = cz
• νq(z′) = cz′
Since S1(c1, d) ∈ I ′, R1(c1, d) ∈ I. Since S2(c2, d) ∈ I ′, R3(c2, d) ∈ I. By Proposition
3.6, I ′ ⊆ I ′. Thus, S1(c1, d) ∈ I ′ and S2(c2, d) ∈ I ′. Since cm is the constant chosen in the
iteration for νq′ in the algorithm that constructs I, R2(cm, d, cz) ∈ I and R4(cm, d, cz′) ∈ I.
By Proposition 3.7, R2(cm, d, e) ∈ I and R4(cm, d, e′) ∈ I, for some e, e′. Thus, I |= q[νq].
(⇐) Let I ′ be a repair of I ′. We shall build an instance I as follows.
for each tuple S1(c1, d1) ∈ I ′ do
Add R1(c1, d1) to I
end for
for each tuple S2(c2, d2) ∈ I ′ do
Add R3(c2, d2) to I
end for
for each tuple R2(c1, c2, d) ∈ I do
Add R2(c1, c2, d) to I
end for
for each tuple R4(c1, c2, d) ∈ I do
Add R4(c1, c2, d) to I
end for
We now show that I is a repair of I. First, notice that R1 and S1 (and, similarly, R3
and S2) have the same extensions in I and I ′, respectively. Second, in the construction
of I, every tuple of R2 and R4 is given a distinct key value. Then, by Propositions 3.6
and 3.7, every tuple in the extension of R2 in I is in the extension of R2 in I; and every
tuple in the extension of R4 in I is in the extension of R4 in I.
Since consistentΣ(q, I) = true, I |= q. Thus, there exists some valuation νq such
that I |= R1(x,w) ∧ R2(m,w, z) ∧ R3(x′, w′) ∧ R4(m,w′, z′)[νq]. By construction of I, if
R2 and R4 join on m, then νq(w) = νq(w′). Let νq′ be such that:
• νq′(x) = νq(x)
• νq′(x′) = νq(x′)
• νq′(y) = νq(w) = νq(w′)
It is easy to see that I ′ |= S1(x, y) ∧ S2(x′, y)[νq′ ]. Thus, I ′ |= q′.
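The construction of I from I′ used in this proof is short enough to be written out. The sketch below uses hypothetical names of our own (reduce_instance, consistent_q, consistent_q_prime) and assumes that the first attribute of every relation is its key. The two brute-force checkers enumerate repairs of the inconsistent relations only; R2 and R4 need no repairing because every tuple receives a fresh key value.

```python
from itertools import count, product


def reduce_instance(s1, s2, cz="cz", cz_prime="cz'"):
    """Build (R1, R2, R3, R4) from an instance (S1, S2) over the schema of q'."""
    fresh = count()
    r1, r3 = set(s1), set(s2)
    r2, r4 = set(), set()
    for _, y in sorted(s1):
        for _, y2 in sorted(s2):
            if y == y2:  # one valuation of S1(x, y) AND S2(x', y)
                cm = f"m{next(fresh)}"  # newly created constant
                r2.add((cm, y, cz))
                r4.add((cm, y, cz_prime))
    return r1, r2, r3, r4


def repairs_of(relation):
    """Yield every repair of a relation whose first attribute is the key."""
    groups = {}
    for tup in sorted(relation):
        groups.setdefault(tup[0], []).append(tup)
    for choice in product(*groups.values()):
        yield set(choice)


def consistent_q_prime(s1, s2):
    return all(
        any(y == y2 for _, y in a for _, y2 in b)
        for a in repairs_of(s1)
        for b in repairs_of(s2)
    )


def consistent_q(r1, r2, r3, r4):
    def holds(a, b):
        w_in_r1 = {w for _, w in a}
        w_in_r3 = {w for _, w in b}
        return any(
            m == m2 and w in w_in_r1 and w2 in w_in_r3
            for m, w, _ in r2
            for m2, w2, _ in r4
        )

    return all(holds(a, b) for a in repairs_of(r1) for b in repairs_of(r3))


S1 = {(1, "a"), (1, "b")}
S2 = {(2, "a")}
print(consistent_q_prime(S1, S2), consistent_q(*reduce_instance(S1, S2)))
```

In this example key value 1 of S1 may keep either a or b, so q′ fails in one repair, and the same choice falsifies q on the constructed instance.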
5.2 A Dichotomy Result
5.2.1 The Class C∗
In Chapter 3, we presented a query rewriting algorithm which works on a class of queries
that we call Cforest. Clearly, Cforest gives sufficient conditions for a query to be first-
order rewritable. In this section, we address the following question: for which class of
queries does Cforest also give necessary conditions? That is, we show a class of queries
such that the problem of computing the consistent answers is coNP-complete for every
query of the class which does not satisfy the conditions of Cforest. Notice that this
establishes a dichotomy between first-order rewritability and coNP-completeness, and
is therefore much stronger than the complexity results that we presented in Section
5.1 (and, in fact, all the complexity results present in the consistent query answering
literature [CLR03a, CM05]). In the literature, a class C is said to be coNP-hard if there
is at least one query q ∈ C such that CONSISTENT(q, Σ) is a coNP-hard problem. Under
such a definition, it suffices to exhibit just one intractable query in order to conclude
that the entire class is coNP-complete. In contrast, in this section we will present a class
of queries such that for every query q in the class, CONSISTENT(q, Σ) is coNP-complete.
We will focus on conjunctive queries without repeated relation symbols and all of
whose nonkey-to-key joins are full. Within this class, there are some queries for which
the existence of a cycle is not a sufficient condition for intractability. Consider, for
example, the query q = ∃x, y.R1(x, y) ∧ R2(x, y). The join graph of this query is not a
forest; yet, it can be rewritten as follows:
∃x, y.R1(x, y) ∧R2(x, y) ∧ ∀y′.(R1(x, y′) → y′ = y) ∧ ∀y′.(R2(x, y′) → y′ = y)
Recall that the problem of computing consistent answers is intractable for the query
qnk = ∃x, x′, y.R1(x, y)∧R2(x′, y). Notice that qnk and q have exactly the same join graph.
The only difference between them is that in qnk, the two literals are related exclusively
by a nonkey-to-nonkey join; whereas in q, they are related by both a key-to-key and a
nonkey-to-nonkey join. Our intuition is that a query with a cyclic join graph may be
tractable only if there are literals related by more than one type of join (e.g., nonkey-
to-nonkey and key-to-key). We formalize this intuition with the definition of a class C∗,
which essentially “separates” the different types of joins of the query. In C∗, every pair of
literals can be related by at most one type of join (i.e., key-to-key, nonkey-to-nonkey,
and nonkey-to-key).
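The rewriting given above for q = ∃x, y.R1(x, y) ∧ R2(x, y) can be validated against the definition of consistent answers on small instances. The following sketch is ours: it evaluates the rewritten formula directly on Python sets of pairs, assuming that the first attribute of each relation is its key, and compares the result with a brute-force evaluation of q on every repair. The function names are hypothetical.

```python
from itertools import product


def repairs_of(relation):
    """Yield every repair of a binary relation whose first attribute is the key."""
    groups = {}
    for tup in sorted(relation):
        groups.setdefault(tup[0], []).append(tup)
    for choice in product(*groups.values()):
        yield set(choice)


def brute_force_consistent(r1, r2):
    """q is consistently true iff R1 and R2 share a tuple in every repair."""
    return all(bool(a & b) for a in repairs_of(r1) for b in repairs_of(r2))


def rewritten(r1, r2):
    """Evaluate: exists x, y. R1(x, y) and R2(x, y)
    and forall y'. (R1(x, y') -> y' = y) and forall y'. (R2(x, y') -> y' = y)."""
    return any(
        (x, y) in r2
        and all(y2 == y for x2, y2 in r1 if x2 == x)
        and all(y2 == y for x2, y2 in r2 if x2 == x)
        for x, y in r1
    )


R1 = {(1, "a"), (2, "b")}
R2 = {(1, "a"), (2, "a"), (2, "b")}
print(rewritten(R1, R2), brute_force_consistent(R1, R2))
```

Here key value 1 has the single value a in both relations, so the tuple (1, a) survives in every repair of R1 and of R2 and both evaluations agree.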
Definition 5.3. Let q be a conjunctive query without repeated relation symbols and all
of whose nonkey-to-key joins are full. We say that q is in class C∗ if for every pair R
and R′ of literals of q at most one of the following conditions holds:
• there is a key-to-key join between R and R′.
• there is a nonkey-to-nonkey join between R and R′.
• there are literals R1 . . . Rm in q such that there is a nonkey-to-key join from R to
R1, from Rm to R′, and from Ri to Ri+1, for every i such that 1 ≤ i < m.
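Membership in C∗ is a purely syntactic test, which the following sketch tries to make explicit. It rests on several assumptions of our own, stated here because the precise definitions of the join types are given elsewhere in the thesis: a literal is represented as a pair (key variables, nonkey variables); a key-to-key join is a variable shared by two keys; a nonkey-to-nonkey join is a variable shared by two nonkey parts; and a nonkey-to-key join from R to R′ is a nonkey variable of R that occurs in the key of R′. We also treat a direct nonkey-to-key join as a chain of length one. The preconditions of Definition 5.3 (no repeated relation symbols, full nonkey-to-key joins) are not checked.

```python
from itertools import permutations


def in_c_star(literals):
    """literals: dict mapping a relation name to (key_vars, nonkey_vars),
    each a set of variable names. Returns True iff every ordered pair of
    literals is related by at most one of the three kinds of join."""
    # Direct nonkey-to-key joins, as directed edges.
    edge = {
        (a, b)
        for a, b in permutations(literals, 2)
        if literals[a][1] & literals[b][0]
    }
    # Chains of nonkey-to-key joins (third condition): transitive closure.
    reach = set(edge)
    changed = True
    while changed:
        changed = False
        for a, b in list(reach):
            for c, d in edge:
                if b == c and a != d and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    for a, b in permutations(literals, 2):
        kinds = [
            bool(literals[a][0] & literals[b][0]),  # key-to-key
            bool(literals[a][1] & literals[b][1]),  # nonkey-to-nonkey
            (a, b) in reach,                        # chain of nonkey-to-key joins
        ]
        if sum(kinds) > 1:
            return False
    return True


# q = R1(x, y) AND R2(x, y): key-to-key on x and nonkey-to-nonkey on y.
q = {"R1": ({"x"}, {"y"}), "R2": ({"x"}, {"y"})}
# qnk = R1(x, y) AND R2(x', y): nonkey-to-nonkey only.
qnk = {"R1": ({"x"}, {"y"}), "R2": ({"x2"}, {"y"})}
print(in_c_star(q), in_c_star(qnk))
```

Under these assumptions the two queries discussed above fall on opposite sides: q is outside C∗, whereas qnk is inside it.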
Notice that C∗ is a fairly broad class of queries. For example, it includes the class
of queries that have exclusively nonkey-to-key joins. In general, the only queries that
are outside C∗ are the ones that have a pair of literals related by more than one type of
join. As anecdotal evidence of the practicality of the class, the only query in the TPC-H
benchmark [TPC03] that has nonkey-to-nonkey joins (Query 5) is in C∗. From the results
of this chapter, we can immediately conclude that the problem of computing consistent
answers for this query is not first-order rewritable.
We will consider a class, called Chard, of all queries of C∗ that are not in Cforest. The
main result of this chapter, Theorem 5.5, proves that the problem of computing the
consistent answers for every query of Chard is coNP-complete.
Definition 5.4. We say that a query q is in class Chard if q ∈ C∗ and q ∉ Cforest.
Theorem 5.5. Let q be a query such that q ∈ Chard. Then, CONSISTENT(q, Σ) is coNP-
complete in data complexity.
Our motivation to provide a dichotomy for C∗ is the following. First, for a fairly broad
class of queries we can test in polynomial time if the problem of computing consistent
answers is tractable. Second, our results are an initial step towards proving a dichotomy
for the larger class of all conjunctive queries. Indeed, as a result of our work, future
efforts for finding dichotomy results for conjunctive queries need to focus only on queries
whose literals are related by more than one type of join.1
In general, by Ladner’s Theorem [Lad75], if P ≠ NP, there are classes of coNP problems for
which there is no dichotomy between P and coNP-complete problems. However, this
is not the case for the class of queries that is the focus of this section. In fact, as a
corollary of Theorems 3.5 and 5.5, we get a dichotomy between membership in P and
coNP-completeness. Notice that, given a query q such that q ∈ C∗, it can be decided in
polynomial time on which side of the dichotomy the query q falls.
Corollary 5.6. Let q be a query such that q ∈ C∗. Then, CONSISTENT(q, Σ) is either in
P, or it is coNP-complete.
Under a complexity-theoretic assumption, we also get a dichotomy between first-order
rewritability and first-order inexpressibility for the class C∗. That is, for all the queries
of C∗ that are not in Chard, we can produce a first-order rewriting using our algorithm
1 Since C∗ intersects, but does not contain, Cforest, we know that there are queries outside C∗ for which the problem of computing consistent answers is tractable.
RewriteForest. For the queries of Chard, since the problem of obtaining consistent an-
swers is coNP-complete, there is no first-order rewriting, unless P=NP (which is unlikely).
Corollary 5.7. Let q be a query such that q ∈ C∗. Assuming P ≠ NP, the problem
CONSISTENT(q, Σ) is first-order rewritable iff q ∈ Cforest.
Tractable but not First-Order Rewritable Queries
An interesting question is whether there are queries for which the problem of computing
consistent answers is tractable, yet not first-order rewritable. Although this remains
open for conjunctive queries without inequalities, we now show that there are tractable
conjunctive queries with inequalities that are not first-order rewritable.
Consider a schema with one binary relation R(E, S). Assume that E is the key of
the relation. Consider the following query q:
q = ∃e1, e2, s : R(e1, s) ∧ R(e2, s) ∧ e1 ≠ e2
In order to find the consistent answers for q, we construct a graph of the inconsistent
database instance as follows.2 Let I be a database instance with one binary relation
R(E, S). The graph G of I is a bipartite graph with partitions E and S. Partitions
E and S have one vertex for each value in the active domain of attributes E and S,
respectively. The set of edges of G consists of all tuples (e, s) of R.
We use the graph of I to introduce the following necessary and sufficient condition
for consistentΣ(q, I) = false.
Lemma 5.8. Let I be a database with one binary relation R(E, S), possibly inconsistent
wrt a functional dependency Σ = {E → S}. Then, consistentΣ(q, I) = false iff the
graph G of I has a perfect matching.
Proof. (⇐) Assume that G has a perfect matching M. We can build an instance ℐ by
creating a tuple in ℐ for each edge in M. Since M is a matching, each vertex from
partition S is incident to at most one edge. Therefore, ℐ ⊭ q. Also, since the matching
is perfect, every key appears in ℐ. Consequently, ℐ is minimal, and therefore it is a repair
of I wrt Σ.
2 Notice that unlike the join graph of a query, this graph is constructed from a database instance, not a query.
(⇒) Assume that consistentΣ(q, I) = false. Then, there must exist a repair ℐ of I
wrt Σ such that ℐ ⊭ q. We can construct a graph G′ by selecting the edges of G that
correspond to tuples of ℐ. It is easy to see that G′ is a perfect matching of G.
There are a number of algorithms in the literature for deciding the existence of a
perfect bipartite matching. For example, one of the best known is given by Hopcroft and
Karp [HK75], and runs in O(n^2.5) time. Therefore, q is a tractable query. We now show
that no approach based on query-rewriting works for q.
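The condition of Lemma 5.8 can be tested with a short program. The following Python sketch is only an illustration and makes two assumptions: it reads "perfect matching" as in the proof of the lemma (every vertex of partition E is matched, and each vertex of partition S is used at most once), and it uses simple augmenting paths rather than the Hopcroft-Karp algorithm. The function names are hypothetical. A brute-force enumeration of repairs is included only as a cross-check.

```python
from itertools import product


def has_falsifying_repair(r):
    """Return True iff some repair of r (with key E) makes q false.

    r is a set of (e, s) pairs. Under the reading of Lemma 5.8 assumed
    here, this holds iff every key value e can be matched to a distinct s.
    """
    choices = {}
    for e, s in r:
        choices.setdefault(e, []).append(s)
    matched = {}  # s value -> key value e currently matched to it

    def augment(e, seen):
        # Try to match e, re-matching earlier keys along an augmenting path.
        for s in choices[e]:
            if s in seen:
                continue
            seen.add(s)
            if s not in matched or augment(matched[s], seen):
                matched[s] = e
                return True
        return False

    return all(augment(e, set()) for e in choices)


def has_falsifying_repair_bruteforce(r):
    """Cross-check: enumerate all repairs (one s per key value e)."""
    choices = {}
    for e, s in r:
        choices.setdefault(e, []).append(s)
    for pick in product(*choices.values()):
        if len(set(pick)) == len(pick):  # all chosen s values are distinct
            return True
    return False
```

The brute-force version is exponential in the number of conflicting keys, whereas the matching-based test runs in polynomial time, which is the point of the tractability claim.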
Theorem 5.9. There is no first-order rewriting Q of q such that consistentΣ(q, I) =
Q(I) for every instance I.
Proof. Let A1, . . . , An be a family of sets. A system of distinct rep-
resentatives [Ost70] of A1, . . . , An is a sequence of n distinct elements a1, . . . , an with
ai ∈ Ai, 1 ≤ i ≤ n. Let R be a binary relation that encodes A1, . . . , An as follows:
R(i, x) iff x ∈ Ai. Let G be the graph of R as constructed above. Clearly, G has a
perfect matching iff A1, ..., An has a system of distinct representatives. By Lemma 5.8,
consistentΣ(q, I) = false iff G has a perfect matching.
Let I be the database instance that consists of relation R. Assume that there is
a first-order query Q such that I ⊭ Q iff consistentΣ(q, I) = false. Then, Q can
test whether A1, ..., An has a system of distinct representatives. But it is known in the
literature [LW95] that relational algebra, with an appropriate encoding of sets, cannot
test whether a family of sets has a system of distinct representatives; contradiction.
5.2.2 Basic Intractable Cases
The intractability of all queries in Chard will be shown as follows. First, we show in
Lemma 5.10 that the problem of computing consistent answers for conjunctive queries
is in coNP. This is a result known in the literature, but we briefly give a proof for our
setting. For hardness, we will use a reduction from the problem of computing consistent
answers for one of two particular queries to the problem of computing consistent answers
for q. One of these specific queries is the query qnk = ∃x, x′, y.S1(x, y) ∧ S2(x′, y). This
query has a nonkey-to-nonkey join, and was shown to be intractable in Lemma 5.1. The
other query has a cycle of nonkey-to-key joins, and is shown to be intractable in Lemma
5.11.
The next lemma shows that the problem of computing consistent answers for con-
junctive queries is in coNP.
Lemma 5.10. Let q be a conjunctive query. The problem CONSISTENT(q, Σ) is in coNP.
Proof. Let I be an instance. In order to decide whether ~t ∉ consistentΣ(q, I), it suffices
to show a repair ℐ of I such that ℐ ⊭ q[~t]. The size of ℐ is polynomially bounded by the
size of I. In particular, by Proposition 3.6, ℐ ⊆ I. Furthermore, ℐ ⊭ q[~t] can be checked
in polynomial time, since q is a conjunctive query.
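The membership argument amounts to verifying a certificate in polynomial time. The Python sketch below illustrates this for the Boolean query qnk = ∃x, x′, y.S1(x, y) ∧ S2(x′, y). The encoding of tuples as (relation, key, nonkey) triples and all function names are assumptions made for the example, as is the reading of a repair under key constraints as a subset of the instance that keeps exactly one tuple per key value.

```python
def is_repair(candidate, instance):
    """Check that candidate is a repair of instance under key constraints.

    Both arguments are sets of (relation, key, nonkey) triples. Assumed
    reading: a repair is a subset of the instance with exactly one tuple
    for every (relation, key) pair that occurs in the instance.
    """
    if not candidate <= instance:
        return False
    keys_in_instance = {(rel, k) for rel, k, _ in instance}
    keys_in_candidate = [(rel, k) for rel, k, _ in candidate]
    no_conflicts = len(keys_in_candidate) == len(set(keys_in_candidate))
    return no_conflicts and set(keys_in_candidate) == keys_in_instance


def satisfies_qnk(db):
    """Evaluate q_nk: S1 and S2 share a value in their nonkey attribute."""
    s1_values = {y for rel, _, y in db if rel == "S1"}
    s2_values = {y for rel, _, y in db if rel == "S2"}
    return bool(s1_values & s2_values)


def refutes_consistency(candidate, instance, query):
    """A certificate for consistent(q, I) = false: a repair falsifying q."""
    return is_repair(candidate, instance) and not query(candidate)
```

Both checks take time polynomial in the size of the instance; guessing the certificate is the only nondeterministic step.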
In the next lemma, we show the coNP hardness of computing consistent answers for
one of the two particular queries that will be used in Lemma 5.14. The coNP hardness
of the other query was proven in Lemma 5.1.
Lemma 5.11. Let q = ∃x, y.T1(x, y) ∧ T2(y, x). Then, the problem CONSISTENT(q, Σ) is
coNP-hard.
Proof. We will prove hardness by reduction from MONOTONE-3SAT. Let Φ = Φ1∧ · · · ∧Φm
be a monotone 3CNF formula. We shall build an instance I as follows:
• For each atom z, let Φi1 , . . . , Φin be the positive clauses where z occurs. Add tuples
T1(< Φi1 , . . . , Φin >, z) and T2(z, < Φi1 , . . . , Φin >) to I.
• For each atom z, let Φi1 , . . . , Φin be the negative clauses where z occurs. Add tuples
T1(< Φi1 , . . . , Φin >, z) and T2(z, < Φi1 , . . . , Φin >) to I.
We now show that consistentΣ(q, I) = false iff Φ is satisfiable.
(⇒) Since consistentΣ(q, I) = false, there exists a repair ℐ of I such that ℐ ⊭ q.
Assume towards a contradiction that there are tuples T1(c, z) ∈ ℐ and T1(c′, z) ∈ ℐ such
that c ≠ c′. By construction of I, if T2(z, d) ∈ I, then d = c or d = c′. By Propositions
3.6 and 3.7, either T2(z, c) ∈ ℐ or T2(z, c′) ∈ ℐ. Thus, ℐ ⊨ q; contradiction. Hence, for
each variable z there is at most one c such that T1(c, z) ∈ ℐ.
We now build a valuation v for the variables of Φ as follows. For each variable z,
we let v(z) = true if there is some c such that T1(c, z) ∈ ℐ and c is a list of positive
clauses; and we let v(z) = false if there is some c such that T1(c, z) ∈ ℐ, and c is a list
of negative clauses. It is easy to see that v is a truth valuation that satisfies Φ.
(⇐) Assume that Φ is satisfiable. Let v be a truth assignment for the variables of Φ
that satisfies Φ. We shall build a repair ℐ as follows. For each positive clause Φi, select a
variable z that appears in Φi and such that v(z) = true. Add T1(c, z) to ℐ, where c is a
list of positive clauses. For each negative clause Φi, select a variable z that appears in Φi
and such that v(z) = false. Add T1(c, z) to ℐ, where c is a list of negative clauses. For
each variable z, if v(z) = false, add T2(z, c) to ℐ, where c is a list of positive clauses; if
v(z) = true, add T2(z, c) to ℐ, where c is a list of negative clauses. It is easy to see that
ℐ ⊭ q.
We now give some auxiliary results before proving Lemma 5.14. The next lemma
generalizes Lemma 5.11 from cycles of length two to the case of cycles of arbitrary length.
Lemma 5.12. Let q be the query ∃w1, . . . , wm.S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm).
Let q′ = ∃x, y.T1(x, y) ∧ T2(y, x). Then, there is a polynomial-time reduction from the
problem CONSISTENT(q′, Σ′) to the problem CONSISTENT(q, Σ).
Proof. Let I ′ be an instance over the schema of q′. We shall build an instance I over the
schema of q as follows:
for each valuation νq′ for the variables of q′ such that I′ ⊨ T1(x, y) ∧ T2(y, x)[νq′] do
  Let νq(wm) = νq′(x)
  Let νq(w1) = νq′(y)
  Create a new constant cnew
  for i := 2 to m − 1 do
    Let νq(wi) = cnew
  end for
  Add the tuples of S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm)[νq] to I
end for
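The construction above can be sketched in Python as follows. This is an illustrative transcription of the pseudocode, not code from the thesis: relations are represented as sets of pairs with the key in the first position, and the function name and the encoding of fresh constants as ("new", k) tuples are assumptions.

```python
from itertools import count


def build_cycle_instance(t1, t2, m):
    """Build the instance I over S1..Sm from an instance I' over T1, T2.

    t1 and t2 are sets of pairs; m >= 2 is the length of the cycle.
    Returns a dict mapping "S1".."Sm" to sets of pairs.
    """
    if m < 2:
        raise ValueError("the cycle must have at least two literals")
    fresh = count()
    out = {f"S{i}": set() for i in range(1, m + 1)}
    for (x, y) in t1:
        if (y, x) not in t2:
            continue  # not a valuation satisfying T1(x, y) and T2(y, x)
        c_new = ("new", next(fresh))  # one fresh constant per valuation
        w = {i: c_new for i in range(2, m)}  # w_2 .. w_{m-1}
        w[1] = y
        w[m] = x
        out["S1"].add((w[m], w[1]))
        for i in range(2, m + 1):
            out[f"S{i}"].add((w[i - 1], w[i]))
    return out
```

The construction iterates once over the pairs of T1 and creates m tuples per satisfying valuation, so it runs in time polynomial in the size of the input instance.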
We claim that consistentΣ(q′, I ′) = true iff consistentΣ(q, I) = true.
(⇒) Let ℐ be a repair of I over the schema of q. We shall build a repair ℐ′ over the
schema of q′ as follows:
for each tuple S1(cm, c1) of ℐ do
  Add a tuple T1(cm, c1) to ℐ′
  for each cnew such that S2(c1, cnew) ∈ ℐ and Sm(cnew, cm) ∈ ℐ do
    Add a tuple T2(c1, cm) to ℐ′
  end for
end for
Since consistentΣ(q′, I′) = true, ℐ′ ⊨ q′. Thus, there is a valuation νq′ such that
ℐ′ ⊨ T1(x, y) ∧ T2(y, x)[νq′]. Let cm = νq′(x), c1 = νq′(y). Since T2(c1, cm) ∈ ℐ′, there
exists cnew such that S2(c1, cnew) ∈ ℐ and Sm(cnew, cm) ∈ ℐ. Let νq be a valuation for the
variables of q such that:
• νq(wm) = cm
• νq(w1) = c1
• νq(wi) = cnew, for 1 < i < m
Since T1(cm, c1) ∈ ℐ′, S1(cm, c1) ∈ ℐ. By the choice of cnew, S2(c1, cnew) ∈ ℐ and
Sm(cnew, cm) ∈ ℐ. For 2 < i ≤ m, notice that by construction of I, there are no tuples
Si(ci, di) and Si(ci, d′i) in I such that di ≠ d′i. Therefore, by Propositions 3.6 and 3.7,
every tuple in the extension of Si in I appears in the extension of Si in ℐ. By construction
of I, Si(cnew, cnew) ∈ I, for 3 ≤ i ≤ m − 1. Thus, Si(cnew, cnew) ∈ ℐ. We conclude that
ℐ ⊨ S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm)[νq]. Thus, ℐ ⊨ q.
(⇐) Let ℐ′ be a repair of I′. We shall build an instance ℐ as follows.
for each tuple T1(cm, c1) of ℐ′ do
  Add a tuple S1(cm, c1) to ℐ
  Let cnew be a constant such that S2(c1, cnew) ∈ I and Sm(cnew, cm) ∈ I
  Add a tuple S2(c1, cnew) to ℐ
  for i := 3 to m − 1 do
    Add a tuple Si(cnew, cnew) to ℐ
  end for
  Add a tuple Sm(cnew, cm) to ℐ
end for
It is easy to see that ℐ is a repair of I. Since consistentΣ(q, I) = true, ℐ ⊨ q.
Thus, there exists some valuation νq such that ℐ ⊨ S1(wm, w1) ∧ S2(w1, w2) ∧
· · · ∧ Sm(wm−1, wm)[νq]. Let νq′ be such that:
• νq′(x) = νq(wm)
• νq′(y) = νq(w1)
It is easy to see that ℐ′ ⊨ T1(x, y) ∧ T2(y, x)[νq′]. Thus, ℐ′ ⊨ q′.
5.2.3 Generalizing the Basic Cases
Our strategy for proving the dichotomy will be to show that if q has a subquery q′ that
is known to be intractable (in particular, a cycle), then q is not tractable. This does not
hold in general, but as we show with the next auxiliary result, it holds for the queries in
C∗.
Lemma 5.13. Let q be a Boolean query such that q ∈ C∗. Let R1(~x1, ~y1), . . . ,
Rn(~xn, ~yn) be the literals of q. Let q′ be a Boolean query. Let S1(x1, y1), . . . ,
Sm(xm, ym) be the literals of q′, where m ≤ n. Assume that the join graph of q′ is a cycle.
Let L = {x1, y1, . . . , xm, ym}. Assume that:
• xi occurs in ~xi, for 1 ≤ i ≤ m, and
• yi occurs in ~yi, for 1 ≤ i ≤ m, and
• for 1 ≤ i ≤ m, if w ∈ L and w occurs in Ri, then w occurs in Si.
Then, there is a polynomial-time reduction from the problem CONSISTENT(q′, Σ′) to
CONSISTENT(q, Σ).
Proof. Let F = {w : w occurs in Ri, and 1 ≤ i ≤ m}−L. Let U = {w : w occurs in q}−F − L.
Let I ′ be an instance over the schema of q′. We shall build an instance I over the
schema of q as follows:
for each variable w such that w ∈ F do
  Create a new constant cnew
  Let νF(w) = cnew
end for
for each valuation νq′ for the variables of q′ such that I′ ⊨ S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′] do
  for each variable w such that w ∈ F do
    Let νq(w) = νF(w)
  end for
  for each variable w such that w ∈ U do
    Create a new constant cnew
    Let νq(w) = cnew
  end for
  for i := 1 to m do
    Let νq(xi) = νq′(xi)
    Let νq(yi) = νq′(yi)
  end for
  Add the tuples of R1(~x1, ~y1) ∧ · · · ∧ Rn(~xn, ~yn)[νq] to I
end for
We claim that consistentΣ(q′, I ′) = true iff consistentΣ(q, I) = true.
(⇒) Let ℐ be a repair of I over the schema of q. We shall build an instance ℐ′ over
the schema of q′ as follows.
for i := 1 to m do
  for each tuple Ri(~ci, ~di) of ℐ do
    Let ci be the constant that appears in ~ci at the position of one of the occurrences
      of xi in ~xi
    Let di be the constant that appears in ~di at the position of yi in ~yi
    Add Si(ci, di) to ℐ′
  end for
end for
We make the following observations with respect to the construction of ℐ′. By con-
struction of I, if Ri(~ci, ~di) ∈ I, the same constant appears in ~ci at all the positions where
xi appears in ~xi. By Proposition 3.6, ℐ ⊆ I. Thus, in the construction of ℐ′, it suffices
to choose the constant that occurs in ~ci at any of the positions where xi occurs in ~xi.
Assume that ℐ′ is not a repair of I′. Then, there are constants ci, di and d′i such
that di ≠ d′i, Si(ci, di) ∈ ℐ′ and Si(ci, d′i) ∈ ℐ′. By construction of ℐ′, there are tuples
Ri(~ci, ~di) ∈ ℐ and Ri(~c′i, ~d′i) ∈ ℐ such that ci appears in ~ci and ~c′i at all the positions
where xi appears in ~xi; and di and d′i appear in ~di and ~d′i, respectively, at the position
of yi in ~yi. Clearly, ~di ≠ ~d′i. By construction of I, if w is a variable such that w ∉ L,
w is assigned the value νF(w) in every tuple of I. By Proposition 3.6, ℐ ⊆ I. Thus,
~ci = ~c′i. Since ~di ≠ ~d′i, ℐ does not satisfy the key constraints of Σ. Thus ℐ is not a repair;
contradiction. We conclude that ℐ′ is a repair of I′.
Since consistentΣ(q′, I′) = true, ℐ′ ⊨ q′. Thus, there is some valuation νq′ such
that ℐ′ ⊨ S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′]. Let νm be a valuation for the variables of
R1, . . . , Rm such that:
• νm(xi) = νq′(xi), for 1 ≤ i ≤ m
• νm(yi) = νq′(yi), for 1 ≤ i ≤ m
• νm(w) = νF(w) if w ∈ F
Let w be a variable that appears in Ri, for 1 ≤ i ≤ m. If w ∈ L and w occurs in
Ri, by hypothesis, w occurs in Si. If w ∉ L, then w ∈ F, by definition of F. Since
ℐ′ ⊨ S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′], and νm(w) = νF(w) if w ∈ F, we conclude that
ℐ ⊨ R1(~x1, ~y1) ∧ · · · ∧ Rm(~xm, ~ym)[νm].
By construction of I, there is a valuation νq for the variables of q such that:
• νm(w) = νq(w) if w appears in Ri, for 1 ≤ i ≤ m; and
• I ⊨ R1(~x1, ~y1) ∧ · · · ∧ Rn(~xn, ~yn)[νq].
Let Ri(~xi, ~yi) be a literal of q such that i > m. Notice that we assume that the join
graph of q′ is a cycle. Since q is in C∗, there exists some variable w such that w occurs in
~xi and w does not occur in any of R1, . . . , Rm. Thus, w ∈ U. Since the variables of U are
assigned a distinct constant in every iteration of the algorithm that constructs I, if two
tuples Ri(~ci, ~di) and Ri(~c′i, ~d′i) are added at different iterations, then ~ci ≠ ~c′i. Therefore,
by Propositions 3.6 and 3.7, every tuple in the extension of Ri in I is in the extension of
Ri in ℐ. Therefore, ℐ ⊨ R1(~x1, ~y1) ∧ · · · ∧ Rn(~xn, ~yn)[νq].
(⇐) Let ℐ′ be a repair of I′. We shall build an instance ℐ as follows.
for i := 1 to m do
  for each tuple Si(ci, di) of ℐ′ do
    Let Ri(~ci, ~di) be a tuple of I such that ci appears in ~ci at all the positions of xi in
      ~xi, and di appears in ~di at the position of yi in ~yi
    Add Ri(~ci, ~di) to ℐ
  end for
end for
for i := m + 1 to n do
  for each tuple Ri(~ci, ~di) in I do
    Add Ri(~ci, ~di) to ℐ
  end for
end for
We will now show that ℐ is a repair of I. Towards a contradiction, assume that ℐ is
not a repair of I. Then, there are values ~ci, ~di, and ~d′i such that ~di ≠ ~d′i, Ri(~ci, ~di) ∈ ℐ,
and Ri(~ci, ~d′i) ∈ ℐ.
First, assume that 1 ≤ i ≤ m. For every variable w such that w ∉ L and w occurs
in Ri, w ∈ F. Thus, w is assigned the same constant νF(w) in every tuple of I. By
Proposition 3.6, ℐ ⊆ I. Therefore, there are constants ci, di and d′i such that di ≠ d′i, ci
appears in ~ci at the positions of xi in ~xi, and di and d′i appear in ~di and ~d′i, respectively,
at the position of yi in ~yi. By construction of ℐ, there are tuples Si(ci, di) and Si(ci, d′i)
in ℐ′. Since di ≠ d′i, ℐ′ does not satisfy the key constraints of Σ′. Thus, ℐ′ is not a repair;
contradiction.
Now, assume that m < i ≤ n. Notice that we assume that the join graph of q′ is
a cycle. Since q is in C∗, there exists some variable w such that w occurs in ~xi and
w does not occur in any of R1, . . . , Rm. Thus, w ∈ U. Since the variables of U are
assigned a different constant in every iteration of the algorithm that constructs I, if two
tuples Ri(~ci, ~di) and Ri(~c′i, ~d′i) are added at different iterations, then ~ci ≠ ~c′i. Therefore,
the extension of Ri in I satisfies the key dependencies of Σ. Thus, by construction of
ℐ, the extension of Ri in ℐ satisfies the key constraints of Σ. Thus, ℐ is a repair of I;
contradiction.
We conclude that ℐ is a repair of I. Since consistentΣ(q, I) = true, ℐ ⊨ q. Thus,
there exists some valuation νq such that ℐ ⊨ R1(~x1, ~y1) ∧ · · · ∧ Rn(~xn, ~yn)[νq]. Let νq′ be
a valuation for the variables of q′ such that, for 1 ≤ i ≤ m:
• νq′(xi) = νq(xi)
• νq′(yi) = νq(yi)
It is easy to see that ℐ′ ⊨ S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′]. Thus, ℐ′ ⊨ q′.
We are now ready to prove Lemma 5.14, which gives a polynomial-time reduction
from the problem of computing consistent answers for the queries of Lemmas 5.1 or 5.11
to every query in Chard. From this, Theorem 5.5 follows directly.
Lemma 5.14. Let q be a query such that q ∈ Chard. Then, there is a polynomial-time
reduction from CONSISTENT(q′, Σ′) to CONSISTENT(q, Σ), where q′ is one of the following
queries:
• ∃x, x′, y.S1(x, y) ∧ S2(x′, y)
• ∃x, y.T1(x, y) ∧ T2(y, x)
Proof. Let G be the join graph of q. Let G′ be an induced subgraph of G such that:
• G′ is connected, and
• G′ is not a tree, and
• if G′′ is a proper induced subgraph of G′, and G′′ is connected, then G′′ is a tree.
Let P = 〈R1, R2, R1〉 be a cycle of G′. Let R1(~x1, ~y1) and R2(~x2, ~y2) be the literals in
G′. Assume that there is some variable y such that y occurs in ~y1 and ~y2. By Definition
of C∗, there is no key-to-key join between R1 and R2. Therefore, there exists a variable
x such that x occurs in ~x1, and x does not occur in ~x2; and a variable x′ such that x′
occurs in ~x2 and x′ does not occur in ~x1. Let q′ = S1(x, y) ∧ S2(x′, y). By Lemma 5.13,
there is a polynomial-time reduction from CONSISTENT(q′, Σ′) to CONSISTENT(q, Σ).
Let P = 〈R1, . . . , Rm, R1〉 be a cycle of G′. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the
literals of P. Let w1, w2, . . . , wm be variables such that wi occurs in ~yi and in R_{(i mod m)+1},
for every 1 ≤ i ≤ m. Assume that there is some wi such that 1 ≤ i ≤ m and wi occurs in
some literal Rj of q such that j ≠ i and j ≠ (i mod m) + 1. Then {R1, . . . , Ri, Rj, . . . , R1}
is a cycle. Therefore G′ contains a proper induced subgraph G′′ such that G′′ is connected,
and G′′ is not a tree; contradiction. Let q′′ = S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm).
It can be checked that q and q′′ satisfy the conditions of Lemma 5.13. Consequently,
there is a polynomial-time reduction from CONSISTENT(q′′, Σ′′) to CONSISTENT(q, Σ). Let
q′ = ∃x, y.T1(x, y) ∧ T2(y, x). By Lemma 5.12, there is a polynomial-time reduction from
CONSISTENT(q′, Σ′) to CONSISTENT(q′′, Σ′′).
Finally, we give the proof for Theorem 5.5, the main result of this chapter.
Theorem 5.5. Let q be a query such that q ∈ Chard. Then, CONSISTENT(q, Σ) is coNP-
complete in data complexity.
Proof. By Lemma 5.10, CONSISTENT(q, Σ) is in coNP. In order to prove hardness, let q′
be one of the following queries:
• ∃x, x′, y.S1(x, y) ∧ S2(x′, y)
• ∃x, y.T1(x, y) ∧ T2(y, x)
By Lemma 5.14, there is a polynomial-time reduction from CONSISTENT(q′, Σ′) to
CONSISTENT(q, Σ). By Lemmas 5.1 and 5.11, CONSISTENT(q′, Σ′) is coNP-hard. Thus,
CONSISTENT(q, Σ) is coNP-hard.
5.3 Related Work
Chomicki and Marcinkowski [CM05] and Calì, Lembo and Rosati [CLR03a] thoroughly
study the decidability and complexity of consistent query answering for several classes
of queries and integrity constraints. In order to show intractability of a class, they
take the usual approach of exhibiting one query of the class for which the problem is
intractable. To the best of our knowledge, the result that we present in Section 5.2 is the
first dichotomy result in the area of consistent query answering.
Both Chomicki and Marcinkowski and Calì, Lembo and Rosati show that the problem
of obtaining consistent answers for conjunctive queries under primary key constraints is
coNP-complete. Chomicki and Marcinkowski also show an example of a query with just
one literal but two key dependencies for which the problem is coNP-complete. This gives
further support for our decision of considering exactly one key dependency per relation.
Calì, Lembo and Rosati show the undecidability of the problem of obtaining consis-
tent answers when the set of constraints contains primary keys and arbitrary inclusion
dependencies. They also show that the problem becomes decidable for foreign key
constraints (it is coNP-complete). Chomicki and Marcinkowski study the same problem but
under a semantics where only tuple deletion is allowed (i.e., repairs are always subsets of
the inconsistent database). In this case, the problem is Π^p_2-complete, and becomes
coNP-complete if the inclusion dependencies are restricted to be acyclic.
Chapter 6
ConQuer: System Implementation
and SQL Rewritings
In this chapter, we present ConQuer, a system for querying inconsistent databases.
We demonstrated this system at the International Conference on Very Large Databases
(VLDB) [FFM05b]. In Section 6.1, we describe the system implementation and a typical
scenario where it can be used. Then, in Sections 6.2 and 6.3, we present the SQL rewrit-
ings that are at the core of ConQuer’s approach. In Section 6.4, we show how, if desired,
ConQuer can process the database offline in order to improve the performance of the
queries. Finally, in Section 6.5, we review other systems that are related to ConQuer.
6.1 System Implementation
ConQuer is implemented in Java and follows a modular architecture. It consists of the
following components:
• Query Rewriting Module. It rewrites an input SQL query into another SQL
query that computes the consistent answers. The details of the rewritings are
presented in Sections 6.2 to 6.4. The SQL queries are parsed using javacc.
• Query Execution Engine. The rewritten queries are executed using IBM DB2
UDB Version 8.2. The connection with the database is done through JDBC.
• Conflict Resolution Module. Provides a tracing facility to find the data that
leads to differences between the answer to the original query and the consistent
answer. This module also permits a user to update the database to correct errors.
Chapter 6. ConQuer: System Implementation and SQL Rewritings 102
Figure 6.1: Interface for entering hypothetical primary key constraints in ConQuer
• User Interface. Query results are displayed using a Web-accessible interface that
is implemented in PHP.
We illustrate a typical use case of ConQuer on a database with information about
airports. The user first specifies a set of primary key constraints using the interface shown
in Figure 6.1. These are the constraints that should hold on a consistent database, but
may be violated by the actual database that is being queried. Notice that for the same
schema and database, there is the flexibility of running queries under different sets of
potentially violated primary key constraints. Then, the user writes a SQL query within
the interface. In Figure 6.2, we show a query where the user is asking for all the countries
that have airports located north of parallel 63N. The result to the query is shown in Figure
6.3. The consistent answers are shown in bold, and the “potential answers” (i.e., possible
answers that are not consistent answers) are shown in italics. For example, in this case
“Italy” is a potential answer.
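The distinction between consistent and potential answers can be illustrated by brute force over repairs. The following Python fragment only sketches the semantics and is not how ConQuer computes answers (ConQuer rewrites the query into SQL rather than enumerating repairs). The airport codes, the latitudes, and the reading of the query as "latitude ≥ 63" are made-up assumptions.

```python
from itertools import product


def repairs(rows, key):
    """Yield every repair: one row per key value, for all key values."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    for pick in product(*groups.values()):
        yield list(pick)


def classify_answers(rows, key, query):
    """Return (consistent answers, potential-but-not-consistent answers)."""
    results = [set(query(rep)) for rep in repairs(rows, key)]
    consistent = set.intersection(*results)  # answers in every repair
    possible = set.union(*results)           # answers in some repair
    return consistent, possible - consistent


# Hypothetical data: (airport code, country, latitude); the key is the code.
airports = [
    ("AAA", "Norway", 70),
    ("BBB", "Italy", 63),
    ("BBB", "Italy", 45),   # conflicts with the previous tuple
    ("CCC", "Canada", 68),
    ("CCC", "Canada", 69),  # conflicting, but both tuples satisfy the query
]


def northern_countries(rep):
    return {country for _, country, lat in rep if lat >= 63}


consistent, potential = classify_answers(
    airports, key=lambda row: row[0], query=northern_countries)
```

Here "Canada" is a consistent answer even though its airport is involved in a key violation, because every repair still satisfies the query, whereas "Italy" is only a potential answer.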
While consistent answers are best suited for decision making, potential answers can be
used to understand the reasons why a database is inconsistent. In this case, the user could
click on “Italy” and obtain an explanation, which is shown in Figure 6.4. The explanation
is the lineage (or why-provenance) [BKT01, CW03] of the result, i.e., the tuples in the
database that contribute to the answer. According to the explanation, Italy is a potential
answer because it has one airport that appears as satisfying the query (parallel 63) in
Figure 6.2: Interface for entering queries in ConQuer
one tuple, and violating it (parallel 45) in another. Notice that in the comment to the
query, the user wrote “select countries that are located north of Trondheim”. Trondheim
is a Norwegian city, and the user may have background knowledge telling that all Italian
cities are south of Norwegian cities. Thus, the user could use the explanation obtained
from ConQuer in order to remove the tuple for the Italian airport located on parallel 63.
6.2 ConQuer Rewritings for Queries without Aggre-
gation
In this section, we present the SQL rewritings produced by ConQuer for a class of Select-
Project-Join (SPJ) queries with set semantics. We delay the treatment of conjunctive
queries that return duplicates until the next section, where the number of duplicates
returned by the queries can be counted with the count(*) aggregate function. We first
give the query rewriting algorithm, and then we illustrate it with a number of examples.
6.2.1 Rewriting Algorithm
We now present a SQL rewriting algorithm for SPJ queries that are equivalent to a
conjunctive query in the class Cforest, introduced in Definition 3.4, which we repeat next.
Figure 6.3: Query results in ConQuer
Figure 6.4: Query explanation in ConQuer
Definition 3.4. Let q be a conjunctive query without repeated relation symbols and all of
whose nonkey-to-key joins are full. Let G be the join graph of q. We say that q ∈ Cforest
if G is a forest (i.e., every connected component of G is a tree).
The above definition requires three conditions on the conjunctive query. First, that
the query has no repeated relation symbols. For an SPJ SQL query, this means that each
relation can be used at most once in the from clause. Second, that all its nonkey-to-key
joins must be full. For an SPJ query, this means that if an attribute of a key of a relation
r1 is equated in the where clause with a nonkey attribute of another relation r2, then all
the attributes of the key of r1 are equated to nonkey attributes of r2. Finally, the join
graph of q must be a forest. The notion of a join graph is introduced in Definition 3.1,
and we repeat it next.
Definition 3.1 (join graph). Let q be a conjunctive query. The join graph G of q is a
directed graph such that:
• the vertices of G are the literals of q;
• there is an arc from literal Ri to literal Rj if i ≠ j, and there is some variable w
such that w is existentially-quantified in q, w occurs at the position of a nonkey
attribute in Ri, and w occurs in Rj.
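Definition 3.1 translates directly into code. In the following Python sketch, the representation of a literal as a (name, key variables, nonkey variables) triple is an assumption made for the example, and relation names are used to identify vertices, which presupposes that the query has no repeated relation symbols.

```python
def join_graph(literals, existential):
    """Return the set of arcs (Ri, Rj) of the join graph of a query.

    literals: list of (name, key_vars, nonkey_vars) triples.
    existential: set of the existentially-quantified variables of the query.
    """
    arcs = set()
    for i, (ri, _, nonkey_i) in enumerate(literals):
        for j, (rj, key_j, nonkey_j) in enumerate(literals):
            if i == j:
                continue
            vars_j = set(key_j) | set(nonkey_j)
            # Arc Ri -> Rj: an existential variable at a nonkey position
            # of Ri also occurs (at any position) in Rj.
            if any(w in existential and w in vars_j for w in nonkey_i):
                arcs.add((ri, rj))
    return arcs


# q = ∃x, y. T1(x, y) ∧ T2(y, x): a cycle of nonkey-to-key joins.
cycle = join_graph(
    [("T1", ("x",), ("y",)), ("T2", ("y",), ("x",))], {"x", "y"})

# q = ∃x, y, z. R1(x, y) ∧ R2(y, z): a single nonkey-to-key join.
chain = join_graph(
    [("R1", ("x",), ("y",)), ("R2", ("y",), ("z",))], {"x", "y", "z"})
```

Note that a nonkey-to-nonkey join, as in qnk = ∃x, x′, y.S1(x, y) ∧ S2(x′, y), produces arcs in both directions and therefore a cycle.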
An analogous definition can be given for the join graph of an SPJ SQL query. The
vertices of the graph will be the relation symbols in the from clause of the query. Fur-
thermore, there will be an arc from relation ri to relation rj if there is an attribute A
in ri such that (1) A is not in the key of ri (it is a nonkey attribute), (2) A does not
appear in the select clause of the query, and A is not equated to any attribute B such
that B appears in the select clause of the query (this corresponds to the notion of
an existentially-quantified variable for conjunctive queries); and (3) there is some equal-
ity in the where clause relating A to some attribute B of rj (i.e., a nonkey-to-key or
nonkey-to-nonkey join).1
We can now give a definition analogous to Cforest for SPJ SQL queries. A query q is
in class Csqlforest if no relation appears twice in the from clause of q, all the nonkey-to-key
joins of q are full, and the join graph of q is a forest.
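The forest condition on the join graph can be tested as sketched below. The Python fragment assumes that "forest" is read in the directed sense (every vertex has at most one incoming arc and there is no directed cycle); it checks only this third condition and not the other two (no repeated relations, full nonkey-to-key joins). The function names are hypothetical.

```python
def is_forest(vertices, arcs):
    """Check that a directed graph is a forest of rooted trees.

    Assumed reading: at most one incoming arc per vertex, no directed cycle.
    """
    parents = {v: [] for v in vertices}
    for src, dst in set(arcs):
        parents[dst].append(src)
    if any(len(p) > 1 for p in parents.values()):
        return False
    for v in vertices:
        seen = set()
        cur = v
        while parents[cur]:          # walk up towards the root
            if cur in seen:
                return False         # came back to a visited vertex: cycle
            seen.add(cur)
            cur = parents[cur][0]
    return True


def roots(vertices, arcs):
    """Relations at the root of some tree: vertices with no incoming arc."""
    targets = {dst for _, dst in arcs}
    return [v for v in vertices if v not in targets]
```

The roots computed this way are the relations whose keys provide the key-root attributes used by the rewriting.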
1 This definition works for repeated relation symbols as well. In such case, we assume that if a relation appears more than once in the from clause, then it is aliased to a new name using the as operator.
We are now ready to give ConQuer’s rewriting algorithm for SPJ queries in Csqlforest.
The algorithm is called RewriteForestSQL and is shown in Figure 6.5. The algorithm
takes as input a SQL query q in Csqlforest and a set of key constraints (one per relation of
the schema), and returns a SQL rewriting Q of q.
In the rewriting Q, the attributes of the relations in q play different roles. In par-
ticular, we will distinguish the attributes that the query projects on (i.e., that appear
in the select clause), and the attributes that appear in the key of a relation that is
at the root of some tree in the join graph of q. In the rest of the discussion, we will
call these attributes projecting attributes, and key-root attributes, respectively. The for-
mer are denoted in Figure 6.5 with the symbols S1, . . . , Sl; the latter are denoted with
K1, . . . , Kn.
The rewriting Q has three subqueries, specified using a with clause:
candidatesSubQuery, countViolSubQuery and countProjSubQuery. The purpose of
candidatesSubQuery is to prune the number of values for the key-root attributes that should
be considered by the other subqueries. In particular, candidatesSubQuery applies the
selection conditions of the original query q, and projects on its key-root attributes. These
attributes are used to perform an inner join in the next subquery (countViolSubQuery).
If the selectivity of q is low (i.e., few tuples satisfy its conditions), and the query optimizer
pushes down the selection conditions of candidatesSubQuery in the query plan, we would
expect the rewriting to have a low overhead with respect to the original query. We validate
this conjecture in Section 7.2.
Let CONDS be the list of conditions in the where clause of q. In the select clause
of countViolSubQuery, we count the number of tuples that violate the conditions of
CONDS, grouped by the key-root attributes, and keep the result in an attribute called
countViol as follows:
sum(case when CONDS then 0 else 1 end)
over (partition by K1, . . . , Kn)
as countViol
Notice the use of the partition by clause. This clause (introduced in the OLAP
Amendment to SQL [ISO01]) differs from the typical group by clause in that it permits
grouping by a set of attributes that may not include all the attributes in the select
clause. This is useful here because we “partition by” the key-root attributes, but the
select clause of countViolSubQuery also includes the projecting attributes of the query.
In the main body of the query, we filter out the tuples whose key-root attributes are
involved in a violation of CONDS by checking the condition countViol=0.
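The effect of the countViol computation and of the countViol=0 filter can be mimicked outside SQL. The Python sketch below is only an illustration over a single relation: the function name, the row layout and the sample data are assumptions, and the actual rewriting performs this computation in SQL over the joined relations.

```python
from collections import defaultdict


def keys_without_violations(rows, key, conds):
    """Return the key values whose tuples all satisfy conds.

    Mirrors sum(case when CONDS then 0 else 1 end) over (partition by key),
    followed by the filter countViol = 0.
    """
    count_viol = defaultdict(int)
    for row in rows:
        count_viol[key(row)] += 0 if conds(row) else 1
    return {k for k, n in count_viol.items() if n == 0}


# Hypothetical rows: (airport code, country, latitude); the key is the code.
rows = [
    ("AAA", "Norway", 70),
    ("BBB", "Italy", 63),
    ("BBB", "Italy", 45),  # violates the condition, so key BBB is dropped
]
kept = keys_without_violations(
    rows, key=lambda r: r[0], conds=lambda r: r[2] >= 63)
```

A key value survives only if none of its tuples violates the conditions, which is why a single conflicting tuple is enough to exclude "BBB".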
The from clause of subquery countViolSubQuery is obtained by calling a procedure
called GetJoinsExpression (shown in Figure 6.6), with the join graph of q and the list
of conditions CONDS as parameters. This procedure consists of two parts. In the first
part, an inner join is computed for the key-to-key joins of relations that are at the root
of some tree of the join graph. Notice that since these relations are in distinct connected
components of the join graph, they are not related by a nonkey-to-key join. In the second
part, the procedure produces a left outer join expression for each tree of the join graph.
This is done by recursively calling the procedure GetTreeJoinsExpression for the nodes
of each tree (also shown in Figure 6.6). The expression returned by
GetTreeJoinsExpression is a left outer join of all relations in the input tree, listed in an
order corresponding to a preorder traversal of the tree.
We will illustrate shortly (in Example 6.4) the rewriting for queries where some of
the key-root attributes do not appear in the select clause (that is, some key-root
attributes are not projecting attributes). We will argue that in such cases, we would
like to count the number of distinct values for the projecting attributes, grouping by
the key-root attributes. We will also show how to do this by using the max aggregate
function (with a partition by clause) and the rank OLAP function. In the algorithm
RewriteForestSQL of Figure 6.5, the rank function is used in the subquery
countViolSubQuery, and the max function is used in the subquery countProjSubQuery.
The result of this aggregation is kept in an attribute called countProjection, which
keeps the count of distinct values for each instantiation of the key-root attributes. This
attribute is used in the main body of the rewriting, where we check countProjection=1.
In the subqueries, we project not only on the projecting attributes S1, . . . , Sl, but
also on the root-key attributes K1, . . . , Kn. However, in the main query of the rewriting
we project only on the attributes S1, . . . , Sl. In this way, the rewritten query Q and the
input query q return tuples for the same set of attributes.
For the sake of clarity, we omitted the order by clause in the query q. However,
dealing with ordering in the rewriting is quite easy. We just need to add the attributes
of the order by of q to the select clause of the subqueries, and include them in the
order by clause of the main body of the rewriting.
Algorithm RewriteForestSQL(q, Σ)
Input: q, a SQL query in Csqlforest of the form
           select <list of attributes>
           from <list of relations>
           where <list of conditions>
       Σ, a set of key constraints (one per relation)
Output: Q, a SQL query that computes consistentΣ(q, I), for every database I

Let S1, . . . , Sl be the attributes in the select clause of q
Let G be the join graph (forest) of q
Let r1, . . . , rm be the relations at the root of all trees of G
Let K1, . . . , Kn be the attributes in the keys of r1, . . . , rm
Let CONDS be the list of conditions in the where clause of q
Let JOINS be the expression obtained by calling the procedure
    GetJoinsExpression(G, CONDS) of Figure 6.6
Let Q be the following SQL query:
    with candidatesSubQuery as (
        select K1 as cK1, . . . , Kn as cKn
        from <list of relations in q>
        where CONDS ),
    countViolSubQuery as (
        select K1, . . . , Kn, S1, . . . , Sl,
               rank() over (partition by K1, . . . , Kn
                            order by S1, . . . , Sl) as rankProjection,
               sum(case when CONDS then 0 else 1 end)
                   over (partition by K1, . . . , Kn) as countViol
        from JOINS
        where exists (select * from candidatesSubQuery
                      where K1 = cK1 and . . . and Kn = cKn) ),
    countProjSubQuery as (
        select K1, . . . , Kn, S1, . . . , Sl,
               max(rankProjection) over (partition by K1, . . . , Kn)
                   as countProjection,
               countViol
        from countViolSubQuery )
    select distinct S1, . . . , Sl
    from countProjSubQuery
    where countProjection = 1 and countViol = 0
return Q
Figure 6.5: SQL query rewriting algorithm for SPJ queries in Csqlforest
6.2.2 Examples
We now present some examples to illustrate the use of the RewriteForestSQL algorithm.
In the examples, we first show the first-order rewriting that we obtain with the algorithms
of Chapter 3, and then we present the actual SQL query produced by ConQuer.
Selection
In the next example, we illustrate ConQuer’s SQL rewritings with a simple query that
has one selection condition.
Example 6.1. Let R be a schema with our standard employee(emplKey, salary)
relation. Consider a SQL query q1 that retrieves the names of all employees
whose salary is less than or equal to 1000.
q1: select distinct emplKey
from employee
where salary <= 1000
Using the notation for conjunctive queries, q1 can be written as follows:
q1(e) = ∃s.employee(e, s) ∧ s ≤ 1000
A first-order query rewriting that computes the consistent answers to q1 can be
obtained with the algorithms of Chapter 3. In particular, the rewriting returned by
RewriteForest(q1, Σ) is the following:
Q1(e) = ∃s.employee(e, s) ∧ s ≤ 1000 ∧ ∀s′.(employee(e, s′) → s′ ≤ 1000)
Notice that the first and second conjuncts of the first-order rewriting Q1 actually
correspond to the original query q1. Thus, the rewriting starts with a subquery called
candidatesSubQuery that retrieves the employee names that satisfy q1 (and are thus
candidates to be consistent answers).
Algorithm GetJoinsExpression(G, CONDS)
Input: G, a join graph that forms a forest
       CONDS, a list of conditions of the form x θ y,
           where θ is some binary comparison operator such as =, ≠, <, etc.
Output: a subexpression of a SQL query

Let r1, . . . , rm be the relations at the root of all trees of G
Initialize RJOINS as the string “r1”
for i := 2 to m do
    Let IJOINS be the conjunction of all join conditions (i.e., equalities) between
        attributes of ri−1 and ri
    Concatenate “join ri on IJOINS” to RJOINS
end for
Initialize TJOINS as an empty expression
Let T1, . . . , Tm be the trees of G rooted at r1, . . . , rm
for i := 1 to m do
    Concatenate the expression returned by GetTreeJoinsExpression(Ti, CONDS) to TJOINS
end for
return “RJOINS and TJOINS”

Algorithm GetTreeJoinsExpression(T, CONDS)
Input: T, a join graph that forms a tree
       CONDS, a list of conditions of the form x θ y,
           where θ is some binary comparison operator such as =, ≠, <, etc.
Output: a subexpression of a SQL query

Initialize LOJOINS as an empty string
if T consists of more than one node r then
    Let r1, . . . , rm be the relations whose root is a child of r
    for i := 1 to m do
        Let IJOINS be the conjunction of all join conditions (i.e., equalities) between
            attributes of r and ri
        Concatenate “left outer join ri on IJOINS” to LOJOINS
    end for
    for i := 1 to m do
        Let Ti be the subtree of T rooted at ri
        Concatenate the expression returned by GetTreeJoinsExpression(Ti, CONDS) to LOJOINS
    end for
end if
return LOJOINS
Figure 6.6: Procedures to obtain an expression for the joins of a query
with candidatesSubQuery as (
select emplKey
from employee
where salary <= 1000)
Since emplKey is a key of the relation employee, in the repairs, each employee name
will be associated with exactly one salary. However, in the inconsistent database, an
employee name may appear with several different salaries. Thus, the rewriting must
ensure that the employee names in the consistent answers are associated with salaries
satisfying the selection condition of the input query q1 (i.e., that the salary is less than or
equal to 1000) in every tuple of the inconsistent relation employee where the employee
name appears. This is done in Q1 with the expression ∀s′.(employee(e, s′) → s′ ≤ 1000).
It is straightforward to translate this expression into SQL using nested queries and the
not exists construct. However, from our empirical observations in the context of DB2,
we have noticed that such constructs lead in many cases to inefficient queries. Thus,
for the sake of efficiency, the rewritings produced by ConQuer avoid the not exists
construct. One way of doing this is to count, for each employee, the number of salaries
in the inconsistent database that violate the selection condition of q1. If there are no
violations (i.e., the number of salaries violating the condition for the employee is zero),
then the employee name satisfies the selection condition in every tuple of the inconsistent
relation. This can be achieved with the following subquery.
with countViolSubQuery as (
select emplKey,
sum(case
when salary <= 1000 then 0 else 1 end) as countViol
from employee
where exists (select *
from candidatesSubQuery C where
C.emplKey=employee.emplKey)
group by emplKey)
In the above subquery, we count the number of violations for each employee. We keep
this count in an attribute called countViol. The final result of the query consists of the
employee names for which there are no violations (countViol = 0). In the subquery,
for each tuple of employee, we compute a case statement. If the salary in the tuple
is less than or equal to 1000 (i.e., it satisfies the selection condition of q1) we output
a value of zero (meaning no violation). Otherwise, we output 1 (meaning a violating
tuple). The query aggregates these values, summing them up for each employee name.
If the sum for an employee name is zero, that means that there are no violating tuples
involving that employee name. Otherwise, we get the number of violating tuples (hence
the name, countViol). In the main body of the query (which we give below), we return
all employee names that are not involved in any violation.
select emplKey
from countViolSubQuery
where countViol = 0
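The three pieces of the rewriting of q1 can be assembled and tried on a small inconsistent instance. The sketch below is only an illustration: it runs on SQLite through Python's sqlite3 module rather than on DB2, and the sample tuples (including the employee Ann) and the variable names are assumptions made for the example. John has one salary that satisfies the condition and one that violates it, so only Mary should be returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No key is declared on the table, so emplKey may be violated
# (John appears with two different salaries).
conn.executescript("""
    create table employee (emplKey text, salary integer);
    insert into employee values ('John', 1000), ('John', 2000),
                                ('Mary', 1000), ('Ann', 3000);
""")

q1_rewriting = """
with candidatesSubQuery as (
    select emplKey
    from employee
    where salary <= 1000),
countViolSubQuery as (
    select emplKey,
           sum(case when salary <= 1000 then 0 else 1 end) as countViol
    from employee
    where exists (select *
                  from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey)
select emplKey
from countViolSubQuery
where countViol = 0
"""

consistent_q1 = conn.execute(q1_rewriting).fetchall()
print(consistent_q1)  # John is a candidate but has one violating tuple
```

Ann never becomes a candidate, and John is discarded because his countViol is 1.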
Join
We now present two examples to illustrate the rewriting of queries that contain join
conditions. In the first example, we show the rewriting for a query that has one join
condition. In the second example, we show the rewriting for a query with a more complex
join graph.
Example 6.2. Let R be a schema with relations employee(emplKey, deptFKey), and
dept(deptKey,mgrName). Consider a SQL query q2 that retrieves the names of all
employees whose department appears in the dept relation:
q2: select distinct emplKey
from employee,dept
where employee.deptFKey= dept.deptKey
Notice that q2 has an inner join specified with the condition employee.deptFKey=
dept.deptKey of its where clause. In conjunctive query notation, q2 can be written as
follows.
q2(e) = ∃d,m.employee(e, d) ∧ dept(d,m)
It can be easily checked that q2 is in the class Cforest of conjunctive queries. The
first-order query rewriting obtained by applying the algorithm RewriteForest(q2, Σ) is
the following:
Q2(e) = ∃d,m.employee(e, d) ∧ dept(d,m) ∧ ∀d.(employee(e, d) → ∃m.dept(d,m))
We could translate Q2 to SQL using a not exists construct to achieve the effect of
the universal quantifier. Although this may be a reasonable strategy for a simple query
like q2, we will show in the next example that it leads to deeply nested rewritings when
the original queries have several joins.
We now illustrate how to avoid the not exists construct in the rewritings. As in
the previous example, we can count, for each employee, the number of tuples violating
the conditions of the input query (in this case, the join condition). In order to detect
violations of the join condition employee.deptFKey=dept.deptKey, we need to check
whether there is a tuple in the employee relation whose department is not in the dept
relation. This can be achieved by performing a left outer join between the relations as
follows:
with candidatesSubQuery as (
select emplKey
from employee,dept
where employee.deptFKey= dept.deptKey ),
countViolSubQuery as (
select emplKey,
sum(case
when employee.deptFKey=dept.deptKey then 0 else 1 end)
as countViol
from employee left outer join dept
on employee.deptFKey=dept.deptKey
where exists (select *
from candidatesSubQuery C where
C.emplKey=employee.emplKey)
group by emplKey )
select emplKey
from countViolSubQuery
where countViol = 0
Notice that there is a subquery called countViolSubQuery, specified using a with
clause. In this subquery, we count the number of violations for each employee. We keep
this count in an attribute called countViol. The final result of the query consists of the
employee names for which there are no violations (countViol = 0). In the computa-
tion of countViol, we use a case statement. If there is a join with some tuple of the
dept relation, we output a value of zero (meaning no violation). Otherwise, we output 1
(meaning a violating tuple). Notice that we can detect the violations of the (inner) join
of the input query q2 because we are performing a left-outer join in the rewritten query
Q2. Had we performed an inner join in Q2, the tuples that do not join on the department
would have never been “seen” by the case statement.
As in the previous example, the query aggregates the values for countViol, summing
them up for each employee name. If the sum for an employee name is zero, there are no
violating tuples involving that employee name. Otherwise, we get the number of violating
tuples.
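The left-outer-join rewriting of q2 can be checked on a small instance. This is a hedged sketch on SQLite (via Python's sqlite3 module), not on DB2; the tuples and variable names are invented for the illustration. Employee e1 has two tuples, one of which refers to a department d9 that does not exist in dept, so e1 joins in some repairs but not in all of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (emplKey text, deptFKey text);
    create table dept (deptKey text, mgrName text);
    -- e1 violates the key of employee; d9 has no tuple in dept.
    insert into employee values ('e1', 'd1'), ('e1', 'd9'), ('e2', 'd1');
    insert into dept values ('d1', 'Peter');
""")

q2_rewriting = """
with candidatesSubQuery as (
    select emplKey
    from employee, dept
    where employee.deptFKey = dept.deptKey),
countViolSubQuery as (
    select emplKey,
           sum(case when employee.deptFKey = dept.deptKey
                    then 0 else 1 end) as countViol
    from employee left outer join dept
         on employee.deptFKey = dept.deptKey
    where exists (select *
                  from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey)
select emplKey
from countViolSubQuery
where countViol = 0
"""

consistent_q2 = conn.execute(q2_rewriting).fetchall()
print(consistent_q2)
```

The tuple (e1, d9) survives the left outer join with a null deptKey, so the case expression falls into its else branch and counts one violation for e1.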
We just illustrated how we can avoid the use of not exists in the SQL rewritings
by performing a left outer join. In the next example, we show why we adopt this strategy
in ConQuer: a naive translation may lead to a deeply nested query, where the level of
nesting may be as large as the number of relations in the from clause of the query.
Example 6.3. Let R be a schema with relations employee(emplKey, cityFKey, deptFKey),
dept(deptKey,mgrName), city(cityKey, provFKey), and prov(provKey, countryName).
Consider a SQL query q3 that retrieves the names of all employees that are located in
Canada and whose manager is Peter:
q3: select distinct emplKey
from employee, city, prov, dept
where employee.cityFKey=city.cityKey
and city.provFKey=prov.provKey
and employee.deptFKey=dept.deptKey
and prov.countryName='Canada'
and dept.mgrName='Peter'
In conjunctive query notation, q3 can be written as follows.
q3(e) = ∃d, c, m, p.employee(e, d, c) ∧ city(c, p) ∧ prov(p, Canada) ∧ dept(d, Peter)
Figure 6.7: Join graph of query q3.
It can be checked that q3 is in class Cforest. In particular, notice that the join graph of q3
(given in Figure 6.7) is a tree. As shown in Chapter 3, a first-order rewriting of q3 can
be obtained by recursively traversing its join graph. The first-order query rewriting Q3
obtained by applying RewriteForest(q3, Σ) is the following:
Q3(e) = ∃d, c, m, p.employee(e, d, c) ∧ dept(d, m) ∧ city(c, p) ∧ prov(p, Canada) ∧ Q′(e)
where:
Q′(e) = ∃d, c.employee(e, d, c) ∧ ∀d, c.(employee(e, d, c) → (Q′′(c) ∧ QIV(d)))
Q′′(c) = ∃p.city(c, p) ∧ ∀p.(city(c, p) → Q′′′(p))
Q′′′(p) = prov(p, Canada) ∧ ∀w′.(prov(p, w′) → w′ = Canada)
QIV(d) = dept(d, Peter) ∧ ∀u′.(dept(d, u′) → u′ = Peter)
The universal quantifiers can be translated to SQL using the not exists construct.
However, this may lead to an inefficient query. First, because it would have four self
joins (since each relation appears twice in the rewriting). Second, because each recursive
invocation of the algorithm produces a new universal quantifier, and a new subquery
within its scope. For example, Q′′ is under the scope of a universal quantifier for variable
d in Q′, and Q′′′ is under the scope of another universal quantifier (for variable p) in Q′′.
As a consequence, the level of nesting of the SQL rewriting Q3 would be three, which
corresponds to the height of the join graph.
As we showed in the previous example, in ConQuer we avoid using the not exists
construct by performing a left-outer join of the relations in each tree of the join graph.
The SQL rewriting produced by ConQuer in this case is the following:
with candidatesSubQuery as (
select emplKey
from employee,city, prov,dept
where employee.cityFKey=city.cityKey
and city.provFKey=prov.provKey
and employee.deptFKey=dept.deptKey
and prov.countryName='Canada'
and dept.mgrName='Peter' ),
countViolSubQuery as (
select emplKey,
sum(case
when employee.cityFKey=city.cityKey
and city.provFKey=prov.provKey
and employee.deptFKey=dept.deptKey
and prov.countryName='Canada'
and dept.mgrName='Peter'
then 0 else 1 end) as countViol
from employee left outer join dept on employee.deptFKey=dept.deptKey
left outer join city on employee.cityFKey=city.cityKey
left outer join prov on city.provFKey=prov.provKey
where exists (select *
from candidatesSubQuery C where
C.emplKey=employee.emplKey)
group by emplKey )
select emplKey
from countViolSubQuery
where countViol = 0
It is important to note that the SQL rewriting has only two subqueries, even though
q3 has four relations, and a join graph with a tree of depth three.
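To see the flat, two-subquery rewriting of q3 at work, the sketch below runs it on a small instance. It is an illustration under stated assumptions: SQLite (through Python's sqlite3 module) stands in for DB2, string constants are written with single quotes, and all tuples are invented. Province p2 is stored with two countries, and employee e3 is stored with two departments, so only e1 should be a consistent answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (emplKey text, cityFKey text, deptFKey text);
    create table dept (deptKey text, mgrName text);
    create table city (cityKey text, provFKey text);
    create table prov (provKey text, countryName text);
    insert into employee values ('e1', 'c1', 'd1'), ('e2', 'c2', 'd1'),
                                ('e3', 'c1', 'd1'), ('e3', 'c1', 'd2');
    insert into dept values ('d1', 'Peter'), ('d2', 'Mary');
    insert into city values ('c1', 'p1'), ('c2', 'p2');
    -- p2 violates the key of prov: it appears with two countries.
    insert into prov values ('p1', 'Canada'), ('p2', 'Canada'), ('p2', 'USA');
""")

CONDS = """employee.cityFKey = city.cityKey
       and city.provFKey = prov.provKey
       and employee.deptFKey = dept.deptKey
       and prov.countryName = 'Canada'
       and dept.mgrName = 'Peter'"""

q3_rewriting = f"""
with candidatesSubQuery as (
    select emplKey
    from employee, city, prov, dept
    where {CONDS}),
countViolSubQuery as (
    select emplKey,
           sum(case when {CONDS} then 0 else 1 end) as countViol
    from employee
         left outer join dept on employee.deptFKey = dept.deptKey
         left outer join city on employee.cityFKey = city.cityKey
         left outer join prov on city.provFKey = prov.provKey
    where exists (select *
                  from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey)
select emplKey
from countViolSubQuery
where countViol = 0
"""

consistent_q3 = conn.execute(q3_rewriting).fetchall()
print(consistent_q3)
```

Employee e2 is rejected because of the conflicting tuples for p2, and e3 is rejected because one of its own tuples points to a department managed by Mary.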
Projection and the Need for OLAP Functions
In Example 6.1, we dealt with a query that projects on the key attribute of the relation
employee. If a query does not project on the key attribute, then special care must be
taken in the rewriting. We illustrate this with the next example.
Example 6.4. Let R be a schema with our standard employee(emplKey, salary) rela-
tion. Let q4 be a query that retrieves all salaries (regardless of the employee name).
q4: select distinct salary
from employee
Comparing q4 to q1, the former query does not project on the key attribute emplKey,
and it has no where clause. In conjunctive query notation, q4 can be written as follows.
q4(s) = ∃e.employee(e, s)
The first-order query rewriting obtained by invoking RewriteForest(q4, Σ) is the
following.
Q4(s) = ∃e.employee(e, s) ∧ ∀s′.(employee(e, s′) → s′ = s)
Again, we would like to avoid the naive (but inefficient) translation of Q4 into SQL
that uses the not exists construct. Intuitively, Q4 returns the salaries s for which there
is at least one employee name that is associated to s and only to s in the tuples of the
inconsistent relation employee. In this way, we ensure that salary s will appear in every
repair. One way of writing Q4 in SQL is the following:
select salary
from employee
where emplKey in (
select emplKey
from employee
group by emplKey
having count(distinct salary)=1)
In our empirical observations, the self join of the above query sometimes leads to
inefficient queries. The self join is needed because we are not including the salary
attribute in the select clause of the subquery. This is not an arbitrary decision. Rather,
it is forced by the syntax of SQL. In SQL, all the non-aggregated attributes of the select clause must
appear in the group by clause. If we include salary in the select clause of the
subquery, we must also group by it, and hence we are unable to count the number of
distinct salaries per employee name. We will show shortly how we overcome this problem
in ConQuer’s rewritings.
We just argued that there are some query rewritings for which there is no obvious way
of avoiding self joins, and that this is caused by the syntax of the group by clause. This
problem was addressed in the OLAP Amendment to the SQL standard [ISO01], which
introduces aggregate functions with a partition by clause. The OLAP Amendment to
the standard has been implemented by all major database vendors. In particular, for
DB2, the standard has been supported since Version 7 (we are using Version 8.2).
The partition by clause is more flexible than group by for two reasons. First, there
can be one partition by clause for each aggregate function, whereas there can only be
one group by for the entire query. Second, unlike group by, the attributes of the select
clause are not required to appear in the partition by clauses of the query. We illustrate
the use of the partition by clause with the next example.
Example 6.5. Consider the following SQL query:
select emplKey,salary,
sum(salary) over (partition by emplKey)
as countProjection
from employee
The query returns triples of values. The first two values of each triple correspond to
employee names and salaries in the relation employee. The last attribute is the sum of
the salaries for the employee name in the tuple. Notice that the attribute emplKey is
in the partition by clause, but the salary attribute is not. So we are projecting on
two attributes (emplKey and salary), but considering only one of them for grouping the
results of the aggregate function. This cannot be done with a group by clause.
Let us finish this example by showing an application of the query to an actual
database. Consider the database I = {employee(John, 1000), employee(John, 2000),
employee(Mary, 1000)}. The result of applying the SQL query above to I is the following:
{(John, 1000, 3000), (John, 2000, 3000), (Mary, 1000, 1000)}.
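The result above can be reproduced with a short script. This is a sketch that assumes SQLite 3.25 or later (the first version with window functions) accessed through Python's sqlite3 module, not DB2; the order by clause is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (emplKey text, salary integer);
    insert into employee values ('John', 1000), ('John', 2000), ('Mary', 1000);
""")

# Same query as in Example 6.5; the ordering is an addition for this sketch.
rows = conn.execute("""
    select emplKey, salary,
           sum(salary) over (partition by emplKey) as countProjection
    from employee
    order by emplKey, salary
""").fetchall()

for row in rows:
    print(row)
```

Each input tuple is preserved in the output, which is the difference from a group by on emplKey: that would collapse John's two tuples into one.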
In the next example, we show how the partition by clause could be used in order
to avoid self joins in the rewritings.
Example 6.4. (continued) Recall that we had obtained a rewriting of query q4 that
performs a self join on the employee relation. We can write an equivalent query without
a self join by taking advantage of the partition by clause.
with countProjSubQuery as (
select emplKey,
salary,
count(distinct salary) over (partition by emplKey) as countProj
from employee )
select salary
from countProjSubQuery
where countProj = 1
In the subquery countProjSubQuery, we obtain the number of distinct salaries for
each employee name (which we keep in an attribute called countProj). The rewriting then
returns the salaries of employees for which there is exactly one salary in the database
(countProj = 1).
The query rewriting that we just obtained avoids the use of a self join by using the
partition by clause. Unfortunately, though, this is not the end of the story. The
version of DB2 that we use in ConQuer currently supports the partition by clause for
a variety of aggregate functions (such as sum, min, max, count(*), and avg), but it does
not support the count(distinct) function. Nevertheless, the effect of count(distinct)
can be obtained by combining the use of the max aggregation function (with a partition
by clause) and an OLAP function called rank() as follows.
with rankProjSubQuery as (
select emplKey, salary,
rank() over (partition by emplKey order by salary)
as rankProjection
from employee ),
countProjSubQuery as (
select emplKey, salary,
max(rankProjection) over (partition by emplKey)
as countProjection
from rankProjSubQuery )
select distinct salary
from countProjSubQuery
where countProjection = 1
First, let us explain the use of the rank() function. The syntax of rank() is the
following:
rank() over
(partition by <partition attributes> order by <order attributes>)
The function creates groups for each tuple of values (instantiation) of the attributes
in the partition by clause, as we discussed before for other functions. The tuples of
each group are ordered according to the attributes in the order by clause, and assigned
a number according to their position in this ordering. If there is a tie (in our example,
two tuples with the same employee name and salary), the tuples are mapped to the same
number.
Let us illustrate the semantics of the rank() function in the context of our example
rewriting. Consider a database I = {employee(John, 1000), employee(John, 2000),
employee(Mary, 1000)}. Then, the function rank() over (partition by emplKey
order by salary) would map (John, 1000) to 1, (John, 2000) to 2, and (Mary, 1000)
to 1.
Now consider the instance I as an inconsistent database with respect to Σ (which
contains a constraint stating that emplKey is the key of the employee relation). In
the subquery rankProjSubQuery of the rewritten query, we compute the ranking func-
tion for each tuple and keep the value in an attribute called rankProjection. Then,
in the subquery countProjSubQuery, we obtain the maximum value of the attribute
rankProjection for each employee name, and keep it in an attribute called countProjection.
Notice that the “grouping” is done by employee names since the attribute
emplKey is in the partition by clause of the max aggregate function. In our example,
we would obtain {(John, 1000, 2), (John, 2000, 2),(Mary, 1000, 1)}. In the final result,
we would like to get salary 1000 because it appears associated with Mary in every re-
pair, but not 2000 because it does not appear in all repairs. We obtain this in the query
rewriting by checking the condition countProjection=1.
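The rank/max rewriting of q4 can be compared against the self-join formulation on the same instance. The sketch below assumes SQLite 3.25 or later through Python's sqlite3 module in place of DB2; the helper run, the variable names, and the second instance with duplicate tuples are assumptions added for the illustration. A distinct is added to the self-join query so that both queries return sets.

```python
import sqlite3

RANK_CTE = """
with rankProjSubQuery as (
    select emplKey, salary,
           rank() over (partition by emplKey order by salary) as rankProjection
    from employee)"""

# Intermediate result of countProjSubQuery, ordered for readability.
count_proj_sql = RANK_CTE + """
select emplKey, salary,
       max(rankProjection) over (partition by emplKey) as countProjection
from rankProjSubQuery
order by emplKey, salary"""

rank_max_rewriting = RANK_CTE + """,
countProjSubQuery as (
    select emplKey, salary,
           max(rankProjection) over (partition by emplKey) as countProjection
    from rankProjSubQuery)
select distinct salary
from countProjSubQuery
where countProjection = 1"""

self_join_rewriting = """
select distinct salary
from employee
where emplKey in (select emplKey
                  from employee
                  group by emplKey
                  having count(distinct salary) = 1)"""


def run(tuples, sql):
    """Load the tuples into a fresh in-memory employee table and run sql."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table employee (emplKey text, salary integer)")
    conn.executemany("insert into employee values (?, ?)", tuples)
    return conn.execute(sql).fetchall()


thesis_instance = [("John", 1000), ("John", 2000), ("Mary", 1000)]
# Hypothetical bag instance with exact duplicates, to exercise ties in rank().
with_ties = [("John", 1000), ("John", 1000), ("John", 2000),
             ("Mary", 1000), ("Mary", 1000)]

count_proj = run(thesis_instance, count_proj_sql)
answers = run(thesis_instance, rank_max_rewriting)
answers_self_join = run(thesis_instance, self_join_rewriting)
ties_count_proj = run(with_ties, count_proj_sql)
ties_answers = run(with_ties, rank_max_rewriting)

print(count_proj)
print(answers, answers_self_join)
print(ties_count_proj)
```

With duplicates, rank() leaves gaps (John's tuples get ranks 1, 1, 3), so the maximum rank is in general not the number of distinct salaries. It is equal to 1 exactly when an employee has a single distinct salary, which is all that the condition countProjection = 1 needs.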
6.3 ConQuer Rewritings for SPJ Queries with Ag-
gregation
In this section, we present the SQL query rewritings produced by ConQuer for queries
with grouping and aggregation. We first present the algorithm and then illustrate it with
some examples.
6.3.1 Rewriting algorithm
We now present the SQL rewriting algorithm for SPJ queries with aggregation that are
equivalent to the aggregate conjunctive queries in class Caggforest, introduced in Definition
4.1, which we repeat next.
Definition 4.1. Let q be an aggregate conjunctive query. We say that q is in class
Caggforest if q is of the form
select z⃗, [count(*) | F(u)]
from q∗(z⃗, u)
group by z⃗
where q∗ is a conjunctive query in Cforest, and F is one of the aggregation functions
min, max, or sum.
We can now give a definition analogous to Caggforest for SPJ SQL queries with aggre-
gation.
Definition 6.1. We say that query q is in class Csqlaggforest if q is of the form
select S1, . . . , Sl,[count(*)],F1(A1), . . . , Fu(Au)
from <list of relations>
where <list of conditions>
group by S1, . . . , Sl
where S1, . . . , Sl, A1, . . . , Au are attributes of the relations in the from clause, and
F1, . . . , Fu may be any of the aggregation functions min, max, and sum.
We are now ready to give ConQuer’s rewriting for queries in Csqlaggforest. The algorithm
is called RewriteAggSQL, and is shown in Figure 6.8. It takes as input a SQL query q in
class Csqlaggforest and a set of key constraints (one per relation of the schema), and returns
a SQL rewriting Q of q.
In the rewriting Q, the attributes of the relations in q play different roles. As in
the algorithm RewriteForestSQL for queries without aggregation, we have projecting
and key-root attributes. The former are the attributes that q projects on (i.e., that
appear in its select clause), and the latter are the attributes that appear in the key
of a relation that is at the root of some tree in the join graph of q. In addition, in
RewriteAggSQL, we have aggregation attributes, that is, the attributes that appear as
arguments of some aggregation function of q. In Figure 6.8, we denote the projecting
attributes with the symbols S1, . . . , Sl; the key-root attributes with K1, . . . , Kn; and the
aggregation attributes with A1, . . . , Au.
We denote the aggregation functions of q with F1, . . . , Fu. In the figure, we assume
that the 0-ary function count(*) is present in the query (but during the explanation it
will be easy to see what can be dropped if count(*) is not present).
The rewriting Q has five subqueries, specified using a with clause:
candidatesSubQuery, countViolSubQuery, contribAllSubQuery, contribConsistentSubQuery, and
contribNonConsistentSubQuery.
As in the algorithm RewriteForestSQL, the purpose of candidatesSubQuery is to
determine the values for the key-root attributes that should be considered by the other
subqueries. The subquery countViolSubQuery has the same purpose (counting the num-
ber of violations per key-root value) as the subquery of the same name in the rewrit-
ing RewriteForestSQL. One difference is that here we need to compute the attribute
satConds which keeps track of whether each tuple satisfies the conditions of the query
(denoted as CONDS). The other difference is that in the select clause of the subquery,
we must project on the aggregation attributes since their values are needed to perform
aggregation in the rest of the rewriting.
The other three subqueries are used to compute the “contributions” to the lower and
upper bounds of each aggregate result. The subquery contribAllSubQuery computes,
for each instantiation of the key-root and projecting attributes, the minimum and max-
imum value for each aggregation attribute. In particular, in the subquery we group by
K1, . . . , Kn, S1, . . . , Sl (the key-root and projecting attributes), and for each aggregation
Fi(Ai) in the select clause of q, we compute attributes bottomAi and topAi as min(Ai)
and max(Ai), respectively. We also compute an attribute countProjection, to keep
track of the projection on nonkey attributes.
The subqueries contribConsistentSubQuery and contribNonConsistentSubQuery
compute the contribution of the “consistent” and “nonconsistent” tuples to the aggre-
gation. The former are the tuples whose key-root values satisfy the following two con-
ditions. First, they have the same value for the projecting attributes in every tuple
where they appear (checked with condition countProjection = 1). Second, they are
not involved in a violation of the selection conditions CONDS in any of the tuples where
they appear (checked with condition countViol=0). The tuples that violate at least
one of these conditions are considered “nonconsistent” and dealt with in the subquery
contribNonConsistentSubQuery.
For the “consistent” tuples, the contributions computed in contribConsistentSubQuery
correspond to the bottom and top values from contribAllSubQuery. That is,
the attributes bottomAi and topAi of contribAllSubQuery appear in the select clause
of contribConsistentSubQuery. The computation of the contributions of the “noncon-
sistent” tuples is more involved. In contribNonConsistentSubQuery, the expression of
the select clause that handles the contributions is obtained by calling the procedure
GetBoundsNonConsistent given in Figure 6.9. Notice in the figure that the contributions
are different depending on the aggregation function. The rationale and correctness proof
for these contributions were given in Chapter 4. In the figure, we do not include the 0-ary
operator count(*). For this operator, we need to return the attributes bottomCount and
topCount with values of zero and one, respectively.
In the subqueries, we project not only on the projecting attributes S1, . . . , Sl but
also on the root-key attributes K1, . . . , Kn. However, in the main query of the rewriting
we project and group by only the attributes S1, . . . , Sl (i.e., we project out the key-root
attributes). In this way, the rewritten query Q and the input query q return tuples
for the same set of attributes. We also compute the greatest lower bound (glbAi) and
least upper bound (lubAi) for each tuple of values for the projecting attributes. This
is obtained by performing the corresponding aggregation function (min, max, or sum) on
the top and bottom values computed in the previous subqueries. For the 0-ary func-
tion count(*), the bounds are computed by summing up the values of the attributes
bottomCount and topCount from the previous subqueries. Notice that there is also a
condition having sum(bottomCount) > 0. This is included in order to ensure that the
tuples for the projecting attributes are consistent answers.
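The treatment of count(*) described above can be illustrated on a special case. The sketch below is a simplification and not the full algorithm: it assumes a query with no where clause and no aggregation attributes (select salary, count(*) from employee group by salary), so candidatesSubQuery, countViol, satConds, and the call to GetBoundsNonConsistent are dropped. It runs on SQLite 3.25 or later through Python's sqlite3 module, and the instance is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (emplKey text, salary integer);
    -- John violates the key: his salary is 1000 in some repairs, 2000 in others.
    insert into employee values ('John', 1000), ('John', 2000),
                                ('Mary', 1000), ('Ann', 1000);
""")

count_bounds_sql = """
with countViolSubQuery as (
    select emplKey, salary,
           rank() over (partition by emplKey order by salary) as rankProjection
    from employee),
contribAllSubQuery as (
    select emplKey, salary,
           max(rankProjection) over (partition by emplKey) as countProjection
    from countViolSubQuery
    group by emplKey, salary, rankProjection),
contribConsistentSubQuery as (
    -- an employee with one salary is counted in every repair
    select emplKey, salary, 1 as bottomCount, 1 as topCount
    from contribAllSubQuery
    where countProjection = 1),
contribNonConsistentSubQuery as (
    -- an employee with several salaries may or may not be counted
    select emplKey, salary, 0 as bottomCount, 1 as topCount
    from contribAllSubQuery
    where countProjection > 1)
select salary,
       sum(bottomCount) as glbCount,
       sum(topCount) as lubCount
from (select * from contribConsistentSubQuery
      union all
      select * from contribNonConsistentSubQuery) q
group by salary
having sum(bottomCount) > 0
"""

bounds = conn.execute(count_bounds_sql).fetchall()
print(bounds)
```

In the two repairs of this instance, salary 1000 is held by either three or two employees, which matches the range returned by the query. Salary 2000 is dropped by the having clause because it is absent from the repair in which John earns 1000.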
For the sake of clarity, we omitted the order by clause in the query q. However,
dealing with ordering in the rewriting is quite easy. We just need to add the attributes
of the order by clause of q to the select clause of the subqueries, and finally add an
order by clause to the main query. The only special case that must be considered
is when an aggregate attribute appears in the order by clause. Since for each aggregate
attribute of q we have two attributes in the rewritten query (one for each bound), we
must (arbitrarily) decide whether the ordering will be by the greatest lower bound or the
least upper bound.
Algorithm RewriteAggSQL(q, Σ)
Input: q, a SQL query in Csqlaggforest of the form
           select <list of attributes>, <list of aggregation functions>
           from <list of relations>
           where <list of conditions>
           group by <list of attributes>
       Σ, a set of key constraints (one per relation)
Output: Q, a SQL query that computes aggconsistentΣ(q, I) for every database I

Let F1(A1), . . . , Fu(Au) be the aggregation function applications in the select clause
    of the query, where each Fi is an aggregation function, and each Ai is an attribute
    from a relation that appears in the from clause
Let S1, . . . , Sl be the attributes in the select clause of q (by definition of Csqlaggforest,
    these are the attributes in the group by clause as well)
Let G be the join graph (forest) of q
Let r1, . . . , rm be the relations at the root of some tree of G
Let K1, . . . , Kn be the attributes in the keys of r1, . . . , rm
Let CONDS be the list of conditions in the where clause
Let JOINS be the expression obtained by calling the procedure
    GetJoinsExpression(G, CONDS) of Figure 6.6
Let Q be the following SQL query:
    with candidatesSubQuery as (
        select K1 as cK1, . . . , Kn as cKn
        from <list of relations in q>
        where CONDS ),
    countViolSubQuery as (
        select K1, . . . , Kn, S1, . . . , Sl, A1, . . . , Au,
               rank() over (partition by K1, . . . , Kn
                            order by S1, . . . , Sl) as rankProjection,
               sum(case when CONDS then 0 else 1 end)
                   over (partition by K1, . . . , Kn) as countViol,
               case when CONDS then 'yes' else 'no' end as satConds
        from JOINS
        where exists (select * from candidatesSubQuery
                      where K1 = cK1 and . . . and Kn = cKn) ),
    contribAllSubQuery as (
        select K1, . . . , Kn, S1, . . . , Sl,
               min(A1) as bottomA1, max(A1) as topA1, . . . ,
               min(Au) as bottomAu, max(Au) as topAu,
               max(rankProjection) over (partition by K1, . . . , Kn)
                   as countProjection,
               countViol
        from countViolSubQuery
        where satConds = 'yes'
        group by K1, . . . , Kn, S1, . . . , Sl, countViol, rankProjection ),
    contribConsistentSubQuery as (
        select K1, . . . , Kn, S1, . . . , Sl,
               bottomA1, topA1, . . . , bottomAu, topAu,
               1 as bottomCount,
               1 as topCount
        from contribAllSubQuery
        where countProjection = 1 and countViol = 0 ),
    contribNonConsistentSubQuery as (
        select K1, . . . , Kn, S1, . . . , Sl,
               GetBoundsNonConsistent(F1, A1), . . . , GetBoundsNonConsistent(Fu, Au),
               0 as bottomCount,
               1 as topCount
        from contribAllSubQuery
        where countProjection > 1 or countViol >= 1 )
    select S1, . . . , Sl,
           F1(bottomA1) as glbA1, F1(topA1) as lubA1, . . . ,
           Fu(bottomAu) as glbAu, Fu(topAu) as lubAu,
           sum(bottomCount) as glbCount, sum(topCount) as lubCount
    from ( select * from contribConsistentSubQuery
           union all
           select * from contribNonConsistentSubQuery ) q
    group by S1, . . . , Sl
    having sum(bottomCount) > 0
return Q
Figure 6.8: SQL query rewriting algorithm for SPJ queries in Csqlaggforest
Algorithm GetBoundsNonConsistent
Input: Fi, one of the aggregation functions sum, min, max
Ai, an attribute
Output: a subexpression of a SQL query
if Fi = sum then
return “case when
bottomAi < 0 then bottomAi
else 0 end as bottomAi,
case when
topAi > 0 then topAi
else 0 end as topAi”
end if
if Fi = min then
return “bottomAi, 0 as topAi”
end if
if Fi = max then
return “0 as bottomAi, topAi”
end if
Figure 6.9: Algorithm to obtain the bottom and top contributions of “nonconsistent”
tuples
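The procedure of Figure 6.9 can be sketched as a small string-building function. This is only an illustration in Python, not ConQuer's implementation; the function name get_bounds_non_consistent and the idea of returning the fragment as a Python string are assumptions made here.

```python
def get_bounds_non_consistent(agg: str, attr: str) -> str:
    """Return the select-list fragment for the "nonconsistent" tuples.

    agg is one of "sum", "min", "max"; attr is the aggregated attribute name.
    The fragments follow Figure 6.9 (hypothetical Python rendering).
    """
    bottom, top = f"bottom{attr}", f"top{attr}"
    if agg == "sum":
        # A nonconsistent tuple may be absent from a repair, so it can only
        # lower the sum when negative and only raise it when positive.
        return (
            f"case when {bottom} < 0 then {bottom} else 0 end as {bottom}, "
            f"case when {top} > 0 then {top} else 0 end as {top}"
        )
    if agg == "min":
        return f"{bottom}, 0 as {top}"
    if agg == "max":
        return f"0 as {bottom}, {top}"
    raise ValueError(f"unsupported aggregation function: {agg}")
```

For instance, calling it with "max" and "Salary" yields the fragment "0 as bottomSalary, topSalary", which would be spliced into the select clause of contribNonConsistentSubQuery.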
6.3.2 Examples
We next illustrate the rewriting for a query that uses the count aggregation function.
Example 6.6. Let R be a schema with relation employee(emplKey, salary, age). Con-
sider a SQL query q5 that, for each age in the database, gives the number of occurrences
of the age on tuples for employees whose salary is less than or equal to 1000.
q5: select age, count(*)
from employee
where salary <= 1000
group by age
In the aggregate conjunctive query notation introduced in Chapter 4, q5 can be written
as follows.
q5(a, cnt) = select a, count(*)
from employee(e, s, a) ∧ s ≤ 1000
group by a
The above query is in the class Caggforest for which we gave a query rewriting algorithm
in Chapter 4. A key idea of that algorithm is to first produce a first-order rewriting for
a conjunctive query, and then perform aggregation on the result of the first-order query.
For our example, this conjunctive query is q′(e, a) = ∃s.employee(e, s, a) ∧ s ≤ 1000. Let
QConsistent(e, a) denote the result of invoking RewriteForest(q′, Σ) (the algorithm
introduced in Chapter 3).
Let Q5 be the query rewriting for q5 obtained by invoking RewriteCount(q5, Σ) (the
algorithm of Figure 4.1 of Chapter 4). In that rewriting, the greatest lower bound is
obtained as follows:
QGlb(a, glb) = select a, count(*)
from QConsistent(e, a)
group by a
Notice that aggregation is performed on the result of the first-order query QConsistent(e, a).
Thus, for computing the greatest lower bound in the SQL rewriting, we can reuse the
algorithm RewriteForestSQL introduced in Section 6.2. In particular, we will use the next
three subqueries, which are similar to those that would be produced by RewriteForestSQL(q′, Σ)
(we will show the differences next).
with candidatesSubQuery as (
  select emplKey
  from employee
  where salary <= 1000 )
with countViolSubQuery as (
  select emplKey,age,
    rank() over (partition by emplKey
                 order by age) as rankProjection,
    sum(case when salary <= 1000 then 0 else 1 end)
      over (partition by emplKey) as countViol,
    case when salary <= 1000 then 'yes' else 'no' end
      as satConds
  from employee
  where exists (select *
                from candidatesSubQuery C where
                C.emplKey=employee.emplKey) )
with contribAllSubQuery as (
  select emplKey,age,
    max(rankProjection) over (partition by emplKey)
      as countProjection,
    countViol
  from countViolSubQuery
  where satConds='yes'
  group by emplKey,age,countViol,rankProjection )
The above subqueries differ from the ones that would be produced by Rewrite-
ForestSQL in the following aspects. In countViolSubQuery, we compute an attribute
satConds that keeps track of whether each tuple satisfies or violates the selection con-
dition of q5 (i.e., that the salary is less than or equal to 1000). This is different from
the attribute countViol because countViol counts the violations for all tuples where a
key value (employee name, in this case) appears, whereas satConds may take different
values on different tuples of the same employee, depending on the salary that appears in
the tuple. The third subquery corresponds to the subquery countProjSubQuery of the
algorithm RewriteForestSQL, but it has a different name here (contribAllSubQuery)
because, as we will show shortly, it is used to compute the “contribution” of each tuple
to the lower and upper bounds of count(*). In this subquery, we check the condition
satConds='yes'. The intuitive reason is that the tuples that do not satisfy the
conditions of q5 (and hence have satConds = 'no') contribute neither to the lower nor
to the upper bound of count(*), and should thus be filtered out.
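To make the difference between countViol and satConds concrete, the following Python sketch recomputes the three derived attributes of countViolSubQuery on a small instance. The sample tuples and the helper name count_viol_rows are assumptions made for illustration; the sketch models the SQL semantics and is not part of ConQuer.

```python
from collections import defaultdict

# Hypothetical employee(emplKey, salary, age) tuples; emplKey may repeat.
EMPLOYEE = [
    ("e2", 800, 30), ("e2", 1200, 30),
    ("e3", 700, 30), ("e3", 600, 40),
    ("e5", 2000, 50),
]

def count_viol_rows(rows, limit=1000):
    """Mimic countViolSubQuery for q5: one output row per tuple of a candidate key."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[0]].append(row)
    out = []
    for key, tuples in by_key.items():
        # candidatesSubQuery: keep keys with at least one tuple satisfying the condition.
        if not any(salary <= limit for _, salary, _ in tuples):
            continue
        # countViol is the same for every tuple of the key.
        count_viol = sum(1 for _, salary, _ in tuples if salary > limit)
        for _, salary, age in tuples:
            # rank() with ties: 1 + number of tuples of this key with a smaller age.
            rank = 1 + sum(1 for _, _, other in tuples if other < age)
            # satConds is computed per tuple.
            sat = "yes" if salary <= limit else "no"
            out.append((key, age, rank, count_viol, sat))
    return out
```

On this instance, both tuples of e2 carry countViol = 1, but only the one with salary 800 has satConds = 'yes'; e5 never satisfies the condition and is dropped by the candidates filter.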
Let us now consider the computation of the lowest upper bound. In the query Q5
returned by RewriteCount, this bound is obtained as follows:
QLub(a, lub) = select a, count(*)
from q′(e, a) ∧ (∃e.QConsistent(e, a))
group by a
In this case, aggregation is done on the result of the following first-order expression:
q′(e, a)∧(∃e.QConsistent(e, a)). The naive way of writing this expression in SQL may be
inefficient because QConsistent already contains q′ as a subexpression. A more efficient
way of writing Q5 in SQL involves computing the “contributions” of each tuple to the
value of count(*), with the two subqueries shown next.
One of the subqueries (called contribConsistentSubQuery) computes the contribu-
tion of the “consistent” tuples. These are the tuples for employees that (1) have the
same age (the attribute in the select clause of q5) in every tuple where they appear;
and (2) are not involved in a violation of the conditions of q5 in any of the tuples where
they appear (i.e., their salary is always less than or equal to 1000). This can be checked
with the condition countProjection = 1 and countViol=0. In addition, the subquery
has attributes bottomCount and topCount that are used in the main body of the query
to combine the contributions of the “consistent” and “nonconsistent” tuples. For the
consistent tuples, the contribution is one to both the lower and upper bounds.
with contribConsistentSubQuery as (
  select emplKey,age,
    1 as bottomCount,
    1 as topCount
  from contribAllSubQuery
  where countProjection = 1 and countViol=0 )
The other subquery (called contribNonConsistentSubQuery) computes the contri-
butions of the “nonconsistent” tuples. We give this name to the tuples that are not
in the consistent answer of q′, but do satisfy q′. These tuples do not contribute to
the greatest lower bound of count(*), but they may contribute to the lowest upper
bound. In the SQL rewriting, the nonconsistent tuples are captured with the condition
countProjection > 1 or countViol >= 1. In addition, the subquery has attributes
bottomCount and topCount that are used in the main body of the query to combine
the contributions of the “consistent” and “nonconsistent” tuples. For the nonconsistent
tuples, the contribution is zero to the lower bound and one to the upper bound (compare
this to the consistent tuples, which contribute one to both bounds).
with contribNonConsistentSubQuery as (
  select emplKey,age,
    0 as bottomCount,
    1 as topCount
  from contribAllSubQuery
  where countProjection > 1 or countViol >= 1 )
Finally, the main body of the rewriting sums up the contributions of each tuple to the
lower and upper bounds, and projects out the attribute emplKey. The condition having
sum(bottomCount)>0 is used to ensure that we return ages that are consistent answers.
As we mentioned before, this corresponds to checking the condition ∃e.QConsistent(e, a).
select age,
  sum(bottomCount) as glb,
  sum(topCount) as lub
from
  ( select * from contribConsistentSubQuery
    union all
    select * from contribNonConsistentSubQuery ) q
group by age
having sum(bottomCount)>0
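As a sanity check of the contribution-based computation, the following Python sketch compares it against a brute-force enumeration of all repairs on a small instance. The sample data and all function names are assumptions for illustration; the contribution function is a simplified model of the SQL above, not the SQL itself.

```python
from collections import defaultdict
from itertools import product

# Hypothetical employee(emplKey, salary, age) instance that violates the key.
EMPLOYEE = [
    ("e1", 900, 30),
    ("e2", 800, 30), ("e2", 1200, 30),
    ("e3", 700, 30), ("e3", 600, 40),
    ("e4", 500, 40),
]

def group_by_key(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups

def q5(rows):
    """select age, count(*) from employee where salary <= 1000 group by age"""
    counts = defaultdict(int)
    for _, salary, age in rows:
        if salary <= 1000:
            counts[age] += 1
    return dict(counts)

def bounds_by_enumeration(rows):
    # A repair keeps exactly one tuple per key value.
    results = [q5(repair) for repair in product(*group_by_key(rows).values())]
    ages = set.intersection(*(set(r) for r in results))
    return {a: (min(r[a] for r in results), max(r[a] for r in results)) for a in ages}

def bounds_by_contributions(rows):
    glb, lub = defaultdict(int), defaultdict(int)
    for tuples in group_by_key(rows).values():
        count_viol = sum(1 for _, salary, _ in tuples if salary > 1000)
        sat_ages = {age for _, salary, age in tuples if salary <= 1000}
        # Plays the role of "countProjection = 1 and countViol = 0".
        consistent = count_viol == 0 and len(sat_ages) == 1
        for age in sat_ages:
            lub[age] += 1                       # topCount
            glb[age] += 1 if consistent else 0  # bottomCount
    # having sum(bottomCount) > 0
    return {a: (glb[a], lub[a]) for a in lub if glb[a] > 0}
```

On this instance both functions agree: age 30 has the range [1, 3] and age 40 has the range [1, 2]. The enumeration is exponential in the number of conflicting keys, which is precisely what the rewriting avoids.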
In the next example, we illustrate the rewriting for a query that has the sum aggre-
gation function. The rewritings for the min and max aggregation functions are similar.
Example 6.7. Consider the same schema as in the previous example. Let q6 be a SQL
query that, for each age in the database, gives the sum of all salaries in the database
that are less than or equal to 1000.
q6: select age, sum(salary)
from employee
where salary <= 1000
group by age
The SQL rewriting of q6 is computed by ConQuer along the same lines as the rewriting
for query q5 of the previous example. As in that example, the rewriting starts with three
subqueries: candidatesSubQuery, countViolSubQuery and contribAllSubQuery. The
subquery countViolSubQuery counts the number of violations of the selection condition
for each key value (emplKey), and is the same as in the previous example, except that it
includes the attribute salary in its select clause. The subquery contribAllSubQuery
computes the contribution of all key values to the final result. The only difference with
the previous example is that here we compute the minimum and maximum salary for
each employee (attributes bottomSalary and topSalary). This was not necessary in
the previous example since count(*) is a 0-ary function, whereas sum is a unary function
(in this case, taking the argument salary).
with candidatesSubQuery as (
  select emplKey
  from employee
  where salary <= 1000 )
with countViolSubQuery as (
  select emplKey,age,salary,
    rank() over (partition by emplKey
                 order by age) as rankProjection,
    sum(case when salary <= 1000 then 0 else 1 end)
      over (partition by emplKey) as countViol,
    case when salary <= 1000 then 'yes' else 'no' end
      as satConds
  from employee
  where exists (select *
                from candidatesSubQuery C where
                C.emplKey=employee.emplKey) )
with contribAllSubQuery as (
  select emplKey,age,
    min(salary) as bottomSalary,
    max(salary) as topSalary,
    max(rankProjection) over (partition by emplKey)
      as countProjection,
    countViol
  from countViolSubQuery
  where satConds='yes'
  group by emplKey,age,countViol,rankProjection )
Then, as in the previous example, the rewriting computes the contributions from the
“consistent” and “nonconsistent” tuples. For clarity of presentation, we will assume that
all salaries are positive values (but in the general algorithm we deal with the case of
negative values as well). For the “consistent tuples” (whose contributions are computed
in contribConsistentSubQuery), the bottom and top salaries computed in contribAll-
SubQuery contribute to the greatest lower bounds and lowest upper bounds, respectively.
The top salary also contributes to the lowest upper bound of the “nonconsistent” tuples
(whose contributions are computed in contribNonConsistentSubQuery). However, as
we explained in Chapter 4, the bottom salary does not contribute to the greatest lower
bound. Therefore, the attribute bottomSalary of contribNonConsistentSubQuery gets
a value of zero.
with contribConsistentSubQuery as (
  select emplKey,age,
    bottomSalary,
    topSalary,
    1 as bottomCount
  from contribAllSubQuery
  where countProjection = 1 and countViol=0 )
with contribNonConsistentSubQuery as (
  select emplKey,age,
    0 as bottomSalary,
    topSalary,
    0 as bottomCount
  from contribAllSubQuery
  where countProjection > 1 or countViol >= 1 )
Finally, the main body of the rewriting sums up the contributions of each tuple
to the lower and upper bounds, and projects out the emplKey attribute. Notice that
as in the rewriting for query q5 of the previous example, we have a condition having
sum(bottomCount)>0. This is done because, again, we want to report only the ages that
appear for sure in every repair.
select age,
sum(bottomSalary) as glbSalary,
sum(topSalary) as lubSalary
from
( select * from contribConsistentSubQuery
union all
select * from contribNonConsistentSubQuery ) q
group by age
having sum(bottomCount)>0
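The same kind of check can be sketched for sum. The Python below assumes, as the example does, that all salaries are positive; the instance and the function names are hypothetical, and the contribution function is only a model of the subqueries above.

```python
from collections import defaultdict
from itertools import product

# Hypothetical employee(emplKey, salary, age) instance with positive salaries.
EMPLOYEE = [
    ("e1", 900, 30), ("e1", 950, 30),
    ("e2", 800, 30), ("e2", 1200, 30),
    ("e3", 700, 30), ("e3", 600, 40),
    ("e4", 500, 40),
]

def group_by_key(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups

def q6(rows):
    """select age, sum(salary) from employee where salary <= 1000 group by age"""
    sums = defaultdict(int)
    for _, salary, age in rows:
        if salary <= 1000:
            sums[age] += salary
    return dict(sums)

def sum_bounds_by_enumeration(rows):
    results = [q6(repair) for repair in product(*group_by_key(rows).values())]
    ages = set.intersection(*(set(r) for r in results))
    return {a: (min(r[a] for r in results), max(r[a] for r in results)) for a in ages}

def sum_bounds_by_contributions(rows):
    glb, lub, bottom_count = defaultdict(int), defaultdict(int), defaultdict(int)
    for tuples in group_by_key(rows).values():
        count_viol = sum(1 for _, salary, _ in tuples if salary > 1000)
        by_age = defaultdict(list)
        for _, salary, age in tuples:
            if salary <= 1000:
                by_age[age].append(salary)
        consistent = count_viol == 0 and len(by_age) == 1
        for age, salaries in by_age.items():
            lub[age] += max(salaries)      # topSalary
            if consistent:
                glb[age] += min(salaries)  # bottomSalary (0 otherwise)
                bottom_count[age] += 1
    # having sum(bottomCount) > 0
    return {a: (glb[a], lub[a]) for a in lub if bottom_count[a] > 0}
```

Here employee e1 is "consistent" even though it has two tuples, because both agree on the age and satisfy the condition; it contributes 900 to the lower bound and 950 to the upper bound of age 30.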
6.4 Exploiting Precomputed Annotations
The main focus of the thesis is on query processing directly on the inconsistent database.
However, in some circumstances, it may be advantageous to process the database offline
in order to materialize data structures with information about constraint violations. This
precomputed data could then be exploited during online query answering to improve the
performance of the queries.
In this section, we will present a simple offline precomputation scheme, and show the
rewritings that ConQuer produces in order to exploit it. The scheme is based on annota-
tions attached to each tuple. The annotation consists of just one bit that states whether
the tuple satisfies or violates a given key constraint. If annotations are present, then
ConQuer can produce a rewriting that exploits them. We call such a rewriting
annotation-aware. In the next example, we illustrate the annotation-aware rewritings. In the next
section, we will identify the scenarios where it is desirable to exploit the annotations, and
we will empirically validate the effectiveness of the annotation-aware rewritings.
Example 6.8. Let R be a schema with relations employee(emplKey, deptFKey) and
dept(deptKey, mgrName). We will give an example based on an SPJ query without
aggregation. However, the example shows all the ingredients of the rewritings on annotated
databases, and extending them to queries with aggregation is straightforward.
Consider a SQL query q7 that retrieves the names of all employees whose department
manager is Peter:
q7: select distinct emplKey
from employee,dept
where employee.deptFKey=dept.deptKey and dept.mgrName='Peter'
Consider the database I = {employee(John, Sales), employee(Mary,Engineering),
dept(Sales, Peter), dept(Sales, Tom), dept(Engineering, Peter)}. Suppose that we in-
struct ConQuer to process the database offline and annotate each tuple with a bit stating
whether it satisfies or violates the constraints of Σ. Assume that ConQuer augments the
set of attributes of each relation with an attribute called cons that stores the annotation.
The “annotated database” produced by ConQuer would then be the following.
employee
emplKey   deptFKey      cons
John      Sales         y
Mary      Engineering   y

dept
deptKey       mgrName   cons
Sales         Peter     n
Sales         Tom       n
Engineering   Peter     y
Note that the tuple for Mary in relation employee, and the tuple for Engineering in
relation dept have a value of 'y' in their cons attributes, meaning that they do not
violate any constraint. If we join these tuples, we get a tuple that satisfies query q7.
Furthermore, it is easy to see that this will be the only tuple in the result for Mary.
Thus, it must be a consistent answer.
In general, the join of consistent tuples (i.e., tuples where cons = 'y') produces
a consistent answer. For such tuples, it suffices to check whether the conditions of the
original query are satisfied (in this example, check that they satisfy q7). In this way, we
can avoid the possibly costly operations of the rewritings produced by the algorithms
RewriteForestSQL and RewriteAggSQL. In the rewriting, we capture these tuples in a
subquery called allConsistentSubQuery (allConsistent because they come from the
join of tuples all of which are consistent). The subquery consists of the input query and a
filter that requires every tuple in the join to have a value of 'y' in the cons attribute.
with allConsistentSubQuery as (
  select distinct emplKey
  from employee,dept
  where employee.deptFKey=dept.deptKey and dept.mgrName='Peter'
    and employee.cons='y' and dept.cons='y' )
Now, note that the tuple for John also satisfies the constraints and has a value of
'y' in its cons attribute. However, this tuple joins with the tuples for the Sales
department, which violate the key constraint of their relation (they are annotated with
'n'). If we join the tuple for John with the tuple dept(Sales, Peter), the result satisfies
q7. But if we join with dept(Sales, Tom), the result does not satisfy the query. Thus,
John is not a consistent answer to q7.
To keep track of the join of tuples that may violate a constraint, we produce a rewrit-
ing that is similar to the one that would be produced by RewriteForestSQL, the only
difference being that we augment the candidatesSubQuery subquery of the rewriting
with a condition checking whether the cons attribute of at least one of the joined tuples
is set to 'n'. In our example, we check the condition employee.cons='n' or
dept.cons='n'. The result obtained from these tuples is kept in a subquery called
someNonConsistentSubQuery (the name comes from the fact that some of the tuples of
the join may not be consistent).
with candidatesSubQuery as (
  select distinct emplKey
  from employee,dept
  where employee.deptFKey=dept.deptKey and dept.mgrName='Peter'
    and (employee.cons='n' or dept.cons='n') )
with countViolSubQuery as (
  select emplKey,
    sum(case
          when employee.deptFKey=dept.deptKey
               and dept.mgrName='Peter' then 0 else 1 end)
      as countViol
  from employee left outer join dept
    on employee.deptFKey=dept.deptKey
  where exists (select * from candidatesSubQuery C where C.emplKey=employee.emplKey)
  group by emplKey )
with someNonConsistentSubQuery as (
  select emplKey
  from countViolSubQuery
  where countViol = 0 )
Finally, the main body of the query takes the union of the tuples obtained with the
subqueries allConsistentSubQuery and someNonConsistentSubQuery.
select emplKey from
  ( select emplKey from allConsistentSubQuery
    union all
    select emplKey from someNonConsistentSubQuery ) q
Notice that this rewriting is correct even if annotations incorrectly mark a consistent
tuple as inconsistent. Hence, when deleting or updating a tuple, it is not mandatory to
update annotations.
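The offline annotation step and the reasoning of Example 6.8 can be sketched in Python as follows. The sketch treats the first attribute of each relation as its key, and the helper names (annotate, repairs, consistent_answers, all_consistent_answers) are assumptions for illustration. The brute-force enumeration of repairs is used only as a reference and is not how ConQuer computes answers.

```python
from collections import Counter
from itertools import product

# The database I of Example 6.8; the first attribute of each relation is its key.
EMPLOYEE = [("John", "Sales"), ("Mary", "Engineering")]
DEPT = [("Sales", "Peter"), ("Sales", "Tom"), ("Engineering", "Peter")]

def annotate(rows):
    """Append cons = 'y' if the tuple's key value is unique in the relation, else 'n'."""
    freq = Counter(row[0] for row in rows)
    return [row + ("y" if freq[row[0]] == 1 else "n",) for row in rows]

def q7(employee, dept):
    """Employees whose department manager is Peter."""
    return {e for e, d in employee for k, m in dept if d == k and m == "Peter"}

def repairs(rows):
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    # A repair keeps exactly one tuple per key value.
    return [list(choice) for choice in product(*groups.values())]

def consistent_answers(employee, dept):
    # Reference semantics: answers returned in every repair.
    results = [q7(e, d) for e in repairs(employee) for d in repairs(dept)]
    return set.intersection(*results)

def all_consistent_answers(employee, dept):
    # Counterpart of allConsistentSubQuery: join only tuples annotated 'y'.
    emp = [(e, d) for e, d, cons in annotate(employee) if cons == "y"]
    dep = [(k, m) for k, m, cons in annotate(dept) if cons == "y"]
    return q7(emp, dep)
```

On this instance the original query returns John and Mary, while both the reference semantics and the all-consistent join return only Mary. In general the all-consistent join yields only part of the consistent answers; the remainder comes from someNonConsistentSubQuery, which happens to be empty here.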
6.5 Related Work
In this section, we review systems for managing inconsistent databases that are related
to ConQuer. Hippo [CMS04b, CMS04a] is a system that produces consistent answers
for unions of quantifier-free conjunctive queries (that is, unions of queries in the class
presented by Arenas, Bertossi, and Chomicki [ABC99]). Hippo does not consider queries
with aggregation, grouping or bag semantics. Apart from the class of queries that it can
handle, Hippo differs from ConQuer in the fact that it is not based on query rewriting.
Rather, Hippo takes the more procedural approach of producing a Java program which
computes the consistent answers. Although the program does interact with an RDBMS
back-end, most of the processing is done by processing an (in-memory) conflict graph
data structure that contains all the tuples that violate the constraints. The system may
not be able to operate on databases where this data structure does not fit in memory.
Hippo has been shown to scale to databases of up to 300,000 tuples [CMS04b].
There are a number of systems for consistent query answering that rewrite queries into
powerful logics [CB00, LLR02, EFGL03, CB05]. Infomix [EFGL03] is a notable example
of such an approach. In Infomix, queries are rewritten into disjunctive logic programs.
Such programs are computationally more expensive than SQL, but also more expressive
and permit rewritings over a very rich class of query constraints. For example, Infomix
considers general functional, inclusion, and exclusion query constraints. These systems
focus on expressiveness, more than efficiency and scalability, and therefore address a
different design point than the one we are considering. To give an idea of the scale of
the difference, one of the few experimental studies available in the literature [EFGL03]
reports results for databases with at most 100 tuples violating primary key constraints
(over a database of 50,000 tuples). In contrast, the largest database that we used in the
experiments reported in the next chapter has 8.6 million inconsistent tuples (over a total
of 172 million tuples).
Chapter 7
Experimental Analysis
In this chapter, we validate the efficiency of ConQuer’s rewritings using IBM DB2 UDB
Version 8.2 (from now on, referred to as just DB2). In Section 7.1, we give a detailed
description of the experimental framework. Then, in Section 7.2, we report and analyze
the experimental results obtained within this framework.
7.1 Experimental Framework
7.1.1 System and Database Manager Configuration
The experiments were performed on a Sun v40z server class computer with 4 processors
and 8 GB of RAM, running RedHat Linux AS 4 kernel Version 2.6.9. The relational
database management system used to run the queries was IBM DB2 UDB Version 8.2.
We now describe some important parameters in the database configuration. The
buffer pool size was deliberately kept considerably below the system’s available memory.
This is because our aim is to test the overhead of the queries in environments where the
amount of primary memory is small compared to database size. In particular, the buffer
pool size was restricted to 400 MB (whereas the size of the largest database reported
here is 20 GB).
In order to reduce the number of variables to consider when comparing running
times, the query optimizer was set to use a degree of intra-parallelism (parameter
DFT_DEGREE) of 1, meaning that the query plan always chooses to use one processor, even
though there are four available in the system. The query optimization level, which
dictates the amount of time that the query optimizer may spend to produce a query plan,
was set to its highest value (parameter DFT_QUERYOPT was set to 9) since the time to
produce a plan is always negligible with respect to the time to execute the fairly complex
queries that we use in our experiments.
For all databases, statistics were created by running the DB2 RUNSTATS command.
The parameters for statistics gathering were set as follows: the number of “most frequent”
values to be collected from each table (parameter NUM_FREQVALUES) was set to 10;
and the number of quantiles for the distributions (parameter NUM_QUANTILES) was
set to 20.
We created clustered indices for the (potentially violated) primary key attributes.
Notice that these indices cannot be declared as “unique” since the database may be
inconsistent. With respect to the annotations introduced in Section 6.4, we added an
attribute called cons to each table, and used it to keep track of whether each tuple satisfies
or violates the primary key constraints. For each relation, we declared a secondary index
on the attributes of the key plus the cons attribute. The values for the cons attributes
are computed offline. However, it is important to point out that in the experimental
results that we report here, this attribute is used only where we explicitly say that the
rewritings are annotation-aware. By default, we assume that the rewritings work on the
inconsistent database without exploiting precomputed information.
Regarding the indices of the database, we considered a worst-case and a typical sce-
nario. In the worst-case scenario, the only indices in the database are those for the key
attributes and the annotations. We also considered a more typical scenario, where several
indices are declared. In particular, we created all indices suggested by DB2’s Configu-
ration Advisor. In each database, the size of the indices proposed by the Configuration
Advisor corresponds to a third of the size of the database. The indices are shown in
Appendix B.
7.1.2 Inconsistent Database Instances
For the inconsistent databases, we employed the schema and data of TPC-H, the standard
benchmark for decision support systems. The schema is shown in Figure 7.1. The sizes
of the tables are also shown in Figure 7.1 (under their names), and are given in number
of tuples for a 1 GB instance. For example, the relation lineitem has 6 million tuples on
a 1 GB instance. As per the TPC-H standard, all tables except nation and region are
scaled proportionally to the size of the database (this is indicated with SF in the figure).
Figure 7.1: Schema specified in the TPC-H standard (taken from [TPC03])
The parameters used to build the databases are the following:
• The size s of the database. We considered databases of various sizes, up to 20
GB (172 million tuples). Notice that this size is 50 times larger than the size of the
buffer pool of the database (whose size is 400 MB).
• The percentage p of the database that is inconsistent. For example, on a 1 GB
instance (8.6 million tuples) where p is 25%, there are 2.15 million tuples that
violate the key constraints of the schema. We created the databases in such a way
that every relation has the same value of p as the entire database. We experimented
with values of p ranging from 0% (totally consistent database) to 25%.
• The number of tuples n that share a common key value (and hence violate a key
constraint), for every key value in the inconsistent portion of the database. For
example, if n = 2, then every key value in the inconsistent portion of the database
appears in exactly two tuples. The value is fixed for every tuple of the inconsistent
portion (i.e., every key value of the database appears exactly once or n times). We
experimented with values of n ranging from 2 to 7.
The TPC Consortium provides a data generator called dbgen that produces database
instances compliant with the standard.1 Since the TPC-H standard does not consider
inconsistent databases, dbgen creates instances that do not violate the primary key con-
straints of the schema. For this reason, we modified the source code of dbgen in order to
produce a generator that creates inconsistent databases. The database generator creates
each table as follows. Let l be the total number of tuples to be generated in the table. First,
we generate $l \cdot \left(1 - \frac{p}{100} + \frac{p}{100n}\right)$ tuples. Second, we randomly select
$l \cdot \frac{p}{100n}$ tuples from them.
Third, for each selected tuple $\vec{t}$, we generate n − 1 additional tuples by invoking the tuple
generation functions of dbgen. We replace the key values of the n − 1 generated tuples
with the key value of $\vec{t}$.
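The arithmetic of the three generation steps can be checked with a short Python sketch. The function name generation_plan is hypothetical, and applying the formula to the whole 1 GB instance (8.6 million tuples) rather than to a single table is a simplification made here for illustration.

```python
from fractions import Fraction

def generation_plan(l, p, n):
    """Tuple counts for one table of l tuples, p percent inconsistent, n tuples per conflicting key."""
    # Step 1: tuples generated initially.
    base = Fraction(l) * (1 - Fraction(p, 100) + Fraction(p, 100 * n))
    # Step 2: tuples selected to become conflicting keys.
    selected = Fraction(l) * Fraction(p, 100 * n)
    # Step 3: n - 1 extra tuples per selected tuple, sharing its key value.
    extra = selected * (n - 1)
    return {
        "base": base,
        "selected": selected,
        "extra": extra,
        "total": base + extra,
        "inconsistent": selected * n,
    }
```

With l = 8,600,000, p = 25 and n = 2, the plan generates 7,525,000 base tuples and 1,075,000 extra ones, for a total of exactly l tuples, of which 2,150,000 (25%) share a key value with another tuple. This agrees with the 2.15 million figure quoted above.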
7.1.3 Workload
The experiments were performed using queries specified in the TPC-H standard. There
are twenty-two queries in the standard, twelve of which are aggregate conjunctive queries,
the type of queries that we handle in this work. The other ten queries have features
that are beyond aggregate conjunctive queries, such as aggregation in nested subqueries
(Queries 2, 11, 15, 17, 18 and 20 of the specification), left outer joins (Query 13), and
negation (Queries 16, 21, and 22).
1 The database generator can be obtained from the TPC Consortium’s website at http://www.tpc.org
In our experiments, we will focus on eleven queries from the TPC-H specification
(Queries 1, 3, 4, 6, 7, 8, 9, 10, 12, 14, and 19). The original TPC-H queries together with
their rewritings are given in Appendix A. Notice that, of the twelve aggregate conjunctive
queries, we rule out only one query. This is Query 5 of the specification, which contains
a nonkey-to-nonkey join, which we cannot handle with our query rewriting algorithm.
(Following the results of Chapter 5, Query 5 is in class C∗ and thus has no query rewriting
into SQL). Of the eleven queries that we consider, six are strictly in class Csqlaggforest
(Queries 3, 4, 6, 9, 10, and 12), and the other five can be handled with our rewriting
algorithm RewriteAggSQL with little or no modification for the following reasons. First,
Queries 7 and 8 have repeated relation symbols that appear at leaf nodes of the join
graph. The algorithm RewriteAggSQL can handle this case, since the nonkey variables of
these repeated relation symbols are not involved in any join. Second, Queries 7 and 19
have disjunction involving equalities of attributes to constants. We showed in Chapter
3 that it is quite easy to extend the algorithm that produces a first-order rewriting to
handle this case, and the SQL rewriting algorithm RewriteAggSQL of this chapter can
be used for such cases without modification (the disjunction is considered part of the
selection conditions in the expression CONDS of Figure 6.8). Finally, Queries 8 and
14 perform an arithmetic operation (division) on the result of two aggregate operators,
and Query 1 computes an average. In such cases, we give bounds that are sound, but
not tight.2
In Figure 7.2, we summarize the main characteristics of the eleven queries used in
the experiments. For each query, we give the number of relations in the from clause,
the number of selection conditions in the where clause (this excludes join conditions),
the selectivity (as the percentage of joined tuples that satisfy the selection conditions of
the query), the number of projecting attributes in the select clause, and the number of
aggregate functions in the select clause. The queries in the TPC-H specification are pa-
rameterized, and the standard suggests values for these parameters. In the experiments,
we used the suggested values in all the queries. The selectivities reported in Figure 7.2
are based on these parameters.
2 For the queries with the sum operator, all ranges are tight since the queries in the TPC-H standard only aggregate over attributes with positive values.
query   relations   selection    selectivity   projecting   aggregation
                    conditions   (in %)        attrs        functions
Q1      1           1            98.56         2            8
Q3      3           3            0.51          3            1
Q4      2           3            2.35          1            1
Q6      1           4            1.91          0            1
Q7      5           4            0.10          3            1
Q8      7           4            0.04          1            2
Q9      6           1            5.13          2            1
Q10     4           3            1.87          7            1
Q12     2           5            0.51          2            2
Q14     2           2            1.23          0            2
Q19     2           24           0.001         0            1
Figure 7.2: TPC-H queries used in the experiments
7.2 Experimental Results
In this section, we report the results of the experiments that we performed in order to
quantify the overhead of the rewritings produced by ConQuer.
7.2.1 Scalability
In this subsection, we study the scalability of ConQuer’s approach. In particular, we
show the effect of the size of the inconsistent databases on the overhead of the rewritten
queries. In Figure 7.3, we report the overhead of the eleven rewritten queries on a number
of databases where we fix the degree of inconsistency to 5% of the database (p = 5%), and
2 conflicts per inconsistent key value (n = 2). The size of the databases (reported on the
x-axis) ranges from 1 GB to 20 GB (that is, from 8.6 million tuples to 172 million tuples).
The databases are generated independently of each other, and correspond to the scenario
where indices are created only for the key attributes. On the y-axis, we report the
overhead of the rewritten queries, computed as the ratio of the running time of
the rewritten query to the running time of the original (non-rewritten) query. The
rewritings reported in the figure do not exploit annotations (i.e., they are unaware of
annotations, if any, computed as explained in Section 6.4).
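As a minimal illustration of how the overhead is computed, the following Python sketch takes two measured running times and returns their ratio. The function name and the sample timings are hypothetical and are not taken from the experiments.

```python
def overhead(rewritten_seconds: float, original_seconds: float) -> float:
    """Ratio of the rewritten query's running time to the original query's."""
    if original_seconds <= 0:
        raise ValueError("original running time must be positive")
    return rewritten_seconds / original_seconds


# Hypothetical timings: a rewriting that takes 290 s against an original of 50 s.
print(overhead(290.0, 50.0))
```

An overhead of 1 means that the rewriting runs as fast as the original query, and a value below 1 means that the rewriting runs faster.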
For presentation purposes, we split the queries into three graphs. The queries are
grouped based on the behaviour of the overhead as the size of the databases increases.
The graph at the top shows queries where the overhead initially increases, but then
remains constant or decreases (Queries 1, 7, 12, 14). The graph in the middle shows
queries where the overhead increases monotonically with the size of the database (Queries
3, 8, 10). The rest of the queries are shown in the graph at the bottom (Queries 4, 6, 9,
19).
We identified two factors that have a significant impact on the overhead of the rewrit-
ings: the selectivity of the original queries, and the query plans chosen by DB2’s opti-
mizer. Let us start with the selectivity of the queries. To understand their effect, recall
that in the SQL rewriting algorithm RewriteAggSQL of Figure 6.8, there is a subquery
called candidatesSubQuery that is designed to exploit the selectivity of the original
queries. In particular, this subquery returns only the values for the root-key attributes
that satisfy the conditions of the original query. More specifically, let q be a query,
K1, . . . , Kn be the attributes that appear at some root of the join graph of q, and CONDS
be the selection conditions of q. Then, the rewriting produced by RewriteAggSQL(q, Σ)
has a subquery of the following form:
with candidatesSubQuery as (
select K1 as cK1,. . . ,Kn as cKn
from <list of relations in q>
where CONDS )
Clearly, the lower the selectivity of the original query q, the fewer tuples are returned
by candidatesSubQuery. The rest of the rewriting operates on the result of the following
subquery called countViolSubQuery.
with countViolSubQuery as (
select K1, . . . , Kn, S1, . . . , Sl, A1, . . . , Au,
rank() over (partition by K1, . . . , Kn
order by S1, . . . , Sl) as rankProjection,
sum(case when CONDS then 0 else 1 end)
over (partition by K1, . . . , Kn) as countViol,
case when CONDS then 'yes' else 'no' end as satConds
from JOINS
where exists (select * from candidatesSubQuery
where K1 = cK1 and . . . and Kn = cKn) )

[Figure 7.3: three plots of Overhead (y-axis) against Size in GB (x-axis, 0 to 20). Top: Queries 1, 7, 12, 14. Middle: Queries 3, 8, 10. Bottom: Queries 4, 6, 9, 19.]
Figure 7.3: Size of the inconsistent database vs. overhead (running time of rewritten
query over running time of original query) for p = 5% and n = 2

Notice that the where clause of the subquery restricts the focus to the tuples that
join with those returned by candidatesSubQuery. Since all further processing in the
rewriting is done on the result of countViolSubQuery, the selectivity of the original
query q significantly affects the running time of the rewriting.
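The interplay of the two subqueries can be tried on a toy instance. The sketch below is a simplified stand-in, not the output of RewriteAggSQL: it assumes a single hypothetical relation customer(custkey, acctbal) with key custkey, a single selection condition acctbal > threshold, and it replaces the window function by a plain group by so that it runs on SQLite through Python's sqlite3 module. The table, column, and function names are assumptions made for illustration.

```python
import sqlite3

# Simplified stand-in for the two subqueries; this is not the output of RewriteAggSQL.
SKETCH_SQL = """
with candidatesSubQuery as (
    select distinct custkey as cK
    from customer
    where acctbal > :threshold
),
countViolSubQuery as (
    select custkey,
           sum(case when acctbal > :threshold then 0 else 1 end) as countViol
    from customer
    where exists (select 1 from candidatesSubQuery where custkey = cK)
    group by custkey
)
select custkey from countViolSubQuery where countViol = 0 order by custkey
"""


def consistent_keys(rows, threshold):
    """Return the key values whose tuples all satisfy acctbal > threshold."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table customer (custkey integer, acctbal real)")
        conn.executemany("insert into customer values (?, ?)", rows)
        cursor = conn.execute(SKETCH_SQL, {"threshold": threshold})
        return [key for (key,) in cursor.fetchall()]
    finally:
        conn.close()


# Key values 1 and 3 violate the key constraint (two tuples each).
ROWS = [(1, 100.0), (1, 200.0), (2, 50.0), (3, 500.0), (3, 5.0)]
print(consistent_keys(ROWS, 90))
```

On this instance, key value 2 is never a candidate, so its tuples are not processed by the second subquery at all. Key value 3 is a candidate but has one tuple that violates the condition, so only key value 1 is returned.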
We can see in Figure 7.2 that the selectivity of Query 1 is much higher than the
selectivity of all the other queries. More specifically, Query 1 has a selectivity of 98.5%,
whereas the highest selectivity of the other ten queries is 5.1% (Query 9). This explains
the high overhead of the rewriting of Query 1, which goes up to 5.8 times the running
time of the original query on the 20 GB database. The selectivity also explains the low
overhead of Query 19. In this case, the overhead of the rewriting goes up to just 1.2
times the running time of the original query on the 20 GB instance. Notice in Figure 7.2
that this query has considerably lower selectivity than all other queries: 0.001%. Thus, in
effect, the computation of candidatesSubQuery accounts for most of the running time
of the rewriting, with the computation of the other subqueries having a negligible cost.
We also observed that the query plans selected by DB2 have an effect on the over-
head. For example, all queries involve lineitem, the largest relation of the TPC-H
database, which contains 70% of all tuples in the database. Except for Queries 4 and
10, the running time of all queries (and their rewritings) is dominated by the size of the
lineitem relation. In particular, for all those queries, DB2 selects plans that involve a
costly table scan of the lineitem relation. In contrast, for queries 4 and 10 (and their
rewritings), the running time is dominated by the size of the smaller relation orders
for the following reasons. First, the plans involve a table scan of relation orders, with
the access to lineitem being done through its clustered index. Second, a low selectivity
predicate is applied on the tuples retrieved from orders, which are then joined with
those coming from lineitem. Thus, only a very small fraction of the tuples of lineitem
are actually accessed. We conjecture that for this reason most of the processing of both
the original and rewritten queries can be done in main memory, hence the low overhead
of the rewritings of Queries 4 and 10.
The low overhead of Query 6 (with a maximum of 2.1 on the 10 GB instance) can be
explained in terms of the shape of its rewriting. Notice in Figure 7.2 that this is a query
on one relation (hence no joins), and with a relatively low selectivity. Furthermore, it
does not perform any grouping (it has no projecting attributes) and computes just one
aggregate function. This results in a simpler and more efficient rewriting. In particular,
the attributes countProjection and rankProjection of the rewriting do not need to
be computed.
In Figure 7.3, we can observe three trends in the growth of the overhead as we increase
the size of the instances. For some queries, the overhead increases slowly with the size
of the instances (Queries 4, 6, 10, and 19). These are the low-overhead rewritings, and
thus the processing of both the original queries and their rewritings can be done mostly
in main memory. For others, the overhead increases monotonically at a relatively high
rate (Queries 3 and 8). A possible explanation for this behaviour is that the original
queries can do most of their processing in main memory, whereas this is not the case for
the more costly rewritings. Finally, for another group of queries (Queries 1, 7, 9, 12, and
14), the overhead grows initially, and then either remains constant or decreases. The
reason is that as the size of the databases grows, the amount of available main memory
becomes small not only for the rewritten queries but also for the original queries. Hence,
the rate of growth of the ratio between them diminishes.
For Query 9, we slightly modified the query rewriting produced by RewriteAggSQL (the
modified rewriting is equivalent to the one produced by RewriteAggSQL). The reason for
this is that for the rewriting obtained with RewriteAggSQL, DB2 was producing a very in-
efficient query plan. For example, on a 2 GB database, the running time of the rewriting
was 28 times the running time of the original query.
We detected that the problem of the rewriting produced by RewriteAggSQL was in
the subquery candidatesSubQuery. To understand the reason, let us show a simplified
version of Query 9:
select n_name as nation,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity
from part, supplier, lineitem, partsupp, orders, nation
where s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%green%'
The subquery candidatesSubQuery produced by RewriteAggSQL is the following:
with candidatesSubQuery as (
select l_orderkey, l_linenumber
from part, supplier, lineitem, partsupp, orders, nation
where s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%green%' )
An important observation is that if we modify candidatesSubQuery, the rewrit-
ing will still be correct (i.e., compute the consistent answers of Query 9) as long as
candidatesSubQuery still returns the tuples that are candidates to be consistent an-
swers, i.e., that they satisfy the selection conditions of Query 9. Based on this observa-
tion, we modified the candidatesSubQuery subquery produced by RewriteAggSQL, and
detected that DB2 would produce a more efficient query plan. In particular, we removed
the relation partsupp from the from clause of candidatesSubQuery and the conditions
ps_suppkey = l_suppkey and ps_partkey = l_partkey from its where clause.
The overhead reported in Figure 7.3 corresponds to the modified rewriting. Notice
that we do not provide a value for the 20 GB database. The reason is that the execution
of the original Query 9 on the 20 GB database timed out in our experiments because
DB2 came up with a particularly inefficient plan, different from the one chosen for the
other instances.
Besides the peculiarities of each query, an important conclusion of these experiments
is that the query rewritings can scale to large database instances. Even for an instance
of 20 GB (172 million tuples) the overhead of the queries ranges from 1.2 (Query 19) to
5.8 (Query 1). This is remarkable if we take into account that the semantics of consistent
query answering is much more involved than the semantics of traditional query answering.
Let us now consider the rewritings that exploit annotations, as explained in Section
6.4. In our experiments, the only rewriting that benefited substantially from the
annotations was the one on Query 1. The other queries do not benefit from annotations due to
their low selectivity. Recall that Query 1 has a high selectivity of 98.5%, as opposed to
all other queries, whose selectivity is at most 5.1% (Query 9). Since the annotations (in
particular the cons attribute) are used in the where clause of one of the subqueries of the
annotation-aware rewriting, they are in effect reducing the selectivity of the rewriting,
thereby having a more significant impact on the queries with high selectivity.

[Figure 7.4: one plot of Overhead (y-axis) against Size in GB (x-axis, 0 to 20), with two series: Query 1 with annotations and Query 1 without annotations.]
Figure 7.4: Size of the inconsistent database vs. overhead of the rewritings that exploit
and do not exploit annotations for Query 1 (for an instance where p = 5% and n = 2).

In Figure 7.4, we focus on Query 1, and we compare the overhead of the annotation-
aware rewriting with the rewriting which does not exploit annotations. As in the previous
figure, we fix the degree of inconsistency to 5% of the database (p = 5%) and the number
of conflicts per inconsistent key value to 2 (n = 2). The size of the databases (reported on
the x-axis) ranges, as before, from 1 to 20 GB. The databases correspond to the scenario
where indices are created for the key attributes and the annotations. On the y-axis, we
report the overhead of the queries, computed as we explained above.
It can be observed that we get a substantial gain by exploiting the annotations. For
example, on the 20 GB instance, the overhead of the rewriting which does not exploit
annotations is 5.8, whereas the overhead of the annotation-aware rewriting is 3.3. That
is, exploiting the annotations brings the overhead down to about 57% of its value without
annotations (3.3/5.8), a reduction of about 43%.
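The two overheads reported for the 20 GB instance can be compared directly. The sketch below (the function name is an assumption made for illustration) divides the overhead with annotations by the overhead without them. The quotient can be read as a ratio of running times only if the original query takes about the same time in both index configurations, which is an assumption and is not stated in the text.

```python
def overhead_ratio(with_annotations: float, without_annotations: float) -> float:
    """Overhead of the annotation-aware rewriting relative to the plain rewriting."""
    if without_annotations <= 0:
        raise ValueError("overhead without annotations must be positive")
    return with_annotations / without_annotations


# Overheads reported for Query 1 on the 20 GB instance.
ratio = overhead_ratio(3.3, 5.8)
print(f"{ratio:.2f}")
```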
Finally, we performed experiments on databases where indices are created by follow-
ing the suggestions of DB2’s Configuration Advisor, in addition to the indices on the
key attributes. In Figure 7.5, we report the overhead of the eleven rewritten queries on
a number of databases where we fix the degree of inconsistency to 5% of the database
(p = 5%), and 2 conflicts per inconsistent key value (n = 2). The size of the databases
(reported on the x-axis) ranges from 1 to 20 GB. On the y-axis, we report the overhead
of the rewritten queries, computed as explained above. The rewritings do not exploit
annotations. The indices suggested by the Configuration Advisor are shown in the ap-
pendix.
For presentation purposes, we present the queries in three graphs. Notice the different
(linear) scales of the graphs. The graphs at the top and center show queries with low
overhead, whereas the one at the bottom shows queries where the overhead is much
higher.
In the graph at the bottom of Figure 7.5, we can observe a sharp spike in the overhead
of Query 14 on the 5 GB database. The overhead jumps from 2.1 on the 3 GB database
to 25.5 on the 5 GB database; and then decreases to 3.2 on the 10 GB database. This is
due to an index of the 5 GB database that is particularly beneficial to the original query.
This is an index on the lineitem relation and on attributes (l_shipdate, l_discount,
l_extendedprice, l_partkey). The index is not present on any of the other databases.
There is a similar situation for Query 6. In this case, the overhead jumps from 5.1 on the
2 GB database to 31.2 on the 3 GB database. The overhead stays high on the 5 and 10
GB databases (28.3 and 33.5, respectively) and finally decreases sharply on the 20 GB
database to a value of 2.8.
For Queries 8, 9, and 19 we observe the opposite behavior: the overhead lies below
one. That is, the indices benefit the rewritten query considerably more than the
original query. This is most noticeable on Query 9, whose overhead is 0.05 on the 2 GB
database, and 0.04 on the 5 GB database. Notice that the overhead behaves differently
on the 1, 3, and 10 GB databases, where the original query runs faster than the rewritten
query (the overhead is above 1). We do not report the overhead for the 20 GB database
since, as occurred in the scenario with indices only on the key attributes, the original query times out.
Excluding the above exceptions, the overhead of all queries is comparable with the
overhead in the scenario where there are indices only on the key attributes. For example,
on the 20 GB database, the overhead of all queries ranges from 1.2 (Query 19) to 5.8
(Query 1) on the databases with indices only on the key attributes, and from 1.06 (Query 19) to 5.4
(Query 1) on the databases with indices suggested by the Configuration Advisor.
[Figure 7.5: plots of Overhead (y-axis) against Size in GB (x-axis, 0 to 20). The recoverable panels show Queries 3, 4, 7, 8 and Queries 1, 6, 14.]
Figure 7.5: Size of the inconsistent database vs. overhead (running time of rewritten
query over running time of original query) for p = 5% and n = 2 using indices suggested
by Configuration Advisor
7.2.2 Effect of Degree of Inconsistency
In this subsection, we study the effect of the degree of inconsistency of the databases on
the performance of ConQuer’s rewritings. We consider the two parameters that determine
the degree of inconsistency: the percentage of the database being inconsistent (p), and
the number of conflicts per inconsistent key value (n).
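To make the two parameters concrete, the following Python sketch measures them on a small in-memory relation. It assumes that p is the percentage of tuples whose key value is shared with at least one other tuple, and that n is the average number of tuples per inconsistent key value (so n = 1 for a consistent database). Both readings, as well as the function and variable names, are assumptions made for illustration and may differ from the data generator used in the experiments.

```python
from collections import Counter


def inconsistency_profile(rows, key_of):
    """Return (p, n) for a list of tuples, given a function that extracts the key."""
    counts = Counter(key_of(row) for row in rows)
    # Key values that occur in more than one tuple violate the key constraint.
    conflicting = {key: count for key, count in counts.items() if count > 1}
    inconsistent_tuples = sum(conflicting.values())
    p = 100.0 * inconsistent_tuples / len(rows) if rows else 0.0
    n = inconsistent_tuples / len(conflicting) if conflicting else 1.0
    return p, n


# 38 tuples with distinct keys plus one key value that occurs twice.
rows = [(k, "v") for k in range(38)] + [(100, "a"), (100, "b")]
print(inconsistency_profile(rows, key_of=lambda row: row[0]))
```

On this toy relation, 2 of the 40 tuples share a key value, which corresponds to p = 5% and n = 2 under the stated assumptions.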
In Figure 7.6, we report the overhead of the eleven queries on a number of databases
where we fix the size to 3 GB and the number of inconsistencies per key value to 2 (n = 2).
The percentage of inconsistency of the databases (reported on the x-axis) ranges from
0 (totally consistent database) to 25% (a quarter of the database being inconsistent).
On the y-axis, we report the overhead of the rewritten queries, computed as the ratio
of the running time of the rewritten query to the running time of the original
(non-rewritten) query. The rewritings reported in the figure do not exploit annotations
(i.e., they are unaware of annotations, if any). All the databases correspond to the
scenario where indices are created only for the key attributes.
We observed that the overhead is not considerably influenced by the percentage of
inconsistency. This is reasonable since in the rewriting we do not make a distinction
between tuples that violate or satisfy the constraints. In the figure, we can see an
anomaly for Query 14, with its overhead sharply decreasing from 0 to 1%, and then
sharply increasing from 1 to 5%. The reason for this is that, for the rewritten query
and the 1% inconsistent database, DB2 chooses a different plan. In particular, for the
rewritten query on all databases except the 1% inconsistent, DB2 chooses a plan that
includes one table scan of the lineitem relation and a join that accesses lineitem
through its clustered index. For the 1% inconsistent database, DB2 chooses a different
plan that involves two table scans of lineitem and the application of a low selectivity
predicate in each case. In this case, the alternative plan turns out to be a good choice:
the overhead becomes lower than in the other cases.
In Figure 7.7, we turn our attention to the number of conflicts per inconsistent key
value. In particular, we report the overhead of the eleven queries on a number of databases
where we fix the size to 1 GB and the percentage of inconsistency to 5% (p = 5%). The
number of conflicts per inconsistent key value (reported on the x-axis) ranges from 1
(totally consistent database) to 7. On the y-axis, we report the overhead of the rewritten
queries, computed as in the other figures. The rewritings considered in the figure do not
exploit annotations.
[Figure 7.6: three plots of Overhead (y-axis) against Percentage of inconsistency (x-axis, 0 to 25). Top: Queries 1, 3, 4, 6. Middle: Queries 7, 8, 9, 10. Bottom: Queries 12, 14, 19.]
Figure 7.6: Percentage of inconsistency vs. overhead (running time of rewritten query
over running time of original query) for instances of 3 GB and n = 2
As with the percentage of inconsistency, we observed that the number of conflicts per
key value does not have a considerable effect on the overhead of the rewritten queries.
The only exception is Query 9, where the overhead decreases significantly as the number
of conflicts increases. We detected that this is because DB2’s optimizer was choosing
different plans on different instances. In particular, the plan chosen for the original
query on the database where n = 7 is so inefficient that it runs more slowly than the
corresponding rewritten query (and, hence, the overhead falls below 1).
[Figure 7.7: three plots of Overhead (y-axis) against Number of inconsistent tuples per key value (x-axis, 1 to 7). Top: Queries 1, 3, 4, 6. Middle: Queries 7, 8, 9, 10. Bottom: Queries 12, 14, 19.]
Figure 7.7: Number of conflicts per inconsistent key value (n) vs. overhead (running
time of rewritten query over running time of original query) for instances of 1 GB and
p = 5%
Chapter 8
Conclusions and Future Work
In this thesis, we presented ConQuer, a system for query answering over inconsistent
databases. We showed the correctness of ConQuer’s rewritings for a broad class of Select-
Project-Join queries with set and bag semantics, and with grouping and aggregation. We
also showed the maximality of the class of queries from a complexity-theoretic point of
view. The efficiency and scalability of the approach were empirically validated with an
extensive set of experiments on a commercial database system.
The assumptions of our work can be relaxed in different directions. For example,
we assumed that the set of constraints that might be violated consists exclusively of
key dependencies. It would be interesting to consider foreign key dependencies as well.
In this way, we would be covering the most common constraints that are supported by
commercial database systems. We are also interested in other constraints, for example
constraints arising from business rules (e.g., a rule saying that a car insurance policy
cannot be held by people who are younger than 18 years old). Regarding the data
model, ConQuer currently works on relational databases. An obvious extension is to
provide support for semi-structured data, such as XML documents. With respect to
queries, we would like to support more expressive query languages, where queries may
have disjunction and negation. We note that this direction of research has been started
recently by Lembo, Rosati, and Ruzzi [LRR06], who extend our class Cforest to consider
unions of conjunctive queries.
In this work, we provide exact algorithms that compute all the consistent answers
to a query. We would also like to explore approximation algorithms [Vaz01]. For ex-
ample, we could compute results where some consistent answers may be missing. For
Select-Project-Join queries, we could give a formal guarantee on the number of potentially
missing tuples. For queries with aggregation, we could also give formal guarantees
about the ranges for the aggregate functions. An interesting question is whether the
query rewriting algorithms used by ConQuer can be used as a building block of the
approximation algorithms.
It is easy to see that, in general, queries under the consistent answers semantics do
not compose. That is, the consistent answers of a first query cannot be used to compute
the consistent answers of other queries. However, it may be possible to produce auxiliary
data when executing the first query that could be used in turn to obtain the result of other
queries. We would like to characterize what kind of auxiliary information is necessary for
the composition of different classes of queries. One application of these results would be
for OLAP queries [CD97], where the computation of, e.g., roll-up operations is usually
done by composing queries.
ConQuer currently deals with inconsistencies that occur after the source data has
been transformed to conform to the schema of the integrated database. The problem of
creating the integrated database is called data exchange, and has recently been formalized
by Fagin, Kolaitis, Miller, and Popa [FKMP05]. In this framework, we are given a
source schema, a target schema, and a mapping, which is a declarative specification of
a transformation. Mappings are unidirectional in the data exchange framework, going
from the source to the target schema. The goal is, given a source database, to materialize
a target database that satisfies the mapping. We, together with other authors, have
proposed a generalization of the data exchange framework, called peer data exchange
[FKMT06], where the mapping may be bidirectional (source-to-target and target-to-
source). An important problem in the context of peer data exchange is the existence-
of-solutions problem, which consists of deciding whether it is actually possible to obtain
a target database that satisfies the mapping. Interestingly, the problem of computing
consistent answers under key constraints can be reduced to the existence-of-solutions
problem in the context of peer data exchange, where the key constraints are encoded in
the mapping [Fux04]. This reduction may contribute to the potential application of the
techniques presented in this thesis to the context of peer data exchange.
ConQuer provides an interface that enables the user to gradually clean the database.
In particular, when a query is submitted, the system shows the clean answers together
with a query explanation. The explanation can be extremely valuable, since it often
points to underlying errors in the database that require attention from the user. For key
constraints, the only actions that a user may perform are deleting or modifying tuples.
However, if other constraints are covered in the future, the explanations could trigger
more complex transformations on the database. There are interesting questions as to
how to specify such transformations using, for example, Extract-Transform-Load tools.
Nowadays, there are mature data integration and database management products
in the market. In our opinion, these products should be tightly coupled, with data
integration tools producing databases that are potentially inconsistent and precisely
characterizing the inconsistency, and database systems exploiting the knowledge about the
inconsistencies to produce better answers. We expect the results in this thesis to be an
initial step in this direction.
Bibliography
[ABC99] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in incon-
sistent databases. In Symposium on Principles of Database Systems (PODS),
pages 68–79, 1999.
[ABC00] M. Arenas, L. Bertossi, and J. Chomicki. Specifying and querying database
repairs using logic programs with exceptions. In International Conference
on Flexible Query Answering Systems, pages 27–41, 2000.
[ABC03a] M. Arenas, L. Bertossi, and J. Chomicki. Answer sets for consistent query
answering in inconsistent databases. Theory and Practice of Logic Program-
ming, 3(4-5):392–424, 2003.
[ABC+03b] M. Arenas, L. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. Spinrad.
Scalar Aggregation in Inconsistent Databases. Theoretical Computer Science,
296:405–434, 2003.
[AD98] S. Abiteboul and O. M. Duschka. Complexity of answering queries using ma-
terialized views. In Symposium on Principles of Database Systems (PODS),
pages 254–263, 1998.
[AFM06] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty
databases: a probabilistic approach. In International Conference on Data
Engineering (ICDE), 2006. Paper 30.
[AHV95] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-
Wesley, 1995.
[AKG87] S. Abiteboul, P. Kanellakis, and G. Grahne. On the representation and
querying of sets of possible worlds. In ACM International Conference on the
Management of Data (SIGMOD), pages 34–48, 1987.
[AKWS95] S. Agarwal, A. Keller, G. Wiederhold, and K. Saraswat. Flexible relation:
An approach for the integration of data from multiple, possible inconsistent
databases. In International Conference on Data Engineering (ICDE), pages
495–504, 1995.
[ATMS04] P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. Limbo: Scalable
clustering of categorical data. In International Conference on Extending
Database Technology (EDBT), pages 123–146, 2004.
[Bal91] B. Balzer. Tolerating inconsistency. In International Conference on Software
Engineering (ICSE), pages 158–165, 1991.
[BB03a] P. Barcelo and L. Bertossi. Logic programs for querying inconsistent
databases. In International Symposium on Practical Aspects of Declarative
Languages, pages 208–222, 2003.
[BB03b] L. Bravo and L. Bertossi. Logic programs for consistently querying data inte-
gration systems. In International Joint Conference on Artificial Intelligence
(IJCAI), pages 10–15, 2003.
[BBFL05] L. Bertossi, L. Bravo, E. Franconi, and A. Lopatenko. Fixing numerical at-
tributes under integrity constraints. In International Symposium on Database
Programming Languages (DBPL), pages 262–278, 2005.
[BC03] L. Bertossi and J. Chomicki. Logics for Emerging Applications of Databases,
chapter Query Answering in Inconsistent Databases, pages 43–83. Springer,
2003.
[Ber06] L. Bertossi. Consistent query answering in databases. ACM SIGMOD
Record, 35(2):68–76, 2006. Database Principles column.
[BKT01] P. Buneman, S. Khanna, and W. Tan. Why and where: A characterization of
data provenance. In International Conference on Database Theory (ICDT),
pages 316–330, 2001.
[BMFR05] P. Bohannon, F. Michael, W. Fan, and R. Rastogi. A cost-based model and
effective heuristic for repairing constraints by value modification. In ACM
International Conference on the Management of Data (SIGMOD), pages
143–154, 2005.
[BMP92] D. Barbara, H. Garcia Molina, and D. Porter. The management of probabilistic
data. IEEE Transactions on Knowledge and Data Engineering (TKDE),
4:487–502, 1992.
[CB00] A. Celle and L. Bertossi. Querying inconsistent databases: Algorithms and
implementation. In Computational Logic (CL), pages 942–956, 2000.
[CB05] M. Caniupan and L. Bertossi. Optimizing repair programs for consistent
query answering. In International Conference of the Chilean Computer Sci-
ence Society, pages 3–12, 2005.
[CD97] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP
technology. ACM SIGMOD Record, 26(1):65–74, 1997.
[CLR03a] A. Calì, D. Lembo, and R. Rosati. On the decidability and complexity of
query answering over inconsistent and incomplete databases. In Symposium
on Principles of Database Systems (PODS), pages 260–271, 2003.
[CLR03b] A. Calì, D. Lembo, and R. Rosati. Query rewriting and answering under
constraints in data integration systems. In International Joint Conference
on Artificial Intelligence (IJCAI), pages 16–21, 2003.
[CM77] A. Chandra and P. Merlin. Computable queries for relational databases. In
ACM Symposium on the Theory of Computing (STOC), pages 77–90, 1977.
[CM05] J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance
using tuple deletions. Information and Computation, 197(1-2):90–121, 2005.
[CMS04a] J. Chomicki, J. Marcinkowski, and S. Staworko. Computing Consistent
Query Answers using Conflict Hypergraphs. In International Conference
on Information and Knowledge Management (CIKM), pages 417–426, 2004.
[CMS04b] J. Chomicki, J. Marcinkowski, and S. Staworko. Hippo: A System for Com-
puting Consistent Answers to a Class of SQL Queries. In International Con-
ference on Extending Database Technology (EDBT), pages 841–844, 2004.
[CNS99] S. Cohen, W. Nutt, and A. Serebrenik. Rewriting aggregate queries using
views. In Symposium on Principles of Database Systems (PODS), pages
155–166, 1999.
[CNS03] S. Cohen, W. Nutt, and Y. Sagiv. Containment of aggregate queries. In
International Conference on Database Theory (ICDT), pages 111–125, 2003.
[CP87] R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In
International Conference on Very Large Databases (VLDB), pages 71–81,
1987.
[CV93] S. Chaudhuri and M. Vardi. Optimization of real conjunctive queries. In
Symposium on Principles of Database Systems (PODS), pages 59–70, 1993.
[CW03] Y. Cui and J. Widom. Lineage tracing for general data warehouse transfor-
mations. Very Large Databases (VLDB) Journal, 12(1):41–58, 2003.
[DeM89] L. DeMichiel. Resolving database incompatibility: An approach to perform-
ing relational operations over mismatched domains. In IEEE Transactions
on Knowledge and Data Engineering (TKDE), pages 485–493, 1989.
[DJ03] T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John
Wiley, 2003.
[DS04] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases.
In International Conference on Very Large Databases (VLDB), pages 864–
875, 2004.
[EFGL03] T. Eiter, M. Fink, G. Greco, and D. Lembo. Efficient Evaluation of Logic
Programs for Querying Data Integration Systems. In International Confer-
ence on Logic Programming (ICLP), pages 163–177, 2003.
[FFM05a] A. Fuxman, E. Fazli, and R. J. Miller. ConQuer: Efficient management of in-
consistent databases. In ACM International Conference on the Management
of Data (SIGMOD), pages 155–166, 2005.
[FFM05b] A. Fuxman, D. Fuxman, and R. J. Miller. ConQuer: A system for effi-
cient querying over inconsistent databases. International Conference on Very
Large Databases (VLDB), pages 1354–1357, 2005.
[FFP05] S. Flesca, F. Furfaro, and F. Parisi. Consistent query answers on numeri-
cal databases under aggregate constraints. In International Symposium on
Database Programming Languages (DBPL), pages 279–294, 2005.
[FKMP05] R. Fagin, P. Kolaitis, R. Miller, and L. Popa. Data exchange: semantics and
query answering. Theoretical Computer Science, 336(1):89–124, 2005.
[FKMT06] A. Fuxman, P. Kolaitis, R. J. Miller, and W. Tan. Peer data exchange. ACM
Transactions on Database Systems, 2006. To appear in a special issue with
selected papers from PODS 2005.
[FM05] A. Fuxman and R. J. Miller. First-order query rewriting for inconsistent
databases. In International Conference on Database Theory (ICDT), pages
337–351, 2005.
[FM06] A. Fuxman and R. J. Miller. First-order query rewriting for inconsistent
databases. Journal of Computer and System Sciences (JCSS), 2006. To
appear.
[FPL+01] E. Franconi, A. Laureti Palma, N. Leone, S. Perri, and F. Scarcello. Census
data repair: a challenging application of disjunctive logic programming. In
Logic for Programming, Artificial Intelligence, and Reasoning (LPAR), pages
561–578, 2001.
[FR97] N. Fuhr and T. Rolleke. A probabilistic relational algebra for the integra-
tion of information retrieval and database systems. ACM Transactions on
Information Systems, 15(1):32–66, 1997.
[Fux04] A. Fuxman. A survey of the applications of schema mapping and the certain
answers semantics. Technical Report CSRG-541, University of Toronto, 2004.
Available at ftp://ftp.cs.toronto.edu/cs/ftp/pub/reports/csrg/541.
[GGZ01] G. Greco, S. Greco, and E. Zumpano. A logic programming approach to
the integration, repairing and querying of inconsistent databases. In Inter-
national Conference on Logic Programming (ICLP), pages 348–364, 2001.
[GLRR05] L. Grieco, D. Lembo, R. Rosati, and M. Ruzzi. Consistent query answer-
ing under key and exclusion dependencies: Algorithms and experiments.
In International Conference on Information and Knowledge Management
(CIKM), pages 792–799, 2005.
[GM96] S. Grumbach and T. Milo. Towards tractable algebras for bags. In Journal
of Computer and System Sciences (JCSS), volume 52, pages 570–588, 1996.
[GR95] P. Gardenfors and H. Rott. Handbook of Logic in Artificial Intelligence and
Logic Programming, volume 4, chapter Belief Revision, pages 35–132. Oxford
University Press, 1995.
[GRT99] S. Grumbach, M. Rafanelli, and L. Tininini. Querying aggregate data.
In Symposium on Principles of Database Systems (PODS), pages 174–184,
1999.
[GZ00] S. Greco and E. Zumpano. Querying inconsistent databases. In Logic for
Programming, Artificial Intelligence, and Reasoning (LPAR), pages 308–325,
2000.
[HK75] J. Hopcroft and R. M. Karp. An O(n^2.5) algorithm for maximum matching
in bipartite graphs. SIAM Journal on Computing, 2:225–231, 1975.
[HLNW01] L. Hella, L. Libkin, J. Nurmonen, and L. Wong. Logics with aggregate
operators. Journal of the ACM, 48(4):880–907, 2001.
[IR95] Y. Ioannidis and R. Ramakrishnan. Containment of conjunctive queries: Be-
yond relations as sets. ACM Transactions on Database Systems, 20(3):288–
324, 1995.
[ISO01] ISO. SQL - part 2: Foundation (SQL/Foundation) - amendment 1: On-line
analytical processing (SQL/OLAP). Technical Report 9075-2-1999/Amd1-
2001, INCITS/ISO/IEC, 2001.
[IvdMV95] T. Imielinski, R. van der Meyden, and K. Vadaparty. Complexity tailored de-
sign: A new design methodology for databases with incomplete information.
Journal of Computer and System Sciences (JCSS), 51(3):405–432, 1995.
[Lad75] R. E. Ladner. On the structure of polynomial time reducibility. Journal of
the ACM, 22(1):155–171, 1975.
[Lev81] H. Levesque. A Formal Treatment of Incomplete Knowledge Bases. PhD
thesis, University of Toronto, 1981.
[Lip79] W. Lipski. On semantic issues connected with incomplete information
databases. ACM Transactions on Database Systems, 4(3):262–296, 1979.
[Lip81] W. Lipski. On databases with incomplete information. Journal of the ACM,
28(1):41–70, 1981.
[LLR02] D. Lembo, M. Lenzerini, and R. Rosati. Source inconsistency and incom-
pleteness in data integration. In International Workshop on Knowledge Rep-
resentation meets Databases (KRDB), 2002.
[LLRS97] L. Lakshmanan, N. Leone, R. Ross, and V. Subrahmanian. Probview: A flex-
ible probabilistic database system. ACM Transactions on Database Systems,
22(3):419–469, 1997.
[LM96] J. Lin and A. Mendelzon. Merging databases under constraints. International
Journal of Cooperative Information Systems, 7(1):55–76, 1996.
[LRR06] D. Lembo, R. Rosati, and M. Ruzzi. On the first-order reducibility of unions
of conjunctive queries over inconsistent databases. In International Work-
shop on Inconsistency and Incompleteness in Databases, pages 17–32, 2006.
[LW95] L. Libkin and L. Wong. On representation and querying incomplete informa-
tion in databases with bags. Information Processing Letters, 56(4):209–214,
1995.
[LW97] L. Libkin and L. Wong. Query languages for bags and aggregate functions.
Journal of Computer and System Sciences (JCSS), 55(2):241–272, 1997.
[Moo85] R. Moore. Formal Theories of the Commonsense World, chapter A Formal
Theory of Knowledge and Action, pages 319–358. 1985.
[NER00] B. Nuseibeh, S. Easterbrook, and A. Russo. Leveraging inconsistency in
software development. IEEE Computer, 33(4):24–29, 2000.
[Ost70] P. Ostrand. Systems of distinct representatives. Journal of Mathematical
Analysis and Applications, 32:1–4, 1970.
[TPC03] Transaction Processing Performance Council: TPC. TPC Benchmark H
(Decision Support). Standard Specification Revision 2.1.0, 2003.
[Vaz01] V. Vazirani. Approximation Algorithms. Springer, 2001.
[vdM98] R. van der Meyden. Logical approaches to incomplete information: A survey.
In Logics for Databases and Information Systems, pages 307–356. Kluwer,
1998.
[Wij05] J. Wijsen. Database repairing using updates. ACM Transactions on
Database Systems, 30(3):722–768, 2005.
Appendix A
TPC-H Queries and their Rewritings
The following are the queries from the TPC-H standard [TPC03] that we employed in
our experiments, together with their rewritings.
TPC-H Query 1
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date('1998-12-01') - 90 DAYS
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
Rewritten Query 1
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
lineitem
where
l_shipdate <= date('1998-12-01') - 90 DAYS
),
contribAllSubQuery as (
select
l_returnflag,
l_linestatus,
max(l_quantity) as max_qty,
max(l_extendedprice) as max_extendedprice,
max(l_extendedprice * (1 - l_discount)) as max_disc_price,
max(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as max_charge,
max(l_discount) as max_disc,
min(l_quantity) as min_qty,
min(l_extendedprice) as min_extendedprice,
min(l_extendedprice * (1 - l_discount)) as min_disc_price,
min(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as min_charge,
min(l_discount) as min_disc,
condWhereViol,
condWhereSat,
max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj
from
(select
l_orderkey,
l_linenumber,
l_returnflag,
l_linestatus,
l_quantity,
l_extendedprice,
l_discount,
l_tax,
rank() over (partition by l_orderkey,l_linenumber
order by l_returnflag,l_linestatus)
as rankProj,
sum(case
when l_shipdate <= date('1998-12-01') - 90 days then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as condWhereViol,
case
when l_shipdate <= date('1998-12-01') - 90 days then 1 else 0 end
as condWhereSat
from lineitem li
where
exists (select * from candidatesSubQuery sc
where li.l_orderkey=sc.l_orderkey and
li.l_linenumber=sc.l_linenumber)
) q
where condWhereSat = 1
group by l_orderkey,l_linenumber,l_returnflag,l_linestatus,condWhereViol,
condWhereSat,rankProj),
contribConsistentSubQuery as (
select
l_returnflag,
l_linestatus,
max_qty,
max_extendedprice,
max_disc_price,
max_charge,
max_disc,
min_qty,
min_extendedprice,
min_disc_price,
min_charge,
min_disc,
1 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol = 0 and countProj=1),
contribNonConsistentSubQuery as (
select
l_returnflag,
l_linestatus,
max_qty,
max_extendedprice,
max_disc_price,
max_charge,
max_disc,
0 as min_qty,
0 as min_extendedprice,
0 as min_disc_price,
0 as min_charge,
0 as min_disc,
0 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol >= 1 or countProj > 1)
select
l_returnflag,
l_linestatus,
sum(max_qty) as max_sum_qty,
sum(max_extendedprice) as max_sum_base_price,
sum(max_disc_price) as max_sum_disc_price,
sum(max_charge) as max_sum_charge,
sum(max_qty)/sum(countConsistent) as max_avg_qty,
sum(max_extendedprice)/sum(countConsistent) as max_avg_price,
sum(max_disc)/sum(countConsistent) as max_avg_disc,
count(*) as max_count_order,
sum(min_qty) as min_sum_qty,
sum(min_extendedprice) as min_sum_base_price,
sum(min_disc_price) as min_sum_disc_price,
sum(min_charge) as min_sum_charge,
sum(min_qty)/sum(countConsistent) as min_avg_qty,
sum(min_extendedprice)/sum(countConsistent) as min_avg_price,
sum(min_disc)/sum(countConsistent) as min_avg_disc,
sum(countConsistent) as min_count_order
from
(select * from contribConsistentSubQuery
union all
select * from contribNonConsistentSubQuery) q
group by
l_returnflag,
l_linestatus
having sum(countConsistent)>0
order by
l_returnflag,
l_linestatus;
TPC-H Query 3
select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < '1995-03-15'
and l_shipdate > '1995-03-15'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
fetch first 10 rows only;
Rewritten Query 3
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < '1995-03-15'
and l_shipdate > '1995-03-15'
),
contribAllSubQuery as (
select
l_orderkey,
l_linenumber,
o_orderdate,
o_shippriority,
min(l_extendedprice * (1 - l_discount)) as min_revenue,
max(l_extendedprice * (1 - l_discount)) as max_revenue,
1 as min_count,
max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj,
cond_viol,
cond_sat
from
(select
l_orderkey,
l_linenumber,
o_orderdate,
o_shippriority,
l_extendedprice,
l_discount,
rank() over (partition by l_orderkey,l_linenumber
order by o_orderdate,o_shippriority)
as rankProj,
sum(case
when c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and o_orderdate < '1995-03-15'
and l_shipdate > '1995-03-15'
then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as cond_viol,
case when c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and o_orderdate < '1995-03-15'
and l_shipdate > '1995-03-15'
then 1 else 0 end as cond_sat
from orders o1 JOIN lineitem l ON l_orderkey = o1.o_orderkey
LEFT OUTER JOIN customer ON c_custkey=o1.o_custkey
where
exists (select * from candidatesSubQuery sc
where l.l_orderkey=sc.l_orderkey and l.l_linenumber=sc.l_linenumber)
) q
where cond_sat=1
group by
l_orderkey,
l_linenumber,
o_orderdate,
o_shippriority,
cond_viol,cond_sat,rankProj
),
contribConsistentSubQuery as (
select l_orderkey,
l_linenumber,
o_orderdate,
o_shippriority,
min_revenue,
max_revenue,
min_count
from
contribAllSubQuery Cand
where
countProj = 1 and cond_viol=0),
contribNonConsistentSubQuery as (
select
l_orderkey,
l_linenumber,
o_orderdate,
o_shippriority,
0 as min_revenue,
max_revenue,
0 as min_count
from
contribAllSubQuery Cand
where
countProj > 1 or cond_viol >= 1)
select
l_orderkey,
o_orderdate,
o_shippriority,
sum(min_revenue) as sum_min_revenue,
sum(max_revenue) as sum_max_revenue
from
(select * from contribNonConsistentSubQuery
union all
select * from contribConsistentSubQuery) as q
group by
l_orderkey,
o_orderdate,
o_shippriority
having sum(min_count)>0
order by
sum_min_revenue desc,
o_orderdate
fetch first 10 rows only;
TPC-H Query 4
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= '1993-07-01'
and o_orderdate < date('1993-07-01') + 3 MONTHS
and exists (
select *
from
lineitem
where
l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;
Rewritten Query 4
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
orders, lineitem
where
o_orderdate >= '1993-07-01'
and o_orderdate < date('1993-07-01') + 3 MONTHS
and l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
),
contribAllSubQuery as (
select
l_orderkey,
l_linenumber,
o_orderpriority,
1 as min_count,
max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj,
cond_viol,
cond_sat
from
(select
l_orderkey,
l_linenumber,
o_orderpriority,
rank() over (partition by l_orderkey,l_linenumber
order by o_orderpriority) as rankProj,
sum(case
when l_commitdate < l_receiptdate and
o_orderdate >= '1993-07-01'
and o_orderdate < date('1993-07-01') + 3 MONTHS
then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as cond_viol,
case when l_commitdate < l_receiptdate and
o_orderdate >= '1993-07-01'
and o_orderdate < date('1993-07-01') + 3 MONTHS
then 1 else 0 end as cond_sat
from orders, lineitem li
where
l_orderkey = o_orderkey
and
exists (select * from candidatesSubQuery sc
where li.l_orderkey=sc.l_orderkey and li.l_linenumber=sc.l_linenumber)
) q
where cond_sat=1
group by
l_orderkey,
l_linenumber,
o_orderpriority,
cond_viol,cond_sat,rankProj),
contribConsistentSubQuery as (
select l_orderkey,
l_linenumber,
o_orderpriority,
min_count
from
contribAllSubQuery Cand
where
countProj =1 and cond_viol=0),
contribNonConsistentSubQuery as (
select
l_orderkey,
l_linenumber,
o_orderpriority,
0 as min_count
from
contribAllSubQuery Cand
where
countProj > 1 or cond_viol >= 1)
select
o_orderpriority,
count(*) as max_order_count,
sum(min_count) as min_order_count
from
(select * from contribNonConsistentSubQuery
union all
select * from contribConsistentSubQuery) as q
group by
o_orderpriority
having sum(min_count)>0
order by
o_orderpriority;
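The count bounds produced by Rewritten Query 4 can be illustrated on a toy instance. The following is only a minimal sketch under stated assumptions, not the rewriting itself: the table item(k, g, qty), the function count_range, and the output names lower_count and upper_count are hypothetical; it runs on SQLite through Python's sqlite3 module rather than on the DB2-style SQL used in this appendix; and it replaces the rank() window function by count(distinct g), which equals 1 exactly when all tuples sharing a key agree on the grouping attribute.

```python
import sqlite3

# Hypothetical toy schema: item(k, g, qty). k is the postulated key (it may be
# violated), g is the grouping attribute, and the selection is qty < :t.
SQL = """
with perKey as (
    select k,
           count(distinct g) as countProj,
           sum(case when qty < :t then 0 else 1 end) as cond_viol
    from item
    group by k
),
contrib as (
    -- one row per (key, group) pair that has at least one satisfying tuple
    select distinct i.k, i.g,
           case when p.countProj = 1 and p.cond_viol = 0
                then 1 else 0 end as min_count
    from item i join perKey p on p.k = i.k
    where i.qty < :t
)
select g,
       sum(min_count) as lower_count,
       count(*) as upper_count
from contrib
group by g
having sum(min_count) > 0
order by g
"""


def count_range(rows, threshold):
    """Return [(g, lower_count, upper_count)] for rows of the form (k, g, qty)."""
    con = sqlite3.connect(":memory:")
    con.execute("create table item (k integer, g text, qty integer)")
    con.executemany("insert into item values (?, ?, ?)", rows)
    result = con.execute(SQL, {"t": threshold}).fetchall()
    con.close()
    return result
```

A key counts towards the lower bound of a group only if all of its tuples satisfy the selection and agree on g; it counts towards the upper bound of every group in which it has some satisfying tuple. Groups whose lower bound is zero are dropped, mirroring the having clause of the rewritten query.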
TPC-H Query 6
select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= '1994-01-01'
and l_shipdate < date('1994-01-01') + 1 YEAR
and l_discount >= 0.06 - 0.01
and l_discount <= 0.06 + 0.01
and l_quantity < 24;
Rewritten Query 6
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
lineitem
where
l_shipdate >= '1994-01-01'
and l_shipdate < date('1994-01-01') + 1 YEAR
and l_discount >= 0.06 - 0.01
and l_discount <= 0.06 + 0.01
and l_quantity < 24
),
contribAllSubQuery as (
select l_orderkey,
l_linenumber,
min(l_extendedprice * l_discount) as min_revenue,
max(l_extendedprice * l_discount) as max_revenue,
cond_viol,
cond_sat
from (
select
l_orderkey,
l_linenumber,
l_extendedprice,
l_discount,
sum(case
when l_shipdate >= '1994-01-01'
and l_shipdate < date('1994-01-01') + 1 YEAR
and l_discount >= 0.06 - 0.01
and l_discount <= 0.06 + 0.01
and l_quantity < 24
then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as cond_viol,
case when l_shipdate >= '1994-01-01'
and l_shipdate < date('1994-01-01') + 1 YEAR
and l_discount >= 0.06 - 0.01
and l_discount <= 0.06 + 0.01
and l_quantity < 24
then 1 else 0 end as cond_sat
from
lineitem li
where
exists (select * from candidatesSubQuery sc
where li.l_orderkey=sc.l_orderkey
and li.l_linenumber=sc.l_linenumber) ) q
where cond_sat=1
group by l_orderkey,
l_linenumber, cond_viol,cond_sat),
contribConsistentSubQuery as (
select l_orderkey,
l_linenumber,
min_revenue,
max_revenue
from
contribAllSubQuery Cand
where
cond_viol=0),
contribNonConsistentSubQuery as (
select l_orderkey,
l_linenumber,
0 as min_revenue,
max_revenue
from
contribAllSubQuery Cand
where cond_viol >= 1
)
select
sum(min_revenue) as min_sum_revenue,
sum(max_revenue) as max_sum_revenue
from
(select * from contribNonConsistentSubQuery
union all
select * from contribConsistentSubQuery) as q;
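The range computed by Rewritten Query 6 can likewise be checked on a small inconsistent instance. The sketch below is a simplification under stated assumptions rather than the rewriting above: it keeps only the condition l_quantity < :q, folds the candidate, consistent, and non-consistent subqueries into a single grouping per key, runs on SQLite through Python's sqlite3 module, and, like the rewriting, relies on the aggregated values being non-negative so that 0 is a valid lower contribution. The function name revenue_range is hypothetical.

```python
import sqlite3

# (l_orderkey, l_linenumber) is the postulated key and may be violated.
SQL = """
with perKey as (
    select l_orderkey, l_linenumber,
           min(case when l_quantity < :q
                    then l_extendedprice * l_discount end) as min_revenue,
           max(case when l_quantity < :q
                    then l_extendedprice * l_discount end) as max_revenue,
           sum(case when l_quantity < :q then 0 else 1 end) as cond_viol
    from lineitem
    group by l_orderkey, l_linenumber
    -- keep only candidate keys: at least one tuple satisfies the condition
    having max(case when l_quantity < :q then 1 else 0 end) = 1
)
select sum(case when cond_viol = 0 then min_revenue else 0 end),
       sum(max_revenue)
from perKey
"""


def revenue_range(rows, max_quantity):
    """Return (lower, upper) bounds of sum(price * discount) over all repairs.

    rows: tuples (l_orderkey, l_linenumber, l_extendedprice, l_discount,
    l_quantity). Returns (None, None) when no key has a satisfying tuple.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table lineitem (l_orderkey integer, l_linenumber integer, "
        "l_extendedprice real, l_discount real, l_quantity integer)"
    )
    con.executemany("insert into lineitem values (?, ?, ?, ?, ?)", rows)
    low, high = con.execute(SQL, {"q": max_quantity}).fetchone()
    con.close()
    return low, high
```

A key whose tuples all satisfy the condition contributes its smallest value to the lower bound; a key with at least one violating tuple contributes 0, because some repair may retain the violating tuple. Every candidate key contributes its largest satisfying value to the upper bound.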
TPC-H Query 7
select
supp_nation,
cust_nation,
l_year,
sum(volume) as revenue
from
(
select
n1.n_name as supp_nation,
n2.n_name as cust_nation,
year(l_shipdate) as l_year,
l_extendedprice * (1 - l_discount) as volume
from
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2
where
s_suppkey = l_suppkey
and o_orderkey = l_orderkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and (
(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
)
and l_shipdate >= '1995-01-01'
and l_shipdate <= '1996-12-31'
) as shipping
group by
supp_nation,
cust_nation,
l_year
order by
supp_nation,
cust_nation,
l_year;
Rewritten Query 7
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2
where
s_suppkey = l_suppkey
and o_orderkey = l_orderkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and (
(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
)
and l_shipdate >= '1995-01-01'
and l_shipdate <= '1996-12-31'
),
contribAllSubQuery as (
select
supp_nation,
cust_nation,
l_year,
min(volume) as low_revenue,
max(volume) as up_revenue,
condWhereSat,
condWhereViol,
max(rankProj) as countProj
from
(
select
l_orderkey,
l_linenumber,
n1.n_name as supp_nation,
n2.n_name as cust_nation,
year(l_shipdate) as l_year,
l_extendedprice * (1 - l_discount) as volume,
rank() over (partition by l_orderkey,l_linenumber
order by n1.n_name,n2.n_name,year(l_shipdate)) as rankProj,
sum(case
when (
s_suppkey = l_suppkey
and o_orderkey = l_orderkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and (
(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
)
and l_shipdate >= '1995-01-01'
and l_shipdate <= '1996-12-31')
then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as condWhereViol,
case
when (
s_suppkey = l_suppkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and (
(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
)
and l_shipdate >= '1995-01-01'
and l_shipdate <= '1996-12-31')
then 1 else 0 end as condWhereSat
from
lineitem li JOIN orders ON o_orderkey = l_orderkey
LEFT OUTER JOIN supplier ON s_suppkey = l_suppkey
LEFT OUTER JOIN nation n1 ON s_nationkey = n1.n_nationkey
LEFT OUTER JOIN customer ON c_custkey = o_custkey
LEFT OUTER JOIN nation n2 ON c_nationkey = n2.n_nationkey
where
exists
(select * from candidatesSubQuery sc
where li.l_orderkey=sc.l_orderkey
and li.l_linenumber=sc.l_linenumber)
) q
where condWhereSat=1
group by
l_orderkey,
l_linenumber,
supp_nation,
cust_nation,
l_year,
condWhereSat,condWhereViol),
contribConsistentSubQuery as (
select
supp_nation,
cust_nation,
l_year,
low_revenue,
up_revenue,
1 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol = 0 and countProj = 1),
contribNonConsistentSubQuery as (
select
supp_nation,
cust_nation,
l_year,
0 as low_revenue,
up_revenue,
0 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol >= 1 or countProj >1)
select
supp_nation,
cust_nation,
l_year,
sum(low_revenue) as low_sum_revenue,
sum(up_revenue) as up_sum_revenue
from
(select * from contribConsistentSubQuery
union all
select * from contribNonConsistentSubQuery) q
group by
supp_nation,
cust_nation,
l_year
having sum(countConsistent)>0
order by
supp_nation,
cust_nation,
l_year;
TPC-H Query 8
select
YEAR(o_orderdate) as o_year,
sum(case
when n2.n_name = 'BRAZIL' then l_extendedprice * (1 - l_discount)
else 0
end) / sum(l_extendedprice * (1 - l_discount)) as mkt_share
from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'AMERICA'
and s_nationkey = n2.n_nationkey
and o_orderdate >= '1995-01-01'
and o_orderdate <= '1996-12-31'
and p_type = 'ECONOMY ANODIZED STEEL'
group by
YEAR(o_orderdate)
order by
YEAR(o_orderdate);
Rewritten Query 8
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'AMERICA'
and s_nationkey = n2.n_nationkey
and o_orderdate >= '1995-01-01'
and o_orderdate <= '1996-12-31'
and p_type = 'ECONOMY ANODIZED STEEL'
),
contribAllSubQuery as (
select
o_year,
min(dividend) as low_dividend,
max(dividend) as up_dividend,
min(divisor) as low_divisor,
max(divisor) as up_divisor,
condWhereSat,
condWhereViol,
max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj
from
(
select
l_orderkey,
l_linenumber,
YEAR(o_orderdate) as o_year,
case
when n2.n_name = 'BRAZIL' then l_extendedprice * (1 - l_discount)
else 0
end as dividend,
l_extendedprice * (1 - l_discount) as divisor,
rank() over (partition by l_orderkey,l_linenumber
order by YEAR(o_orderdate)) as rankProj,
sum(case when
(p_partkey = l_partkey
and s_suppkey = l_suppkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'AMERICA'
and s_nationkey = n2.n_nationkey
and o_orderdate >= '1995-01-01'
and o_orderdate <= '1996-12-31'
and p_type = 'ECONOMY ANODIZED STEEL')
then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as condWhereViol,
case when
(p_partkey = l_partkey
and s_suppkey = l_suppkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'AMERICA'
and s_nationkey = n2.n_nationkey
and o_orderdate >= '1995-01-01'
and o_orderdate <= '1996-12-31'
and p_type = 'ECONOMY ANODIZED STEEL')
then 1 else 0 end as condWhereSat
from
lineitem li JOIN orders ON l_orderkey = o_orderkey
LEFT OUTER JOIN supplier ON s_suppkey = l_suppkey
LEFT OUTER JOIN nation n2 ON s_nationkey = n2.n_nationkey
LEFT OUTER JOIN part ON p_partkey = l_partkey
LEFT OUTER JOIN customer ON o_custkey = c_custkey
LEFT OUTER JOIN nation n1 ON c_nationkey = n1.n_nationkey
LEFT OUTER JOIN region ON n1.n_regionkey = r_regionkey
where
exists (select * from candidatesSubQuery sc
where li.l_orderkey=sc.l_orderkey
and li.l_linenumber=sc.l_linenumber)
) q
where
condWhereSat=1
group by
l_orderkey,l_linenumber,o_year,condWhereSat,condWhereViol,rankProj),
contribConsistentSubQuery as (
select
o_year,
low_dividend,
up_dividend,
low_divisor,
up_divisor,
1 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol = 0 and countProj=1),
contribNonConsistentSubQuery as (
select
o_year,
0 as low_dividend,
up_dividend,
0 as low_divisor,
up_divisor,
0 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol >= 1 or countProj > 1)
select o_year,
sum(low_dividend)/sum(up_divisor) as low_mkt_share,
sum(up_dividend)/sum(low_divisor) as up_mkt_share
from
(select * from contribConsistentSubQuery
union all
select * from contribNonConsistentSubQuery) q
group by o_year
having sum(countConsistent)>0
order by o_year;
TPC-H Query 9
select
n_name,
YEAR(o_orderdate) as o_year,
sum(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) as sum_profit
from
part,
supplier,
lineitem,
partsupp,
orders,
nation
where
s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%green%'
group by
n_name,
o_orderdate
order by
n_name,
o_orderdate desc;
Rewritten Query 9
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
part,
supplier,
lineitem,
orders,
nation
where
s_suppkey = l_suppkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%green%'
),
contribAllSubQuery as (
select
l_orderkey,
l_linenumber,
n_name as nation,
YEAR(o_orderdate) as o_year,
min(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) as min_profit,
max(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) as max_profit,
1 as min_count,
max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj,
cond_viol,
cond_sat
from
(select l_orderkey,
l_linenumber,
n_name,
o_orderdate,
l_extendedprice,
l_discount,
ps_supplycost,
l_quantity,
rank() over (partition by l_orderkey,l_linenumber
order by n_name,YEAR(o_orderdate)) as rankProj,
sum(case
when s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and s_nationkey = n_nationkey
and p_name like '%green%'
then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as cond_viol,
case when
s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and s_nationkey = n_nationkey
and p_name like '%green%'
then 1 else 0 end as cond_sat
from
lineitem l JOIN orders o1 ON l_orderkey=o1.o_orderkey
LEFT OUTER JOIN part ON p_partkey=l_partkey
LEFT OUTER JOIN supplier ON s_suppkey=l_suppkey
LEFT OUTER JOIN nation n1 ON n1.n_nationkey=s_nationkey
LEFT OUTER JOIN partsupp ON ps_partkey=l_partkey and ps_suppkey=l_suppkey
where
exists (select * from candidatesSubQuery sc
where l.l_orderkey=sc.l_orderkey
and l.l_linenumber=sc.l_linenumber)
) q
where cond_sat=1
group by
l_orderkey,
l_linenumber,
n_name,
o_orderdate,
cond_viol,cond_sat,rankProj),
contribConsistentSubQuery as (
select l_orderkey,
l_linenumber,
nation,
o_year,
min_profit,
max_profit,
min_count
from
contribAllSubQuery Cand
where
countProj = 1 and cond_viol=0),
contribNonConsistentSubQuery as (
select
l_orderkey,
l_linenumber,
nation,
o_year,
0 as min_profit,
max_profit,
0 as min_count
from
contribAllSubQuery Cand
where
countProj > 1 or cond_viol >= 1)
select
nation,
o_year,
sum(min_profit) as min_sum_profit,
sum(max_profit) as max_sum_profit
from
(select * from contribNonConsistentSubQuery
union all
select * from contribConsistentSubQuery) as q
group by
nation,
o_year
having sum(min_count)>0
order by
nation,
o_year desc;
TPC-H Query 10
select
c_custkey,
c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
from
customer,
orders,
lineitem,
nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= '1993-10-01'
and o_orderdate < date('1993-10-01') + 3 MONTHS
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc
fetch first 20 rows only;
Rewritten Query 10
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
customer,
orders,
lineitem,
nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= '1993-10-01'
and o_orderdate < date('1993-10-01') + 3 MONTHS
and l_returnflag = 'R'
and c_nationkey = n_nationkey
),
contribAllSubQuery as (
select
l_orderkey,
l_linenumber,
c_custkey,
c_name,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment,
min(l_extendedprice * (1 - l_discount)) as min_revenue,
max(l_extendedprice * (1 - l_discount)) as max_revenue,
1 as min_count,
max (rankProj) over (partition by l_orderkey,l_linenumber) as countProj,
cond_viol,
cond_sat
from
(select l_orderkey,
l_linenumber,
c_custkey,
c_name,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment,
l_extendedprice,
l_discount,
rank() over (partition by l_orderkey,l_linenumber
order by c_custkey,c_name,
c_acctbal,n_name, c_address, c_phone,
c_comment) as rankProj,
sum(case
when c_custkey = o_custkey
and o_orderdate >= '1993-10-01'
and o_orderdate < date('1993-10-01') + 3 MONTHS
and l_returnflag = 'R'
and c_nationkey = n_nationkey
then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as cond_viol,
case when c_custkey = o_custkey
and o_orderdate >= '1993-10-01'
and o_orderdate < date('1993-10-01') + 3 MONTHS
and l_returnflag = 'R'
and c_nationkey = n_nationkey
then 1 else 0 end as cond_sat
from
lineitem l JOIN orders on l_orderkey=o_orderkey
LEFT OUTER JOIN customer c1 ON c1.c_custkey=o_custkey
LEFT OUTER JOIN nation n1 ON n1.n_nationkey=c1.c_nationkey
where
exists (select * from candidatesSubQuery sc
where l.l_orderkey=sc.l_orderkey
and l.l_linenumber=sc.l_linenumber)
) q
where cond_sat=1
group by
l_orderkey,
l_linenumber,
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment,
cond_viol,cond_sat,rankProj),
contribConsistentSubQuery as (
select l_orderkey,
l_linenumber,
c_custkey,
c_name,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment,
min_revenue,
max_revenue,
min_count
from
contribAllSubQuery Cand
where
countProj=1 and cond_viol=0),
contribNonConsistentSubQuery as (
select
l_orderkey,
l_linenumber,
c_custkey,
c_name,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment,
0 as min_revenue,
max_revenue,
0 as min_count
from
contribAllSubQuery Cand
where
countProj > 1 or cond_viol >= 1)
select
c_custkey,
c_name,
sum(min_revenue) as min_sum_revenue,
sum(max_revenue) as max_sum_revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
from
(select * from contribNonConsistentSubQuery
union all
select * from contribConsistentSubQuery) as q
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
having sum(min_count)>0
order by
min_sum_revenue desc
fetch first 20 rows only;
TPC-H Query 12
select
l_shipmode,
sum(case
when o_orderpriority = '1-URGENT'
or o_orderpriority = '2-HIGH'
then 1
else 0
end) as high_line_count,
sum(case
when o_orderpriority <> '1-URGENT'
and o_orderpriority <> '2-HIGH'
then 1
else 0
end) as low_line_count
from
orders,
lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('MAIL', 'SHIP')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= '1994-01-01'
and l_receiptdate < date('1994-01-01') + 1 YEAR
group by
l_shipmode
order by
l_shipmode;
Rewritten Query 12
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
orders,
lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('MAIL', 'SHIP')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= '1994-01-01'
and l_receiptdate < date('1994-01-01') + 1 YEAR
),
contribAllSubQuery as (
select l_orderkey,
l_linenumber,
l_shipmode,
min(case
when o_orderpriority = '1-URGENT'
or o_orderpriority = '2-HIGH'
then 1
else 0
end) as min_high_line_count,
max(case
when o_orderpriority = '1-URGENT'
or o_orderpriority = '2-HIGH'
then 1
else 0
end) as max_high_line_count,
min(case
when o_orderpriority <> '1-URGENT'
and o_orderpriority <> '2-HIGH'
then 1
else 0
end) as min_low_line_count,
max(case
when o_orderpriority <> '1-URGENT'
and o_orderpriority <> '2-HIGH'
then 1
else 0
end) as max_low_line_count,
1 as min_count,
max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj,
cond_viol,
cond_sat
from
(select l_orderkey,
l_linenumber,
l_shipmode,
o_orderpriority,
rank() over (partition by l_orderkey,l_linenumber
order by l_shipmode) as rankProj,
sum(case
when l_shipmode in ('MAIL', 'SHIP')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= '1994-01-01'
and l_receiptdate < date('1994-01-01') + 1 YEAR
then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as cond_viol,
case when l_shipmode in ('MAIL', 'SHIP')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= '1994-01-01'
and l_receiptdate < date('1994-01-01') + 1 YEAR
then 1 else 0 end as cond_sat
from orders JOIN lineitem l ON l_orderkey = o_orderkey
where
exists (select * from candidatesSubQuery sc
where l.l_orderkey=sc.l_orderkey
and l.l_linenumber=sc.l_linenumber)
) q
where cond_sat=1
group by
l_shipmode,
l_orderkey,
l_linenumber,
cond_viol,cond_sat,rankProj),
contribConsistentSubQuery as (
select l_orderkey,
l_linenumber,
l_shipmode,
min_high_line_count,
max_high_line_count,
min_low_line_count,
max_low_line_count,
min_count
from
contribAllSubQuery Cand
where
countProj=1 and cond_viol=0),
contribNonConsistentSubQuery as (
select
l_orderkey,
l_linenumber,
l_shipmode,
0 as min_high_line_count,
max_high_line_count,
0 as min_low_line_count,
max_low_line_count,
0 as min_count
from
contribAllSubQuery Cand
where
countProj > 1 or cond_viol >= 1)
select
l_shipmode,
sum(min_high_line_count) as sum_min_high_line_count,
sum(max_high_line_count) as sum_max_high_line_count,
sum(min_low_line_count) as sum_min_low_line_count,
sum(max_low_line_count) as sum_max_low_line_count
from
(select * from contribNonConsistentSubQuery
union all
select * from contribConsistentSubQuery) as q
group by
l_shipmode
having sum(min_count)>0
order by
l_shipmode;
TPC-H Query 14
select
100.00 * sum(case
when p_type like 'PROMO%'
then l_extendedprice * (1 - l_discount)
else 0
end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate >= '1995-09-01'
and l_shipdate < date('1995-09-01') + 30 DAYS;
Rewritten Query 14
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate >= '1995-09-01'
and l_shipdate < date('1995-09-01') + 30 DAYS
),
contribAllSubQuery as (
select
min(dividend) as low_dividend,
max(dividend) as up_dividend,
min(divisor) as low_divisor,
max(divisor) as up_divisor,
condWhereSat,
condWhereViol
from
(
select
l_orderkey,
l_linenumber,
100.00 * case
when p_type like 'PROMO%'
then l_extendedprice * (1 - l_discount)
else 0
end as dividend,
l_extendedprice * (1 - l_discount) as divisor,
sum(case when
(l_partkey = p_partkey
and l_shipdate >= '1995-09-01'
and l_shipdate < date('1995-09-01') + 30 DAYS)
then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as condWhereViol,
case when
(l_partkey = p_partkey
and l_shipdate >= '1995-09-01'
and l_shipdate < date('1995-09-01') + 30 DAYS)
then 1 else 0 end as condWhereSat
from
lineitem li LEFT OUTER JOIN part ON l_partkey = p_partkey
where
exists (select * from candidatesSubQuery sc
where li.l_orderkey=sc.l_orderkey
and li.l_linenumber=sc.l_linenumber)
) q
where
condWhereSat=1
group by l_orderkey,l_linenumber,condWhereSat,condWhereViol),
contribConsistentSubQuery as (
select
low_dividend,
up_dividend,
low_divisor,
up_divisor,
1 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol = 0),
contribNonConsistentSubQuery as (
select
0 as low_dividend,
up_dividend,
0 as low_divisor,
up_divisor,
0 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol >= 1)
select sum(low_dividend)/sum(up_divisor) as low_promo_revenue,
sum(up_dividend)/sum(low_divisor) as up_promo_revenue
from
(select * from contribConsistentSubQuery
union all
select * from contribNonConsistentSubQuery) q
having sum(countConsistent)>0;
TPC-H Query 19
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
(
p_partkey = l_partkey
and p_brand = 'Brand#12'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 1 and l_quantity <= 1 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#34'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 20 and l_quantity <= 20 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
);
Rewritten Query 19
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
lineitem,
part
where
(
p_partkey = l_partkey
and p_brand = 'Brand#12'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 1 and l_quantity <= 1 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#34'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 20 and l_quantity <= 20 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
),
contribAllSubQuery as (
select min(revenue) as low_revenue,
max(revenue) as up_revenue,
condWhereViol,
condWhereSat
from
(select
l_orderkey,
l_linenumber,
l_extendedprice* (1 - l_discount) as revenue,
sum (case when (
(
p_partkey = l_partkey
and p_brand = 'Brand#12'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 1 and l_quantity <= 1 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#34'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 20 and l_quantity <= 20 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
) then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as condWhereViol,
case when (
(
p_partkey = l_partkey
and p_brand = 'Brand#12'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 1 and l_quantity <= 1 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#34'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 20 and l_quantity <= 20 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
) then 1 else 0 end as condWhereSat
from
lineitem li
LEFT OUTER JOIN part ON p_partkey = l_partkey
where
exists (select * from candidatesSubQuery sc
where li.l_orderkey=sc.l_orderkey
and li.l_linenumber=sc.l_linenumber)
) q
where condWhereSat = 1
group by l_orderkey,l_linenumber,condWhereViol,condWhereSat),
contribConsistentSubQuery as (
select low_revenue,
up_revenue,
1 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol = 0),
contribNonConsistentSubQuery as (
select
0 as low_revenue,
up_revenue,
0 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol >= 1)
select sum(low_revenue) as sum_low_revenue,
sum(up_revenue) as sum_up_revenue
from
(select * from contribConsistentSubQuery
union all
select * from contribNonConsistentSubQuery) q
having sum(countConsistent)>0;
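The range semantics used by the rewritten aggregate queries above can be illustrated on a toy instance. The following is a minimal sketch only, written in Python with SQLite rather than DB2; it is not the ConQuer rewriting itself. The table contents, the simplified condition `l_quantity <= 10`, and the names `TOY_ROWS`, `RANGE_SQL`, and `revenue_range` are assumptions made for illustration. It replaces the window functions of the listings with a plain `group by` on the key (l_orderkey, l_linenumber), and it assumes non-negative values, so that a key whose tuples do not all satisfy the condition contributes 0 to the lower bound.

```python
import sqlite3

# Hypothetical toy data: (l_orderkey, l_linenumber, l_quantity, l_extendedprice).
# Keys (2,1) and (3,1) each appear twice, i.e. they violate the key constraint.
TOY_ROWS = [
    (1, 1, 5, 100),   # consistent key, satisfies the condition
    (2, 1, 5, 200),   # key violation, both tuples satisfy the condition
    (2, 1, 7, 250),
    (3, 1, 5, 50),    # key violation, only one tuple satisfies the condition
    (3, 1, 20, 60),
    (4, 1, 30, 999),  # never satisfies the condition, so it is not a candidate
]

# Simplified analogue of contribAllSubQuery and the final select: per key, take
# the min and max over satisfying tuples and count the violating tuples.
RANGE_SQL = """
with contribAll as (
    select l_orderkey, l_linenumber,
           min(case when l_quantity <= 10 then l_extendedprice end) as low_revenue,
           max(case when l_quantity <= 10 then l_extendedprice end) as up_revenue,
           sum(case when l_quantity <= 10 then 0 else 1 end) as condWhereViol
    from lineitem
    group by l_orderkey, l_linenumber
    having sum(case when l_quantity <= 10 then 1 else 0 end) > 0
)
select sum(case when condWhereViol = 0 then low_revenue else 0 end),
       sum(up_revenue),
       sum(case when condWhereViol = 0 then 1 else 0 end)
from contribAll
"""


def revenue_range(rows):
    """Return (lower, upper) bounds of sum(l_extendedprice) over all repairs,
    or None when no key contributes in every repair."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table lineitem (l_orderkey integer, l_linenumber integer, "
            "l_quantity integer, l_extendedprice integer)"
        )
        conn.executemany("insert into lineitem values (?, ?, ?, ?)", rows)
        low, up, consistent = conn.execute(RANGE_SQL).fetchone()
    finally:
        conn.close()
    # Mirrors "having sum(countConsistent)>0" in the listings above.
    if not consistent:
        return None
    return (low, up)


if __name__ == "__main__":
    print(revenue_range(TOY_ROWS))
```

On the toy instance, key (3,1) contributes 0 to the lower bound because some repair keeps its non-satisfying tuple, while key (2,1) contributes its minimum and maximum price to the lower and upper bounds, respectively.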
Appendix B
Design Advisor Indices
The following are the indices suggested by DB2’s Design Advisor for the inconsistent databases of the
scalability experiment.
Inconsistent database size 1 GB, p = 5%, n = 2
-- index[1], 4.118MB
CREATE INDEX "DB2ADMIN"."IDX609161412250000" ON "DB2ADMIN"."CUSTOMER" ("C_MKTSEGMENT" ASC, "C_CUSTKEY" ASC)
ALLOW REVERSE SCANS ;
-- index[2], 2.884MB
CREATE INDEX "DB2ADMIN"."IDX609161413270000" ON "DB2ADMIN"."CUSTOMER" ("C_NATIONKEY" ASC, "C_CUSTKEY" ASC)
ALLOW REVERSE SCANS ;
-- index[3], 28.802MB
CREATE INDEX "DB2ADMIN"."IDX609161413220000" ON "DB2ADMIN"."ORDERS" ("O_CUSTKEY" ASC, "O_ORDERKEY" ASC)
ALLOW REVERSE SCANS ;
-- index[4], 0.196MB
CREATE INDEX "DB2ADMIN"."IDX609161412310000" ON "DB2ADMIN"."SUPPLIER" ("S_NATIONKEY" ASC, "S_SUPPKEY" ASC)
ALLOW REVERSE SCANS ;
-- index[5], 10.751MB
CREATE INDEX "DB2ADMIN"."IDX609161415040000" ON "DB2ADMIN"."PART" ("P_TYPE" ASC, "P_PARTKEY" ASC)
ALLOW REVERSE SCANS ;
-- index[6], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161413420000" ON "DB2ADMIN"."REGION" ("R_NAME" ASC, "R_REGIONKEY" ASC)
ALLOW REVERSE SCANS ;
-- index[7], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161413510000" ON "DB2ADMIN"."NATION" ("N_REGIONKEY" ASC, "N_NATIONKEY" ASC)
ALLOW REVERSE SCANS ;
-- index[8], 28.282MB
CREATE INDEX "DB2ADMIN"."IDX609161415260000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC, "O_ORDERDATE" ASC)
ALLOW REVERSE SCANS ;
-- index[9], 39.142MB
CREATE INDEX "DB2ADMIN"."IDX609161417270000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,
"O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[10], 63.013MB
CREATE INDEX "DB2ADMIN"."IDX609161421380000" ON "DB2ADMIN"."LINEITEM" ("L_SHIPDATE" ASC, "L_LINENUMBER" ASC,
"L_ORDERKEY" ASC) ALLOW REVERSE SCANS ;
-- index[11], 4.118MB
CREATE INDEX "DB2ADMIN"."IDX609161422320000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC, "C_MKTSEGMENT" ASC)
ALLOW REVERSE SCANS ;
-- index[12], 2.884MB
CREATE INDEX "DB2ADMIN"."IDX609161424190000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC, "C_NATIONKEY" ASC)
ALLOW REVERSE SCANS ;
-- index[13], 28.802MB
CREATE INDEX "DB2ADMIN"."IDX609161424240000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC, "O_CUSTKEY" ASC)
ALLOW REVERSE SCANS ;
-- index[14], 39.142MB
CREATE INDEX "DB2ADMIN"."IDX609161425360000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC, "O_ORDERDATE" ASC,
"O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[15], 131.692MB
CREATE INDEX "DB2ADMIN"."IDX609161431410000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC, "L_QUANTITY" ASC)
ALLOW REVERSE SCANS ;
-- index[16], 9.919MB
CREATE INDEX "DB2ADMIN"."IDX609161432260000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC, "P_SIZE" ASC,
"P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;
Inconsistent database size 2 GB, p = 5%, n = 2
-- index[1], 103.747MB
CREATE INDEX "DB2ADMIN"."IDX609161434540000" ON "DB2ADMIN"."LINEITEM" ("L_DISCOUNT" ASC,
"L_SHIPDATE" ASC, "L_QUANTITY" ASC, "L_EXTENDEDPRICE" ASC) ALLOW REVERSE SCANS ;
-- index[2], 210.294MB
CREATE INDEX "DB2ADMIN"."IDX609161446330000" ON "DB2ADMIN"."LINEITEM" ("L_QUANTITY" ASC, "L_SHIPDATE" ASC,
"L_DISCOUNT" ASC, "L_EXTENDEDPRICE" ASC, "L_LINENUMBER" ASC, "L_ORDERKEY" ASC) ALLOW REVERSE SCANS ;
-- index[3], 396.028MB
CREATE INDEX "DB2ADMIN"."IDX609161447560000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC, "L_SUPPKEY" ASC,
"L_ORDERKEY" ASC, "L_LINENUMBER" ASC) ALLOW REVERSE SCANS ;
-- index[4], 263.388MB
CREATE INDEX "DB2ADMIN"."IDX609161453480000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC, "L_QUANTITY" ASC)
ALLOW REVERSE SCANS ;
Inconsistent database size 3 GB, p = 5%, n = 2
-- index[1], 12.341MB
CREATE INDEX "DB2ADMIN"."IDX609161458010000" ON "DB2ADMIN"."CUSTOMER" ("C_MKTSEGMENT" ASC,
"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[2], 135.380MB
CREATE INDEX "DB2ADMIN"."IDX609161458080000" ON "DB2ADMIN"."LINEITEM" ("L_DISCOUNT" ASC, "L_SHIPDATE" ASC,
"L_QUANTITY" ASC, "L_EXTENDEDPRICE" ASC) ALLOW REVERSE SCANS ;
-- index[3], 8.646MB
CREATE INDEX "DB2ADMIN"."IDX609161459360000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,
"C_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[4], 84.853MB
CREATE INDEX "DB2ADMIN"."IDX609161459280000" ON "DB2ADMIN"."ORDERS" ("O_CUSTKEY" ASC,
"O_ORDERKEY" ASC) ALLOW REVERSE SCANS ;
-- index[5], 0.583MB
CREATE INDEX "DB2ADMIN"."IDX609161458370000" ON "DB2ADMIN"."SUPPLIER" ("S_NATIONKEY" ASC,
"S_SUPPKEY" ASC) ALLOW REVERSE SCANS ;
-- index[6], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161459480000" ON "DB2ADMIN"."REGION" ("R_NAME" ASC,
"R_REGIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[7], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161459570000" ON "DB2ADMIN"."NATION" ("N_REGIONKEY" ASC,
"N_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[8], 8.646MB
CREATE INDEX "DB2ADMIN"."IDX609161459580000" ON "DB2ADMIN"."CUSTOMER" ("C_NATIONKEY" ASC,
"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[9], 32.243MB
CREATE INDEX "DB2ADMIN"."IDX609161501100000" ON "DB2ADMIN"."PART" ("P_TYPE" ASC,
"P_PARTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[10], 84.853MB
CREATE INDEX "DB2ADMIN"."IDX609161501320000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,
"O_ORDERDATE" ASC) ALLOW REVERSE SCANS ;
-- index[11], 26.794MB
CREATE INDEX "DB2ADMIN"."IDX609161502520000" ON "DB2ADMIN"."PARTSUPP" ("PS_PARTKEY" ASC, "PS_SUPPKEY" ASC,
"PS_SUPPLYCOST" ASC) ALLOW REVERSE SCANS ;
-- index[12], 12.341MB
CREATE INDEX "DB2ADMIN"."IDX609161506500000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,
"C_MKTSEGMENT" ASC) ALLOW REVERSE SCANS ;
-- index[13], 84.853MB
CREATE INDEX "DB2ADMIN"."IDX609161508420000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,
"O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[14], 115.099MB
CREATE INDEX "DB2ADMIN"."IDX609161509520000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,
"O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[15], 395.075MB
CREATE INDEX "DB2ADMIN"."IDX609161515590000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC,
"L_QUANTITY" ASC) ALLOW REVERSE SCANS ;
-- index[16], 29.751MB
CREATE INDEX "DB2ADMIN"."IDX609161516440000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC, "P_SIZE" ASC,
"P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;
Inconsistent database size 5 GB, p = 5%, n = 2
-- mqt[1], 1430.329MB
CREATE SUMMARY TABLE "DB2ADMIN"."MQT609161518000000" AS (SELECT Q4.C0 AS "C0", Q4.C1 AS "C1",
Q4.C2 AS "C2", Q4.C5 AS "C3", Q4.C4 AS "C4", Q4.C3 AS "C5", Q4.C6 AS "C6"
FROM TABLE(SELECT Q3.C0 AS "C0", SUM(Q3.C1) AS "C1", SUM(Q3.C2) AS "C2", Q3.C5 AS "C3", Q3.C4 AS "C4", Q3.C3 AS "C5", COUNT(*) AS "C6" FROM TABLE(SELECT Q1.L_SHIPMODE AS "C0", CASE WHEN ((Q2.O_ORDERPRIORITY = '1-URGENT ') OR (Q2.O_ORDERPRIORITY = '2-HIGH ')) THEN 1 ELSE 0 END AS "C1", CASE WHEN ((Q2.O_ORDERPRIORITY <> '1-URGENT ') AND (Q2.O_ORDERPRIORITY <> '2-HIGH ')) THEN 1 ELSE 0 END AS "C2", Q1.L_RECEIPTDATE AS "C3", Q1.L_SHIPDATE AS "C4", Q1.L_COMMITDATE AS "C5" FROM DB2ADMIN.LINEITEM AS Q1, DB2ADMIN.ORDERS AS Q2 WHERE (Q2.O_ORDERKEY = Q1.L_ORDERKEY)) AS Q3 GROUP BY Q3.C3, Q3.C4, Q3.C5, Q3.C0) AS Q4) DATA INITIALLY DEFERRED REFRESH IMMEDIATE IN USERSPACE1 ;
-- index[1], 990.099MB
CREATE INDEX "DB2ADMIN"."IDX609161532510000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC,
"L_SUPPKEY" ASC, "L_ORDERKEY" ASC, "L_LINENUMBER" ASC) ALLOW REVERSE SCANS ;
-- index[2], 49.587MB
CREATE INDEX "DB2ADMIN"."IDX609161539360000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC, "P_SIZE" ASC,
"P_CONTAINER" ASC, "P_BRAND" ASC) ALLOW REVERSE SCANS ;
-- index[3], 164.485MB
CREATE INDEX "DB2ADMIN"."IDX609161539380000" ON "DB2ADMIN"."MQT609161518000000"
("C3" ASC, "C0" ASC, "C5" ASC, "C4" ASC) ALLOW REVERSE SCANS ;
Inconsistent database size 10 GB, p = 5%, n = 2
-- index[1], 41.126MB
CREATE INDEX "DB2ADMIN"."IDX609161543010000" ON "DB2ADMIN"."CUSTOMER" ("C_MKTSEGMENT" ASC,
"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[2], 373.618MB
CREATE INDEX "DB2ADMIN"."IDX609161543080000" ON "DB2ADMIN"."LINEITEM" ("L_DISCOUNT" ASC,
"L_SHIPDATE" ASC, "L_QUANTITY" ASC, "L_EXTENDEDPRICE" ASC) ALLOW REVERSE SCANS ;
-- index[3], 1.923MB
CREATE INDEX "DB2ADMIN"."IDX609161543370000" ON "DB2ADMIN"."SUPPLIER" ("S_NATIONKEY" ASC,
"S_SUPPKEY" ASC) ALLOW REVERSE SCANS ;
-- index[4], 282.829MB
CREATE INDEX "DB2ADMIN"."IDX609161544310000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,
"O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[5], 28.802MB
CREATE INDEX "DB2ADMIN"."IDX609161544360000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,
"C_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[6], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161544480000" ON "DB2ADMIN"."REGION" ("R_NAME" ASC,
"R_REGIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[7], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161544570000" ON "DB2ADMIN"."NATION" ("N_REGIONKEY" ASC,
"N_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[8], 28.802MB
CREATE INDEX "DB2ADMIN"."IDX609161544580000" ON "DB2ADMIN"."CUSTOMER" ("C_NATIONKEY" ASC,
"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[9], 107.474MB
CREATE INDEX "DB2ADMIN"."IDX609161546100000" ON "DB2ADMIN"."PART" ("P_TYPE" ASC,
"P_PARTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[10], 230.290MB
CREATE INDEX "DB2ADMIN"."IDX609161547520000" ON "DB2ADMIN"."PARTSUPP" ("PS_PARTKEY" ASC, "PS_SUPPKEY" ASC,
"PS_SUPPLYCOST" ASC) ALLOW REVERSE SCANS ;
-- index[11], 282.829MB
CREATE INDEX "DB2ADMIN"."IDX609161546320000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,
"O_ORDERDATE" ASC) ALLOW REVERSE SCANS ;
-- index[12], 1565.091MB
CREATE INDEX "DB2ADMIN"."IDX609161550570000" ON "DB2ADMIN"."LINEITEM" ("L_ORDERKEY" ASC,
"L_LINENUMBER" ASC, "L_SHIPDATE" ASC) ALLOW REVERSE SCANS ;
-- index[13], 41.126MB
CREATE INDEX "DB2ADMIN"."IDX609161551500000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,
"C_MKTSEGMENT" ASC) ALLOW REVERSE SCANS ;
-- index[14], 282.829MB
CREATE INDEX "DB2ADMIN"."IDX609161552080000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,
"O_ORDERKEY" DESC) ALLOW REVERSE SCANS ;
-- index[15], 1.923MB
CREATE INDEX "DB2ADMIN"."IDX609161554170000" ON "DB2ADMIN"."SUPPLIER" ("S_SUPPKEY" ASC,
"S_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[16], 383.646MB
CREATE INDEX "DB2ADMIN"."IDX609161554520000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,
"O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[17], 99.169MB
CREATE INDEX "DB2ADMIN"."IDX609161601440000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC,
"P_SIZE" ASC, "P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;
Inconsistent database size 20 GB, p = 5%, n = 2
-- index[1], 990.228MB
CREATE INDEX "DB2ADMIN"."IDX609161604350000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,
"O_SHIPPRIORITY" ASC, "O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[2], 82.243MB
CREATE INDEX "DB2ADMIN"."IDX609161605050000" ON "DB2ADMIN"."CUSTOMER" ("C_MKTSEGMENT" ASC,
"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[3], 575.935MB
CREATE INDEX "DB2ADMIN"."IDX609161606040000" ON "DB2ADMIN"."ORDERS" ("O_CUSTKEY" ASC,
"O_ORDERKEY" ASC) ALLOW REVERSE SCANS ;
-- index[4], 3.841MB
CREATE INDEX "DB2ADMIN"."IDX609161605130000" ON "DB2ADMIN"."SUPPLIER" ("S_NATIONKEY" ASC,
"S_SUPPKEY" ASC) ALLOW REVERSE SCANS ;
-- index[5], 57.599MB
CREATE INDEX "DB2ADMIN"."IDX609161606090000" ON "DB2ADMIN"."CUSTOMER" ("C_NATIONKEY" ASC,
"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[6], 214.939MB
CREATE INDEX "DB2ADMIN"."IDX609161607460000" ON "DB2ADMIN"."PART" ("P_TYPE" ASC,
"P_PARTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[7], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161606240000" ON "DB2ADMIN"."REGION" ("R_NAME" ASC,
"R_REGIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[8], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161606330000" ON "DB2ADMIN"."NATION" ("N_REGIONKEY" ASC,
"N_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[9], 178.575MB
CREATE INDEX "DB2ADMIN"."IDX609161609310000" ON "DB2ADMIN"."PARTSUPP" ("PS_PARTKEY" ASC,
"PS_SUPPKEY" ASC, "PS_SUPPLYCOST" ASC) ALLOW REVERSE SCANS ;
-- index[10], 575.935MB
CREATE INDEX "DB2ADMIN"."IDX609161608110000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,
"O_ORDERDATE" ASC) ALLOW REVERSE SCANS ;
-- index[11], 782.731MB
CREATE INDEX "DB2ADMIN"."IDX609161610160000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,
"O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[12], 919.810MB
CREATE INDEX "DB2ADMIN"."IDX609161614260000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC,
"L_SHIPINSTRUCT" ASC, "L_SHIPMODE" ASC, "L_QUANTITY" ASC) ALLOW REVERSE SCANS ;
-- index[13], 82.243MB
CREATE INDEX "DB2ADMIN"."IDX609161615250000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,
"C_MKTSEGMENT" ASC) ALLOW REVERSE SCANS ;
-- index[14], 1462.771MB
CREATE INDEX "DB2ADMIN"."IDX609161615450000" ON "DB2ADMIN"."LINEITEM" ("L_LINENUMBER" ASC,
"L_RECEIPTDATE" ASC, "L_COMMITDATE" ASC, "L_ORDERKEY" ASC) ALLOW REVERSE SCANS ;
-- index[15], 575.935MB
CREATE INDEX "DB2ADMIN"."IDX609161615430000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,
"O_ORDERKEY" DESC) ALLOW REVERSE SCANS ;
-- index[16], 57.599MB
CREATE INDEX "DB2ADMIN"."IDX609161617140000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,
"C_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[17], 3.841MB
CREATE INDEX "DB2ADMIN"."IDX609161617540000" ON "DB2ADMIN"."SUPPLIER" ("S_SUPPKEY" ASC,
"S_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[18], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161622150000" ON "DB2ADMIN"."NATION" ("N_NATIONKEY" ASC,
"N_NAME" ASC) ALLOW REVERSE SCANS ;
-- index[19], 198.333MB
CREATE INDEX "DB2ADMIN"."IDX609161625390000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC,
"P_SIZE" ASC, "P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;