Consistent Query Answers in Inconsistent Databases

Abstract

In this paper we consider the problem of the logical characterization of the notion of consistent answer in a relational database that may violate given integrity constraints. This notion is captured in terms of the possible repaired versions of the database. A method for computing consistent answers is given and its soundness and completeness (for some classes of constraints and queries) proved. The method is based on an iterative procedure whose termination for several classes of constraints is proved as well.
Marcelo Arenas
Pontificia Universidad Católica de Chile
Escuela de Ingeniería
Departamento de Ciencia de Computación
Casilla 306, Santiago 22, Chile
marenas@ing.puc.cl
Leopoldo Bertossi
Pontificia Universidad Católica de Chile
Escuela de Ingeniería
Departamento de Ciencia de Computación
Casilla 306, Santiago 22, Chile
bertossi@ing.puc.cl
Jan Chomicki
Monmouth University
Department of Computer Science
West Long Branch, NJ 07764
chomicki@monmouth.edu
Integrity constraints capture an important normative aspect
of every database application. However, it is often the case
that their satisfaction cannot be guaranteed, allowing for the
existence of inconsistent database instances. In that case,
it is important to know which query answers are consistent
with the integrity constraints and which are not. In this pa-
per, we provide a logical characterization of consistent query
answers in relational databases that may be inconsistent with
the given integrity constraints. Intuitively, an answer to a
query posed to a database that violates the integrity con-
straints will be consistent in a precise sense: It should be the
same as the answer obtained from any minimally repaired
version of the original database. We also provide a method
for computing such answers and prove its properties. On the
basis of a query $Q$, the method computes, using an iterative
procedure, a new query $T_\omega(Q)$ whose evaluation in an arbitrary
(consistent or inconsistent) database returns the set of
consistent answers to the original query Q. We envision the
application of our results in a number of areas:
Data warehousing. A data warehouse contains data com-
ing from many different sources. Some of it typically does
not satisfy the given integrity constraints. The usual ap-
proach is thus to clean the data by removing inconsistencies
before the data is stored in the warehouse [6]. Our results
make it possible to determine which data is already clean
and proceed to safely remove unclean data. Moreover, a different
scenario becomes possible, in which the inconsistencies
are not removed but rather query answers are marked as
“consistent” or “inconsistent”. In this way, information loss
due to data cleaning may be prevented.
Database integration. Often many different databases
are integrated together to provide a single unified view for
the users. Database integration is difficult since it requires
the resolution of many different kinds of discrepancies of
the integrated databases. One possible discrepancy is due
to different sets of integrity constraints. Moreover, even if
every integrated database locally satisfies the same integrity
constraint, the constraint may be globally violated. For ex-
ample, different databases may assign different addresses to
the same student. Such conflicts may fail to be resolved at all
and inconsistent data cannot be “cleaned” because of the au-
tonomy of different databases. Therefore, it is important to
be able to find out, given a set of local integrity constraints,
which query answers returned from the integrated database
are consistent with the constraints and which are not.
Active and reactive databases. A violation of integrity
constraints may be acceptable under the provision that it will
be repaired in the near future. For example, the stock level in
a warehouse may be allowed to fall below the required min-
imum if the necessary replenishments have been ordered.
During this temporary inconsistency, however, query answers
should give an indication whether they are consistent with
the constraints or not. This problem is particularly acute in
active databases that allow such consistency lapses. The re-
sult of evaluating a trigger condition that is consistent with
the integrity constraints should be treated differently from
the one that isn’t.
The following example presents the basic intuitions be-
hind the notion of consistent query answer.
Example 1. Consider a database subject to the following IC:
$$\forall x\,(P(x) \rightarrow Q(x)).$$
The instance
$$\{P(a),\ P(b),\ Q(a),\ Q(c)\}$$
violates this constraint. Now if the query asks for all $x$ such
that $Q(x)$, only $a$ is returned as an answer consistent with the
integrity constraint.
The plan of this paper is as follows. In section 2 we in-
troduce the basic notions of our approach, including those of
repair and consistent query answer. In section 3 we show
how to compute the query $T_\omega(Q)$ for a given first-order
query $Q$. In subsequent sections, the properties of this
method are analyzed: soundness in section 4, completeness
in section 5, and termination in section 6. In section 7 we
discuss related work. In section 8 we conclude and outline
some of the prospects for future work in this area. The proofs
are given in the appendix.
In this paper we assume we have a fixed database schema
and a fixed infinite database domain D. We also have a first
order language based on this schema with names for the elements
of $D$. We assume that elements of the domain with different
names are different. The instances of the schema are
finite structures for interpreting the first order language. As
such they all share the given domain $D$; nevertheless, since
relations are finite, every instance has a finite active domain
which is a subset of $D$. As usual, we allow built-in predicates
that have infinite extensions, identical for all database
instances. There is also a set of integrity constraints IC,
expressed in that language, which the database instances are
expected to satisfy. We will assume that IC is consistent in
the sense that there is a database instance that makes it true.
Definition 1. (Consistency) A database instance $r$ is consistent
if $r$ satisfies IC in the standard model-theoretic sense,
that is, $r \models IC$; $r$ is inconsistent otherwise.
This paper addresses the issue of obtaining meaningful
and useful query answers in any, consistent or inconsistent,
database. It is well known how to obtain query answers in
consistent databases. Therefore, the challenging part is how
to deal with the inconsistent ones.
Given a database instance $r$, we denote by $\Sigma(r)$ the set of
formulas $\{P(\bar{a}) \mid r \models P(\bar{a})\}$, where the $P$s are relation names
and $\bar{a}$ is a ground tuple.
Definition 2. (Distance) The distance $\Delta(r, r')$ between database
instances $r$ and $r'$ is the symmetric difference:
$$\Delta(r, r') = (\Sigma(r) - \Sigma(r')) \cup (\Sigma(r') - \Sigma(r)).$$
Definition 3. For the instances $r, r', r''$, $r' \leq_r r''$ if $\Delta(r, r') \subseteq \Delta(r, r'')$,
i.e., if the distance between $r$ and $r'$ is less than or
equal to the distance between $r$ and $r''$.
Notice that built-in predicates do not contribute to the
Δs because they have fixed extensions, identical in every
database instance.
Definition 4. (Repair) Given database instances $r$ and $r'$, we
say that $r'$ is a repair of $r$ if $r' \models IC$ and $r'$ is $\leq_r$-minimal in
the class of database instances that satisfy the ICs.
Clearly, what constitutes a repair depends on the given
set of integrity constraints. In the following we assume that
this set is fixed.
Example 2. Let us consider a database schema with two
unary relations $P$ and $Q$ and domain $D = \{a, b, c\}$. Assume
that for an instance $r$, $\Sigma(r) = \{P(a), P(b), Q(a), Q(c)\}$, and
let $IC = \{\forall x\,(P(x) \rightarrow Q(x))\}$. Clearly, $r$ does not satisfy IC
because $r \models P(b) \wedge \neg Q(b)$.
In this case we have two possible repairs for $r$. First,
we can falsify $P(b)$, obtaining an instance $r'$ with
$\Sigma(r') = \{P(a), Q(a), Q(c)\}$. As a second alternative, we can make
$Q(b)$ true, obtaining an instance $r''$ with
$\Sigma(r'') = \{P(a), P(b), Q(a), Q(b), Q(c)\}$.
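The two repairs of Example 2 can be checked by brute force. The following is only a minimal sketch under the assumptions of this example (the finite domain $\{a, b, c\}$ and the single IC); the names `satisfies_ic` and `repairs` are illustrative and do not come from the paper. It enumerates every instance over the ground atoms, keeps the consistent ones, and retains those whose symmetric difference with $r$ is minimal under set inclusion, as in Definitions 2 to 4.

```python
from itertools import chain, combinations

DOMAIN = ["a", "b", "c"]
# All ground atoms over the unary relations P and Q.
ATOMS = [(rel, c) for rel in ("P", "Q") for c in DOMAIN]


def satisfies_ic(instance):
    """IC of Example 2: for all x, P(x) implies Q(x)."""
    return all(("Q", c) in instance for (rel, c) in instance if rel == "P")


def repairs(r):
    """Consistent instances whose symmetric difference with r is inclusion-minimal."""
    subsets = chain.from_iterable(
        combinations(ATOMS, k) for k in range(len(ATOMS) + 1)
    )
    consistent = [frozenset(s) for s in subsets if satisfies_ic(frozenset(s))]
    # Delta(r, r') is the symmetric difference of the two sets of atoms.
    deltas = {inst: inst ^ r for inst in consistent}
    return [
        inst
        for inst in consistent
        if not any(deltas[other] < deltas[inst] for other in consistent)
    ]


r = frozenset({("P", "a"), ("P", "b"), ("Q", "a"), ("Q", "c")})
found = {tuple(sorted(inst)) for inst in repairs(r)}
for inst in sorted(found):
    print(inst)
```

Exhaustive enumeration is exponential in the number of ground atoms, so this is useful only for checking small examples by hand.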
The definition of a repair satisfies certain desirable and
expected properties. Firstly, a consistent database does not
need to be repaired, because if $r$ satisfies IC, then, by the
minimality condition with respect to the relation $\leq_r$, $r$ is the only repair
of itself (since $\Delta(r, r)$ is empty). Secondly, any database $r$
can always be repaired because there is a database $r'$ that
satisfies IC, and $\Delta(r, r')$ is finite.
Example 3. (motivated by [19]) Consider the IC saying that
$C$ is the only supplier of items of class $T_4$:
$$\forall x, y, z\,(Supply(x, y, z) \wedge Class(z, T_4) \rightarrow x = C) \qquad (1)$$
The following database instance $r_1$ violates the IC:
  Supply            Class
  C   D1  I1        I1  T4
  D   D2  I2        I2  T4
The only repairs of this database are
  Supply            Class
  C   D1  I1        I1  T4
                    I2  T4
and
  Supply            Class
  C   D1  I1        I1  T4
  D   D2  I2
Example 4. (motivated by [19]) Consider the IC:
$$\forall x, y\,(Supply(x, y, I_1) \rightarrow Supply(x, y, I_2)) \qquad (2)$$
saying that item $I_2$ is supplied whenever item $I_1$ is supplied;
and the following inconsistent instance, $r_2$, of the database
  Supply
  C   D1  I1
  C   D1  I3
This instance has two repairs:
  Supply
  C   D1  I1
  C   D1  I2
  C   D1  I3
and
  Supply
  C   D1  I3
Example 5. Consider a student database. $Student(x, y, z)$
means that $x$ is the student number, $y$ is the student’s name,
and $z$ is the student’s address. The two following ICs state
that the first argument is a key of the relation
$$\forall x, y, z, u, v\,(Student(x, y, z) \wedge Student(x, u, v) \rightarrow y = u),$$
$$\forall x, y, z, u, v\,(Student(x, y, z) \wedge Student(x, u, v) \rightarrow z = v).$$
The inconsistent database instance $r_3$
  Student           Course
  S1  N1  D1        S1  C1  G1
  S1  N2  D1        S1  C2  G2
has two repairs:
  Student           Course
  S1  N1  D1        S1  C1  G1
                    S1  C2  G2
and
  Student           Course
  S1  N2  D1        S1  C1  G1
                    S1  C2  G2
We assume all queries are in prenex disjunctive normal form.
Definition 5. A formula $Q$ is a query if it has the following
syntactical form:
$$\bar{Q} \bigvee_{i=1}^{s} \Big( \bigwedge_{j=1}^{m_i} P_{i,j}(\bar{u}_{i,j}) \wedge \bigwedge_{j=1}^{n_i} \neg Q_{i,j}(\bar{v}_{i,j}) \wedge \psi_i \Big),$$
where $\bar{Q}$ is a sequence of quantifiers and every $\psi_i$ contains
only built-in predicates. If $\bar{Q}$ contains only universal quantifiers,
then we say that $Q$ is a universal query. If $\bar{Q}$ contains
existential (and possibly universal) quantifiers, we say that
$Q$ is a non-universal query.
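Definition 6. (Query answer) A (ground) tuple $\bar{t}$ is an answer
to a query $Q(\bar{x})$ in a database instance $r$ if $r \models Q(\bar{t})$. A
(ground) tuple $\bar{t}$ is an answer to a set of queries $\{Q_1, \ldots, Q_n\}$
if $r \models (Q_1 \wedge \cdots \wedge Q_n)(\bar{t})$.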
Definition 7. (Consistent answer) Given a set of integrity
constraints, we say that a (ground) tuple $\bar{t}$ is a consistent
answer to a query $Q(\bar{x})$ in a database instance $r$, and we
write $r \models_c Q(\bar{t})$ (or $r \models_c Q(\bar{x})[\bar{t}]$), if for every repair $r'$ of
$r$, $r' \models Q(\bar{t})$. If $Q$ is a sentence, then true (false) is a consistent
answer to $Q$ in $r$, and we write $r \models_c Q$ ($r \not\models_c Q$), if for
every repair $r'$ of $r$, $r' \models Q$ ($r' \not\models Q$).
Example 6. (example 3 continued) The only consistent answer
to the query $Class(z, T_4)$, posed to the database instance
$r_1$, is $I_1$ because $r_1 \models_c Class(z, T_4)[I_1]$.
Example 7. (example 4 continued) The only consistent answer
to the query $Supply(C, D_1, z)$, posed to the database instance
$r_2$, is $I_3$ because $r_2 \models_c Supply(C, D_1, z)[I_3]$.
Example 8. (example 5 continued) By considering all the repairs
of the database instance $r_3$, we obtain $C_1$ and $C_2$ as the
consistent answers to the query $\exists z\,Course(S_1, y, z)$, posed to
$r_3$. For the query $\exists u, v\,(Student(u, N_1, v) \wedge Course(u, x, y))$,
we obtain no (consistent) answers.
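Definition 7 can be read operationally: evaluate the query in every repair and intersect the results. The sketch below does this for Example 8. It assumes that, for key constraints, the repairs are exactly the maximal consistent subsets of the Student relation (violations of a functional dependency can only be removed by deleting tuples); the function names are illustrative and are not part of the paper.

```python
from itertools import combinations

student = {("S1", "N1", "D1"), ("S1", "N2", "D1")}
course = {("S1", "C1", "G1"), ("S1", "C2", "G2")}


def key_holds(students):
    """The first attribute determines both the name and the address."""
    return all(s == t for s in students for t in students if s[0] == t[0])


def student_repairs(students):
    """Maximal subsets of the Student relation that satisfy the key constraints."""
    consistent = [
        frozenset(sub)
        for k in range(len(students) + 1)
        for sub in combinations(sorted(students), k)
        if key_holds(sub)
    ]
    return [s for s in consistent if not any(s < t for t in consistent)]


def courses_of_s1(students, courses):
    """Answers to: exists z Course(S1, y, z)."""
    return {y for (sid, y, _grade) in courses if sid == "S1"}


def courses_of_n1(students, courses):
    """Answers to: exists u, v (Student(u, N1, v) and Course(u, x, y))."""
    return {
        (x, y)
        for (u, name, _addr) in students if name == "N1"
        for (cu, x, y) in courses if cu == u
    }


def consistent_answers(query):
    """Intersect the answers over every repair (Course is untouched by the ICs)."""
    per_repair = [query(rep, course) for rep in student_repairs(student)]
    return set.intersection(*per_repair)


print(consistent_answers(courses_of_s1))
print(consistent_answers(courses_of_n1))
```

The second query has answers in the repair that keeps the name N1 and none in the other repair, so the intersection is empty, in agreement with Example 8.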
We present here a method to compute consistent answers to
queries. Given a query $Q$, the query $T_\omega(Q)$ is defined based
on the notion of residue developed in the context of semantic
query optimization (SQO) [5]. In the context of deductive
databases, SQO is used to optimize the process of answering
queries using the semantic knowledge about the domain that
is contained in the ICs. In this case, the basic assumption is
that the ICs are satisfied by the database. In our case, since
we allow inconsistent databases, we do not assume the
satisfaction of the ICs while answering queries. A first attempt
to obtain consistent answers to a query $Q(\bar{x})$ may be to use
query modification, i.e., ask the query $Q(\bar{x}) \wedge IC$. However,
this does not work, as we obtain false as the answer if the
DB is inconsistent. Instead, we iteratively modify the query
$Q$ using the residues. As a result, we obtain the query $T_\omega(Q)$
with the property that the set of all answers to $T_\omega(Q)$ is the
same as the set of consistent answers to $Q$. (As shown
later, the property holds only for restricted classes of queries
and constraints.)
We consider only universal constraints. We begin by trans-
forming every integrity constraint to the standard format (ex-
pansion step).
Definition 8. An integrity constraint is in standard format if
it has the form
$$\forall \Big( \bigvee_{i=1}^{m} P_i(\bar{x}_i) \vee \bigvee_{i=1}^{n} \neg Q_i(\bar{y}_i) \vee \psi \Big),$$
where $\forall$ represents the universal closure of the formula, $\bar{x}_i$,
$\bar{y}_i$ are tuples of variables and $\psi$ is a formula that mentions
only built-in predicates, in particular, equality.
Notice that in such an IC there are no constants in the
$P_i$, $Q_i$; if they are needed they can be pushed into $\psi$.
Many usual ICs that appear in DBs can be transformed to
the standard format, e.g. functional dependencies, set inclusion
dependencies of the form $\forall\bar{x}\,(P(\bar{x}) \rightarrow Q(\bar{x}))$, transitivity
constraints of the form $\forall x, y, z\,(P(x, y) \wedge P(y, z) \rightarrow P(x, z))$.
The usual ICs that appear in SQO in deductive databases
as rules [5] can be also accommodated in this format, including
rules with disjunction and logical negation in their
heads. An inclusion dependency of the form
$\forall\bar{x}\,(P(\bar{x}) \rightarrow \exists y\,Q(\bar{x}, y))$ cannot be transformed to the standard format.
After the expansion of IC, rules associated with the database
schema are generated. This could be seen as considering
an instance of the database as an extensional database ex-
panded with new rules, and so obtaining an associated de-
ductive database where semantical query optimization can
be used.
For each predicate, its negative and positive occurrences
in the ICs (in standard format) will be treated separately with
the purpose of generating corresponding residues and rules.
First, a motivating example.
Example 9. Consider the IC $\forall x\,(\neg P(x) \vee Q(x))$. If $Q(x)$ is
false, then $\neg P(x)$ must be true. Then, when asking about
$\neg Q(x)$, we make sure that $\neg P(x)$ becomes true. That is,
we generate the query $\neg Q(x) \wedge \neg P(x)$ where $\neg P(x)$ is the
residue attached to the query.
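For each IC in standard format
$$\forall \Big( \bigvee_{i=1}^{m} P_i(\bar{x}_i) \vee \bigvee_{i=1}^{n} \neg Q_i(\bar{y}_i) \vee \psi \Big) \qquad (3)$$
and each positive occurrence of a predicate $P_j(\bar{x}_j)$ in it, the
following residue for $\neg P_j(\bar{x}_j)$ is generated
$$\bar{Q} \Big( \bigvee_{i=1}^{j-1} P_i(\bar{x}_i) \vee \bigvee_{i=j+1}^{m} P_i(\bar{x}_i) \vee \bigvee_{i=1}^{n} \neg Q_i(\bar{y}_i) \vee \psi \Big) \qquad (4)$$
where $\bar{Q}$ is a sequence of universal quantifiers over all the
variables in the formula not appearing in $\bar{x}_j$.
If $R_1, \ldots, R_r$ are all the residues for $\neg P_j$, then the following
rule is generated:
$$\neg P_j(\bar{w}) \mapsto \neg P_j(\bar{w})\{R_1(\bar{w}), \ldots, R_r(\bar{w})\},$$
where $\bar{w}$ are new variables. If there are no residues for $\neg P_j$,
then the rule $\neg P_j(\bar{w}) \mapsto \neg P_j(\bar{w})$ is generated.
For each negative occurrence of a predicate $Q_j(\bar{y}_j)$ in (3),
the following residue for $Q_j(\bar{y}_j)$ is generated
$$\bar{Q} \Big( \bigvee_{i=1}^{m} P_i(\bar{x}_i) \vee \bigvee_{i=1}^{j-1} \neg Q_i(\bar{y}_i) \vee \bigvee_{i=j+1}^{n} \neg Q_i(\bar{y}_i) \vee \psi \Big)$$
where $\bar{Q}$ is a sequence of universal quantifiers over all the
variables in the formula not appearing in $\bar{y}_j$.
If $R_1, \ldots, R_s$ are all the residues for $Q_j(\bar{y}_j)$, the following
rule is generated:
$$Q_j(\bar{u}) \mapsto Q_j(\bar{u})\{R_1(\bar{u}), \ldots, R_s(\bar{u})\}.$$
If there are no residues for $Q_j(\bar{y}_j)$, then the rule
$Q_j(\bar{u}) \mapsto Q_j(\bar{u})$ is generated. Notice that there is exactly one new rule
for each positive predicate, and exactly one rule for each
negative predicate.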
If there is more than one positive (negative) occurrence
of a predicate, say $P$, in an IC, then more than one residue
is computed for $P$. In some cases, e.g., for functional
dependencies, the subsequent residues will be redundant. In
other cases, e.g., for transitivity constraints, multiple
residues are not redundant.
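Example 10. If we have the following ICs in standard format
$$IC = \{\forall x\,(R(x) \vee \neg P(x) \vee \neg Q(x)),\ \forall x\,(P(x) \vee \neg Q(x))\},$$
the following rules are generated:
$$P(x) \mapsto P(x)\{R(x) \vee \neg Q(x)\}$$
$$Q(x) \mapsto Q(x)\{R(x) \vee \neg P(x),\ P(x)\}$$
$$R(x) \mapsto R(x)$$
$$\neg P(x) \mapsto \neg P(x)\{\neg Q(x)\}$$
$$\neg Q(x) \mapsto \neg Q(x)$$
$$\neg R(x) \mapsto \neg R(x)\{\neg P(x) \vee \neg Q(x)\}$$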
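The rule generation step can be sketched in a few lines for the special case of Example 10, where every predicate is unary over the same variable $x$, so that no quantifiers or built-in predicates have to be handled. The representation of literals as `(positive, name)` pairs and the function names are assumptions made for this illustration only.

```python
from collections import defaultdict

# A literal is (positive?, predicate name); every predicate here is unary over x.
IC = [
    [(True, "R"), (False, "P"), (False, "Q")],  # R(x) or not P(x) or not Q(x)
    [(True, "P"), (False, "Q")],                # P(x) or not Q(x)
]


def complement(lit):
    positive, name = lit
    return (not positive, name)


def generate_rules(constraints, predicates):
    """Map each literal to the list of residues (clauses) attached to it."""
    rules = defaultdict(list)
    for clause in constraints:
        for i, lit in enumerate(clause):
            residue = clause[:i] + clause[i + 1:]
            # An occurrence of lit yields a residue for the complementary literal.
            rules[complement(lit)].append(residue)
    # Literals without residues still get a (trivial) rule.
    return {
        (sign, p): rules.get((sign, p), [])
        for p in predicates
        for sign in (True, False)
    }


def show(lit):
    positive, name = lit
    return f"{name}(x)" if positive else f"not {name}(x)"


rules = generate_rules(IC, ["P", "Q", "R"])
for lit, residues in rules.items():
    body = ", ".join(" or ".join(show(l) for l in res) for res in residues)
    print(f"{show(lit)} -> {show(lit)} {{{body}}}")
```

Each occurrence of a literal in a constraint contributes the rest of the clause as a residue for the complementary literal, which gives the six rules listed in Example 10.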
Notice that no rules are generated for built-in predicates,
but such predicates may appear in the residues. They have
fixed extensions and thus cannot contribute to the violation
of an IC or be modified to make an IC true. For example, if
we have the IC $\forall x, y, z\,(\neg P(x, y) \vee \neg P(x, z) \vee y = z)$, and the
database satisfies $P(1, 2) \wedge P(1, 3)$, the IC cannot be made
true by making $2 = 3$.
Once the rules have been generated, it is possible to sim-
plify the associated residues. In every new rule of the form
$P(\bar{u}) \mapsto P(\bar{u})\{R_1(\bar{u}), \ldots, R_r(\bar{u})\}$, the auxiliary quantifications
introduced in the expansion step are eliminated (both
the quantifier and the associated variable in the formula)
from the residues by the process inverse to the one applied
in the expansion. The same is done with rules of the form
$\neg P \mapsto \neg P\{\ldots\}$.
$T_\omega(Q)$
In order to determine consistent answers to queries in arbi-
trary databases, we will make use of a family of operators
consisting of $T_n$, $n \geq 0$, and $T_\omega$.
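Definition 9. The application of an operator $T_n$ to a query is
defined inductively by means of the following rules
1. $T_n(\Box) := \Box$, $T_n(\neg\Box) := \neg\Box$, for every $n \geq 0$ ($\Box$ is the
empty clause).
2. $T_0(\varphi) := \varphi$.
3. For each predicate $P(\bar{u})$, if there is a rule
$P(\bar{u}) \mapsto P(\bar{u})\{R_1(\bar{u}), \ldots, R_r(\bar{u})\}$, then
$$T_{n+1}(P(\bar{u})) := P(\bar{u}) \wedge \bigwedge_{i=1}^{r} T_n(R_i(\bar{u})).$$
If $P(\bar{u})$ does not have residues, then $T_{n+1}(P(\bar{u})) := P(\bar{u})$.
4. For each negated predicate $\neg Q(\bar{v})$, if there is a rule
$\neg Q(\bar{v}) \mapsto \neg Q(\bar{v})\{R_1(\bar{v}), \ldots, R_s(\bar{v})\}$, then
$$T_{n+1}(\neg Q(\bar{v})) := \neg Q(\bar{v}) \wedge \bigwedge_{i=1}^{s} T_n(R_i(\bar{v})).$$
If $\neg Q(\bar{v})$ does not have any residues, then
$T_{n+1}(\neg Q(\bar{v})) := \neg Q(\bar{v})$.
5. If $\varphi$ is a formula in prenex disjunctive normal form,
that is,
$$\varphi = \bar{Q} \bigvee_{i=1}^{s} \Big( \bigwedge_{j=1}^{m_i} P_{i,j}(\bar{u}_{i,j}) \wedge \bigwedge_{j=1}^{n_i} \neg Q_{i,j}(\bar{v}_{i,j}) \wedge \psi_i \Big),$$
where $\bar{Q}$ is a sequence of quantifiers and $\psi_i$ is a formula
that includes only built-in predicates, then for every
$n \geq 0$:
$$T_n(\varphi) := \bar{Q} \bigvee_{i=1}^{s} \Big( \bigwedge_{j=1}^{m_i} T_n(P_{i,j}(\bar{u}_{i,j})) \wedge \bigwedge_{j=1}^{n_i} T_n(\neg Q_{i,j}(\bar{v}_{i,j})) \wedge \psi_i \Big).$$
Definition 10. The application of operator $T_\omega$ on a query is
defined as $T_\omega(\varphi) = \bigcup_{n < \omega} \{T_n(\varphi)\}$.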
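Example 11. (example 10 continued) For the query $\neg R(x)$
we have $T_1(\neg R(x)) = \neg R(x) \wedge (\neg P(x) \vee \neg Q(x))$,
$T_2(\neg R(x)) = \neg R(x) \wedge ((\neg P(x) \wedge \neg Q(x)) \vee \neg Q(x))$ and finally
$T_3(\neg R(x)) = T_2(\neg R(x))$. We have reached a fixed point and then
$$T_\omega(\neg R(x)) = \{\neg R(x),\ \neg R(x) \wedge (\neg P(x) \vee \neg Q(x)),\ \neg R(x) \wedge ((\neg P(x) \wedge \neg Q(x)) \vee \neg Q(x))\}.$$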
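The iteration of Example 11 can be replayed mechanically. The sketch below hard-codes the rules of Example 10 for the single-variable case and builds each $T_n$ as a formula string, stopping when two consecutive steps are syntactically identical. The string encoding (`~` for negation, `&` and `|` for the connectives) and the function names are assumptions of this illustration; quantifiers and built-in predicates are not handled.

```python
# Rules of Example 10 over a single variable x: literal -> list of residues,
# each residue being a clause (a list of literals). "~" marks negation.
RULES = {
    "P": [["R", "~Q"]],
    "Q": [["R", "~P"], ["P"]],
    "R": [],
    "~P": [["~Q"]],
    "~Q": [],
    "~R": [["~P", "~Q"]],
}


def t_literal(n, lit):
    """T_n applied to a literal, returned as a formula string."""
    if n == 0 or not RULES[lit]:
        return lit
    conjuncts = [t_clause(n - 1, residue) for residue in RULES[lit]]
    return " & ".join([lit] + conjuncts)


def t_clause(n, clause):
    """T_n applied to a disjunction of literals."""
    parts = []
    for lit in clause:
        text = t_literal(n, lit)
        parts.append(f"({text})" if "&" in text else text)
    return parts[0] if len(parts) == 1 else "(" + " | ".join(parts) + ")"


def t_omega(lit, limit=20):
    """Collect T_0, T_1, ... until two consecutive steps coincide syntactically."""
    steps = [t_literal(0, lit)]
    for n in range(1, limit + 1):
        nxt = t_literal(n, lit)
        if nxt == steps[-1]:
            return steps
        steps.append(nxt)
    raise RuntimeError("no syntactic fixed point within the limit")


for step in t_omega("~R"):
    print(step)
```

The `limit` parameter is a safeguard added for the sketch: for constraint sets where the iteration does not stabilize syntactically, the loop would otherwise never end.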
We show first that the operator $T_\omega$ conservatively extends
standard query evaluation on consistent databases.
Proposition 1. Given a database instance $r$ and a set of
integrity constraints IC, such that $r \models IC$, then for every query
$Q(\bar{x})$ and every natural number $n$: $r \models \forall\bar{x}\,(Q(\bar{x}) \equiv T_n(Q(\bar{x})))$.
Corollary 1. Given a database instance $r$ and a set of
integrity constraints IC, such that $r \models IC$, then for every query
$Q(\bar{x})$ and every tuple $\bar{t}$: $r \models Q(\bar{t})$ if and only if $r \models T_\omega(Q(\bar{t}))$.
Now we will show the relationship between consistent answers
to a query $Q$ in a database instance $r$ (definition 7) and
answers to the query $T_\omega(Q)$ (definition 6). We show that
$T_\omega(Q)$ returns only consistent answers to $Q$.
Theorem 1. (Soundness) Let $r$ be a database instance, IC a
set of integrity constraints and $Q(\bar{x})$ a query (see definition 5)
such that $r \models T_\omega(Q(\bar{x}))[\bar{t}]$. If $Q$ is universal or non-universal
and domain independent [20], then $\bar{t}$ is a consistent answer to
$Q$ in $r$ (in the sense of definition 7), that is, $r \models_c Q(\bar{t})$.
The second condition in the theorem excludes non-universal,
but domain dependent queries like $\exists x\,\neg P(x)$.
Example 12. (example 6 continued) The IC (1) transformed
into the standard format becomes
$$\forall x, y, z, w\,(\neg Supply(x, y, z) \vee \neg Class(z, w) \vee w \neq T_4 \vee x = C).$$
The following rule is generated:
$$Class(z, w) \mapsto Class(z, w)\{\forall x, y\,(\neg Supply(x, y, z) \vee w \neq T_4 \vee x = C)\}.$$
Given the database instance $r_1$ that violates the IC as before,
if we pose the query $Class(z, T_4)$, asking for the items of
class $T_4$, directly to $r_1$, we obtain $I_1$ and $I_2$. Nevertheless, if
we pose the query $T_\omega(Class(z, T_4))$, that is
$$\{Class(z, T_4),\ Class(z, T_4) \wedge \forall x, y\,(\neg Supply(x, y, z) \vee x = C)\},$$
we obtain only $I_1$, eliminating $I_2$. $I_1$ is the only consistent
answer.
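The effect of the rewriting in Example 12 can be seen by evaluating both the original and the rewritten query on $r_1$. This is a minimal sketch with the instance stored as Python sets; the function names are illustrative. The universally quantified residue only needs to be checked against the tuples actually present in Supply, since $\neg Supply(x, y, z)$ holds for every other choice of $x$ and $y$.

```python
supply = {("C", "D1", "I1"), ("D", "D2", "I2")}
class_ = {("I1", "T4"), ("I2", "T4")}


def plain_answers():
    """Answers to Class(z, T4) evaluated directly on the instance."""
    return {z for (z, w) in class_ if w == "T4"}


def rewritten_answers():
    """Answers to Class(z, T4) and forall x, y (not Supply(x, y, z) or x = C)."""
    return {
        z
        for z in plain_answers()
        if all(x == "C" for (x, _y, item) in supply if item == z)
    }


print(plain_answers())
print(rewritten_answers())
```

No repair is computed here: the rewritten query is evaluated on the original, inconsistent instance.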
Example 13. (example 8 continued) In the standard format,
the ICs take the form
$$\forall x, y, z, u, v\,(\neg Student(x, y, z) \vee \neg Student(x, u, v) \vee y = u),$$
$$\forall x, y, z, u, v\,(\neg Student(x, y, z) \vee \neg Student(x, u, v) \vee z = v).$$
The following rule is generated
$$Student(x, y, z) \mapsto Student(x, y, z)\{\forall u, v\,(\neg Student(x, u, v) \vee y = u),\ \forall u, v\,(\neg Student(x, u, v) \vee z = v)\}.$$
Given the inconsistent database instance $r_3$, if we pose the
query $\exists z\,Course(S_1, y, z)$, asking for the names of the courses
of the student with number $S_1$, we obtain $C_1$ and $C_2$. If we
pose the query
$$T_\omega(\exists z\,Course(S_1, y, z)) = \{\exists z\,Course(S_1, y, z)\}$$
we obviously obtain the same answers which, in this case,
are the consistent answers. Intuitively, in this case the $T_\omega$
operator helps us to establish that even when the name of the
student with number $S_1$ is undetermined, it is still possible
to obtain the list of courses in which he/she is registered. On
the other hand, if we pose the query
$$\exists u, v\,(Student(u, N_1, v) \wedge Course(u, x, y))$$
about the courses and grades for a student with name $N_1$, to
$r_3$, we obtain $(C_1, G_1)$ and $(C_2, G_2)$. Nevertheless, if we ask
$$T_\omega(\exists u, v\,(Student(u, N_1, v) \wedge Course(u, x, y)))$$
we obtain, in conjunction with the original query, the formula:
$$\exists u, v\,(Student(u, N_1, v) \wedge \forall y', z'\,(\neg Student(u, y', z') \vee y' = N_1) \wedge \forall y', z'\,(\neg Student(u, y', z') \vee z' = v) \wedge Course(u, x, y)),$$
from which we obtain the empty set of tuples. This answer
is intuitively consistent, because the number of the student
with name $N_1$ is uncertain, and in consequence it is not possible
to find out in which courses he/she is registered. The
set of answers obtained with the $T_\omega$ operator coincides with
the set of consistent answers, which is empty.
Definition 11. A binary integrity constraint (BIC) is a sentence
of the form
$$\forall\,(l_1(\bar{x}_1) \vee l_2(\bar{x}_2) \vee \psi(\bar{x})),$$
where $l_1$ and $l_2$ are literals, and $\psi$ is a formula that only
contains built-in predicates.
Examples of BICs include: functional dependencies, symmetry
constraints, set inclusion dependencies of the form
$\forall\bar{x}\,(P(\bar{x}) \rightarrow Q(\bar{x}))$.
Definition 12. Given a set of sentences $\Sigma$ in the language
of the database schema DB, and a sentence $\varphi$, we denote by
$\Sigma \models_{DB} \varphi$ the fact that, for every instance $r$ of the database, if
$r \models \Sigma$, then $r \models \varphi$.
Theorem 2. (Completeness for BICs) Given a set IC of binary
integrity constraints, if for every literal $l'(\bar{a})$, $IC \not\models_{DB} l'(\bar{a})$,
then the operator $T_\omega$ is complete, that is, for every
ground literal $l(\bar{t})$, if $r \models_c l(\bar{t})$ then $r \models T_\omega(l(\bar{t}))$.
The theorem says that every consistent answer to a query
of the form $l(\bar{x})$ is captured by the $T_\omega$ operator. Actually,
proposition 2 in the appendix and the completeness theorem
can be easily extended to the case of queries that are conjunctions
of literals. Notice that the finiteness of $T_\omega(l(\bar{x}))$ is
not a part of the hypothesis in this theorem. The hypothesis
of the theorem requires that the ICs are not enough to
answer a literal query by themselves; they do not contain
definite knowledge about the literals.
Example 14. We can see in example 12, where BICs and
queries which are conjunctions of literals appear, that the
operator $T_\omega$ gave us all the consistent answers, as implied
by the theorem.
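Corollary 2. If IC is a set of functional dependencies (FDs)
$$IC = \{\forall\,(P_1(\bar{x}_1, y_1) \wedge P_1(\bar{x}_1, z_1) \rightarrow y_1 = z_1),\ \ldots,\ \forall\,(P_n(\bar{x}_n, y_n) \wedge P_n(\bar{x}_n, z_n) \rightarrow y_n = z_n)\} \qquad (5)$$
then the operator $T_\omega$ is complete for consistent answers to
queries that are conjunctions of literals.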
Example 15. In example 13 we had FDs that are also BICs.
Thus the operator $T_\omega$ found all the consistent answers, even
for some queries that are not conjunctions of literals, show-
ing that this is not a necessary condition.
Example 16. Here we will show that in general completeness
is not obtained for queries that are not conjunctions of
literals. Consider the IC: $\forall x, y, z\,(\neg P(x, y) \vee \neg P(x, z) \vee y = z)$
and the inconsistent instance $r$ with $\Sigma(r) = \{P(a, b), P(a, c)\}$.
This database has two repairs: $r'$ with $\Sigma(r') = \{P(a, b)\}$; and
$r''$ with $\Sigma(r'') = \{P(a, c)\}$. We have that $r \models_c \exists x\,P(a, x)$,
because the query is true in the two repairs.
Now, it is easy to see that $T_\omega(\exists u\,P(a, u))$ is logically equivalent
to $\exists u\,(P(a, u) \wedge \forall z\,(\neg P(a, z) \vee z = u))$. So, we have
$r \not\models T_\omega(\exists x\,P(a, x))$. Thus, the consistent answer true is not
captured by the operator $T_\omega$.
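The gap exhibited in Example 16 can be reproduced directly. The sketch below compares the repair-based evaluation of $\exists x\,P(a, x)$ with the evaluation of the rewritten query on the inconsistent instance. It assumes that the repairs under a functional dependency are the maximal consistent subsets of $P$; the function names are illustrative only.

```python
from itertools import combinations

P = {("a", "b"), ("a", "c")}  # the inconsistent instance r


def fd_holds(p):
    """FD: the first attribute of P determines the second."""
    return all(y == z for (x1, y) in p for (x2, z) in p if x1 == x2)


def fd_repairs(p):
    """Maximal subsets of p satisfying the FD."""
    consistent = [
        frozenset(sub)
        for k in range(len(p) + 1)
        for sub in combinations(sorted(p), k)
        if fd_holds(sub)
    ]
    return [s for s in consistent if not any(s < t for t in consistent)]


def query(p):
    """exists x P(a, x)."""
    return any(x == "a" for (x, _y) in p)


def rewritten_query(p):
    """exists u (P(a, u) and forall z (not P(a, z) or z = u))."""
    return any(
        x == "a" and all(z == u for (x2, z) in p if x2 == "a")
        for (x, u) in p
    )


consistently_true = all(query(rep) for rep in fd_repairs(P))
captured = rewritten_query(P)
print(consistently_true, captured)
```

The query holds in both repairs, yet the rewritten query fails on $r$ because no single witness $u$ is the unique second component for $a$.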
The following theorem applies to arbitrary ICs and general-
izes Theorem 2.
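Theorem 3. (Completeness) Let IC be a set of integrity constraints,
$l(\bar{x})$ a literal, and $T_n(l(\bar{x}))$ of the form
$$l(\bar{x}) \wedge \bigwedge_{i=1}^{m} \forall(\bar{x}_i, \bar{y}_i)\,\big(C_i(\bar{x}, \bar{x}_i) \vee \psi_i(\bar{x}, \bar{y}_i)\big).$$
If for every $n \geq 0$, there is $S \subseteq \{1, \ldots, m\}$ such that
1. for every $j \in S$ and every tuple $\bar{a}$: $IC \not\models_{DB} C_j(\bar{a})$, and
2. $\bigwedge_{i \in S} \forall(\bar{x}_i, \bar{y}_i)\,(C_i(\bar{x}, \bar{x}_i) \vee \psi_i(\bar{x}, \bar{y}_i))$ implies
$\bigwedge_{i=1}^{m} \forall(\bar{x}_i, \bar{y}_i)\,(C_i(\bar{x}, \bar{x}_i) \vee \psi_i(\bar{x}, \bar{y}_i))$,
then $r \models_c l(\bar{t})$ implies $r \models T_\omega(l(\bar{t}))$.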
This theorem can be extended to conjunctions of literals.
Notice that the theorem requires a condition for every n.
Its application is obviously simplified if we know that the
iteration terminates. This is an issue to be analyzed in the
next section.
Termination means that the operator $T_\omega$ returns a finite set
of formulas. It is clearly important because then the set of
consistent answers can be computed by evaluating a single,
finite query. We distinguish between three different notions
of termination.
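Definition 13. Given a set of ICs and a query $Q(\bar{x})$, we say
that $T_\omega(Q(\bar{x}))$ is
1. syntactically finite if there is an $n$ such that $T_n(Q(\bar{x}))$
and $T_{n+1}(Q(\bar{x}))$ are syntactically the same.
2. semantically finite if there is an $n$ such that for all
$m \geq n$, $\forall\bar{x}\,(T_n(Q(\bar{x})) \equiv T_m(Q(\bar{x})))$ is valid.
3. semantically finite in an instance $r$, if there is an $n$ such
that for all $m \geq n$, $r \models \forall\bar{x}\,(T_n(Q(\bar{x})) \equiv T_m(Q(\bar{x})))$.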
The number $n$ in cases 2 and 3 is called a point of finiteness.
It is clear that 1 implies 2 and 2 implies 3. In the full
version we will show that all these implications are proper.
In all these cases, evaluating $T_\omega(Q(\bar{x}))$ gives the same result
as evaluating $T_n(Q(\bar{x}))$ for some $n$ (in the instance $r$ in case
3). If $T_\omega(Q(\bar{x}))$ is semantically finite, sound and complete,
then the set of consistent answers to $Q$ is first-order definable.
The notion of syntactical finiteness is important because then
for some $n$ and all $m \geq n$, $T_m(Q(\bar{x}))$ will be exactly the same.
In consequence, $T_\omega(Q)$ will be a finite set of formulas. In
addition, a point of finiteness ncan be detected (if it exists)
by syntactically comparing every two consecutive steps in
the iteration. No simplification rules need to be considered,
because the iterative procedure is fully deterministic.
Here we introduce a necessary and sufficient condition
for syntactical finiteness.
Definition 14. A set of integrity constraints IC is acyclic if
there exists a function $f$ from predicate names plus negations
of predicate names in the database to the natural numbers,
that is, $f: \{p_1, \ldots, p_n, \neg p_1, \ldots, \neg p_n\} \rightarrow \mathbb{N}$, such that for
every integrity constraint $\forall\,(\bigvee_{i=1}^{k} l_i(\bar{x}_i) \vee \psi(\bar{x})) \in IC$ as in (3),
and every $i$ and $j$ ($1 \leq i, j \leq k$), if $i \neq j$, then $f(\neg l_i) > f(l_j)$.
(Here $\neg l_i$ is the literal complementary to $l_i$.)
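Example 17. The set of ICs
$$IC = \{\forall x\,(\neg P(x) \vee \neg Q(x) \vee S(x)),\ \forall x, y\,(\neg Q(x) \vee \neg S(y) \vee T(x, y))\}$$
is acyclic, because the function $f$ defined by
$f(P) = 2$, $f(Q) = 2$, $f(\neg P) = 0$, $f(\neg Q) = 0$,
$f(S) = 1$, $f(T) = 0$, $f(\neg S) = 1$, $f(\neg T) = 2$,
satisfies the condition of definition 14.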
Example 18. The set of ICs
$$IC = \{\forall x\,(\neg P(x) \vee \neg Q(x) \vee S(x)),\ \forall x, y\,(Q(x) \vee \neg S(y) \vee T(x, y))\}$$
is not acyclic, because for any function $f$ that we may attempt
to use to satisfy the condition in definition 14, from
the first integrity constraint we obtain $f(Q) > f(S)$, and from
the second, we would obtain $f(S) > f(Q)$; a contradiction.
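The condition of Definition 14 can be tested mechanically. One way, sketched below, is to read each requirement $f(\neg l_i) > f(l_j)$ as an edge from $\neg l_i$ to $l_j$ in a directed graph over literal names; a ranking into the natural numbers exists exactly when this finite graph has no cycle, and the length of the longest outgoing path can then serve as $f$. The encoding of literals as strings with a `~` prefix and the function names are assumptions of this illustration.

```python
def complement(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit


def acyclic_ranking(constraints):
    """Return a ranking f (dict) satisfying Definition 14, or None if none exists."""
    # Edge u -> v records the requirement f(u) > f(v).
    edges = {}
    for clause in constraints:
        for i, li in enumerate(clause):
            for j, lj in enumerate(clause):
                if i != j:
                    edges.setdefault(complement(li), set()).add(lj)
                    edges.setdefault(lj, set())

    rank, on_path = {}, set()

    def height(node):
        """Length of the longest path leaving node; raises on a cycle."""
        if node in rank:
            return rank[node]
        if node in on_path:
            raise ValueError("cycle")
        on_path.add(node)
        h = max((height(s) + 1 for s in edges[node]), default=0)
        on_path.discard(node)
        rank[node] = h
        return h

    try:
        for node in list(edges):
            height(node)
    except ValueError:
        return None
    return rank


# Built-in predicates are omitted; only the database literals matter here.
example_17 = [["~P", "~Q", "S"], ["~Q", "~S", "T"]]
example_18 = [["~P", "~Q", "S"], ["Q", "~S", "T"]]
print(acyclic_ranking(example_17))
print(acyclic_ranking(example_18))
```

For Example 17 the longest-path ranking happens to coincide with the function $f$ given there; for Example 18 the cycle between $Q$ and $S$ is detected and no ranking is returned.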
Theorem 4. A set of integrity constraints IC is acyclic iff
for every literal name $l$ in the database schema, $T_\omega(l(\bar{x}))$ is
syntactically finite.
The theorem can be extended to any class of queries sat-
isfying Definition 5.
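Example 19. The set of integrity constraints in example 18
is not acyclic. In that case $T_\omega(Q(x))$ is infinite.
Example 20. The ICs in example 17 are acyclic. There we
have
$T_\omega(P(u)) = \{$
  $P(u)$,
  $P(u) \wedge (\neg Q(u) \vee S(u))$,
  $P(u) \wedge (\neg Q(u) \vee (S(u) \wedge \forall v\,(\neg Q(v) \vee T(v, u))))\ \}$
$T_\omega(Q(u)) = \{$
  $Q(u)$,
  $Q(u) \wedge (\neg P(u) \vee S(u)) \wedge \forall v\,(\neg S(v) \vee T(u, v))$,
  $Q(u) \wedge (\neg P(u) \vee (S(u) \wedge \forall w\,(\neg Q(w) \vee T(w, u)))) \wedge \forall v\,((\neg S(v) \wedge (\neg P(v) \vee \neg Q(v))) \vee T(u, v))\ \}$
$T_\omega(S(u)) = \{S(u),\ S(u) \wedge \forall v\,(\neg Q(v) \vee T(v, u))\}$
$T_\omega(T(u, v)) = \{T(u, v)\}$
$T_\omega(\neg P(u)) = \{\neg P(u)\}$
$T_\omega(\neg Q(u)) = \{\neg Q(u)\}$
$T_\omega(\neg S(u)) = \{\neg S(u),\ \neg S(u) \wedge (\neg P(u) \vee \neg Q(u))\}$
$T_\omega(\neg T(u, v)) = \{$
  $\neg T(u, v)$,
  $\neg T(u, v) \wedge (\neg Q(u) \vee \neg S(v))$,
  $\neg T(u, v) \wedge (\neg Q(u) \vee (\neg S(v) \wedge (\neg P(v) \vee \neg Q(v))))\ \}$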
Corollary 3. For functional dependencies and a query $Q(\bar{x})$,
$T_\omega(Q(\bar{x}))$ is always syntactically finite.
Definition 15. A constraint $C$ in clausal form is uniform if
for every literal $l(\bar{x})$ in it, the set of variables in $l(\bar{x})$ is the
same as the set of variables in $C - l(\bar{x})$. A set of constraints
is uniform if all the constraints in it are uniform.
Examples of uniform constraints include set inclusion
dependencies of the form $\forall\bar{x}\,(P(\bar{x}) \rightarrow Q(\bar{x}))$, e.g., Example
4.
Theorem 5. If a set of integrity constraints IC is uniform,
then for every literal name $l$ in the database schema, $T_\omega(l(\bar{x}))$
is semantically finite. Furthermore, a point of finiteness $n$
can be bounded from above by a function of the number of
variables in the query, and the number of predicates (and
their arities) in the query and IC.
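Theorem 6. Let $l$ be a literal name. If for some $n$,
$$\forall\bar{x}\,(T_n(l(\bar{x})) \rightarrow T_{n+1}(l(\bar{x})))$$
is valid, then for all $m \geq n$,
$$\forall\bar{x}\,(T_n(l(\bar{x})) \equiv T_m(l(\bar{x})))$$
is valid.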
According to Theorem 6, we can detect a point of finite-
ness by comparing every two consecutive steps wrt logical
implication. Although this is undecidable in general, we
might try to apply semidecision procedures, for example,
automated theorem proving. We have successfully made use
of OTTER [17] in some cases that involve sets of constraints
that are neither acyclic nor uniform. Examples include multivalued
dependencies, and functional dependencies together
with set inclusion dependencies. For multivalued dependencies,
Theorem 6 together with Theorem 3 gives completeness
of $T_\omega(l(\bar{x}))$ where $l(\bar{x})$ is a negative literal. The criterion
from Theorem 6 is also applicable to uniform constraints,
providing potentially faster termination detection
than the proof of Theorem 5.
Theorem 7. If $Q(\bar{x})$ is a domain independent query, then
for every database instance $r$ there is an $n$, such that for all
$m \geq n$, $r \models \forall\bar{x}\,(T_n(Q(\bar{x})) \equiv T_m(Q(\bar{x})))$.
Notice that this theorem does not include the case of negative
literals, as in the case of theorem 5.
Bry [4] was, to our knowledge, the first author to consider
the notion of consistent query answer in inconsistent databases.
He defined consistent query answers based on provability
in minimal logic, without giving, however, a proof
procedure or any other computational mechanism for obtaining
such answers. He didn’t address the issues of semantics,
soundness or completeness.
It has been widely recognized that in database integration
the integrated data may be inconsistent with the integrity
constraints. A typical (theoretical) solution is to augment the
data model to represent disjunctive information. The follow-
ing example explains the need for a solution of this kind.
Example 21. Consider the functional dependency
$$\forall x, y, z\,(P(x, y) \wedge P(x, z) \rightarrow y = z).$$
If the integrated database contains both $P(a, b)$ and $P(a, c)$,
then the functional dependency is violated. Each of $P(a, b)$
and $P(a, c)$ may be coming from a different database that
satisfies the dependency. Thus, both facts are replaced by
their disjunction $P(a, b) \vee P(a, c)$ in the integrated database.
Now the functional dependency is no longer violated.
To solve this kind of problem, [1] introduced the notion
of flexible relation, a non-1NF relation that contains tuples
with sets of non-key values (with such a set standing for one
of its elements). This approach is limited to primary key
functional dependencies and was subsequently generalized
to other key functional dependencies [9]. In the same con-
text, [3, 12] proposed to use disjunctive Datalog and [16]
tables with OR-objects. [1] introduced flexible relational algebra
to query flexible relations, and [9] introduced flexible relational
calculus (whose subset can be translated to flexible relational
algebra). The remaining papers did not discuss query lan-
guage issues, relying on the existing approaches to query
disjunctive Datalog or tables with OR-objects. There are
several important differences between the above approaches
and ours. First, they rely on the construction of a single (dis-
junctive) instance and the deletion of conflicting tuples. In
our approach, the underlying databases are incorporated into
the integrated one in toto, without any changes. There is no
need for introducing disjunctive information. It would be
interesting to compare the scope and the computational re-
quirements of both approaches. For instance, one should
note that the single-instance approach is not incremental:
Any changes in the underlying databases require the recom-
putation of the entire instance. Second, our approach seems
to be unique, in the context of database integration, in con-
sidering tuple insertions as possible repairs for integrity vi-
olations. Therefore, in some cases consistent query answers
may be different from query answers obtained from the cor-
responding single instance.
Example 22. Consider the integrity constraint $p \rightarrow q$ and a
fact $p$. The instance consisting of $p$ alone does not satisfy
the integrity constraint. The common solution for removing
this violation is to delete $p$. However, in our approach
inserting $q$ is also a possible repair. This has consequences
for the inferences about $\neg p$ and $\neg q$. Our approach returns
false in both cases, as $p$ (resp. $q$) is true in a possible repair.
Other approaches return true (under CWA) or undefined (under
OWA).
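The repairs in Example 22 can be checked mechanically. The following is a minimal brute-force sketch, assuming a propositional database represented as a set of true atoms, the constraint read as "p implies q", and repairs taken to be the consistent instances whose symmetric difference with the original is minimal under set inclusion. The names `repairs` and `consistent_answer` are hypothetical helpers introduced only for this illustration; this is not the query-rewriting method of the paper, and the enumeration is exponential in the number of atoms.

```python
from itertools import combinations

def repairs(instance, atoms, constraint):
    """Return the instances over `atoms` that satisfy `constraint` and whose
    symmetric difference with `instance` is minimal under set inclusion."""
    instance = frozenset(instance)
    atoms = sorted(atoms)
    # Enumerate every candidate instance, i.e. every subset of the atoms.
    candidates = [frozenset(c)
                  for n in range(len(atoms) + 1)
                  for c in combinations(atoms, n)]
    consistent = [c for c in candidates if constraint(c)]
    deltas = {c: instance ^ c for c in consistent}
    # Keep a candidate only if no other candidate has a strictly smaller delta.
    return [c for c in consistent
            if not any(deltas[o] < deltas[c] for o in consistent)]

def consistent_answer(instance, atoms, constraint, query):
    """True iff `query` holds in every repair of `instance`."""
    return all(query(r) for r in repairs(instance, atoms, constraint))

# Assumed reading of Example 22: constraint "p implies q", database {p}.
ic = lambda db: ("p" not in db) or ("q" in db)
reps = repairs({"p"}, {"p", "q"}, ic)

not_p = consistent_answer({"p"}, {"p", "q"}, ic, lambda db: "p" not in db)
not_q = consistent_answer({"p"}, {"p", "q"}, ic, lambda db: "q" not in db)
```

Under these assumptions the two repairs are the empty instance (delete p) and {p, q} (insert q), so neither the negation of p nor the negation of q holds in every repair.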
Our work has connections with research done on belief
revision [10]. In our case, we have an implicit notion of re-
vision that is determined by the set of repairs of the database,
and corresponds to revising the database (or a suitable cat-
egorical theory describing it) by the set of integrity con-
straints. Thus, querying the inconsistent database expect-
ing only correct answers corresponds to querying the revised
theory without restrictions.
It is easy to see that our notion of repair of a relational
database is a particular case of the local semantics intro-
duced in [8], restricted to revision performed starting from
a single model (the database). From this we obtain that our
revision operator satisfies the postulates (R1)–(R5), (R7),
(R8) in [13]. For each given database $r$, the relation $\leq_r$
introduced in Definition 3 provides the partial order between
models that determines the (models of the) revised database
as described in [13]. [8] concentrates on the computation
of the models of the revised theory, i.e. the repairs in our
case, whereas we do not compute the repairs, but keep querying
the original, non-revised database and pose a modified
query. Therefore, we can view our methodology as a way
of representing and querying simultaneously all the repairs
of the database by means of a new query. Nevertheless, our
motivation and starting point is quite different from belief
revision. We attempt to take direct advantage of the seman-
tic information contained in the integrity constraints in order
to answer queries, rather than revising the database. Revising
the database means repairing all the inconsistencies in it,
whereas we are interested only in the information related to particular
queries. For instance, a query referring only to the
consistent portion of the database can be answered without
repairing the database.
Reasoning in the presence of inconsistency has been an
important research problem in the area of knowledge representation.
The goal is to design logical formalisms that limit
what can be inferred from an inconsistent set of formulas.
One does not want to infer all formulas (as required by the
classical two-valued logic). Also, one prefers not to infer a
formula together with its negation. The formalisms satisfy-
ing the above properties, e.g., [15], are usually propositional.
Moreover, they do not distinguish between integrity con-
straints and database facts. Thus, if the data in the database
violates an integrity constraint, the constraint itself can no
longer be inferred (which is not acceptable in the database
context).
Example 23. Assume the integrity constraint is $\neg p \vee \neg q$
and the database contains the facts $p$ and $q$. In the approach
of [15], $p \vee q$ can be inferred (minimal change is captured
correctly) but $p$, $q$ and $\neg p \vee \neg q$ can no longer be inferred
(they are all involved in an inconsistency).
Because of the above-mentioned limitations, such methods
are not directly applicable to the problem of computing con-
sistent query answers.
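For contrast, the repair-based reading of Example 23 can be sketched by brute force. The sketch below assumes the constraint is "not p or not q" and the database is {p, q}; the helper names (`all_instances`, `repairs`, `in_all_repairs`) are hypothetical and only serve this illustration. It says nothing about how the formalism of [15] is computed.

```python
from itertools import product

ATOMS = ("p", "q")

def all_instances():
    """Every propositional instance over ATOMS, as a frozenset of true atoms."""
    for bits in product((False, True), repeat=len(ATOMS)):
        yield frozenset(a for a, b in zip(ATOMS, bits) if b)

def repairs(db, constraint):
    """Consistent instances whose symmetric difference with db is
    minimal under set inclusion (brute force, illustration only)."""
    db = frozenset(db)
    ok = [inst for inst in all_instances() if constraint(inst)]
    return [r for r in ok if not any((db ^ o) < (db ^ r) for o in ok)]

# Assumed reading of Example 23: constraint "not p or not q", database {p, q}.
constraint = lambda inst: "p" not in inst or "q" not in inst
reps = repairs({"p", "q"}, constraint)

def in_all_repairs(formula):
    """True iff the formula holds in every repair."""
    return all(formula(r) for r in reps)

results = {
    "p": in_all_repairs(lambda r: "p" in r),
    "q": in_all_repairs(lambda r: "q" in r),
    "p or q": in_all_repairs(lambda r: "p" in r or "q" in r),
    "not p or not q": in_all_repairs(constraint),
}
```

Under these assumptions the repairs are {p} and {q}: the disjunction of p and q and the constraint itself hold in every repair, while p and q individually do not. The constraint therefore remains inferable in the repair-based setting.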
Deontic logic [18, 14], a modal logic with operators capturing
permission and obligation, has been used for the specification
of integrity constraints. [14] used the obligation operator
$O$ to distinguish integrity constraints that have to hold
always from database facts that just happen to hold. [18]
used deontic operators to describe policies whose violations
can then be caught and handled. The issues of possible re-
pairs of constraint violations, their minimality and consistent
query answers are not addressed.
Gertz [11] described techniques and algorithms for com-
puting repairs of constraint violations. The issue of query
answering in the presence of inconsistency is not addressed
in his work.
This paper represents a first step in the development of a
new research area dealing with the theory and applications
of consistent query answers in arbitrary, consistent or incon-
sistent, databases.
The theoretical results presented here are preliminary. We
have proved a general soundness result but the results about
completeness and termination are still partial. Also, one
needs to look beyond purely universal constraints to include
general inclusion dependencies. In a forthcoming paper we
will also describe our methodology for using automated the-
orem proving, in particular, OTTER, for proving termina-
tion.
It appears that in order to obtain completeness for disjunctive
and existentially quantified queries one needs to move
beyond the $T_\omega$ operator on queries. Also, the upper bounds
on the size of $T_\omega$ and the lower bounds on the complexity of
computing consistent answers for different classes of queries
and constraints need to be studied. In [2] it is shown that in
the propositional case, SAT is reducible in polynomial time
to the problem of deciding if an arbitrary formula evaluated
in the propositional database does not give true as a correct
answer, that is, whether it becomes false in some repair. From this it
follows that this problem is NP-complete.
There is an interesting connection to modal logic. Consider
Definition 7. We could write $r \models \Box Q(\bar{t})$, meaning
that $Q(\bar{t})$ is true in all repairs of $r$, the database instances
that are “accessible” from $r$. This is even more evident from
Example 16, where, in essence, it is shown that $\Box \exists \bar{x}\, Q(\bar{x})$ is
not logically equivalent to $\exists \bar{x}\, \Box Q(\bar{x})$, which is what usually
happens in modal logic.
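The gap between "true in all repairs that some value exists" and "some value is true in all repairs" can be illustrated with a small sketch. It assumes a hypothetical binary relation Emp whose first attribute is a key, and it assumes that the repairs of a key-violating relation keep exactly one tuple per key value. The relation, its contents, and the helper `key_repairs` are invented for this illustration and are not taken from Example 16.

```python
from itertools import product

def key_repairs(tuples):
    """Repairs of a binary relation whose first attribute is a key
    (assumption): keep exactly one tuple from each group of tuples
    that share the same key value."""
    groups = {}
    for key, value in sorted(tuples):
        groups.setdefault(key, []).append((key, value))
    return [frozenset(choice) for choice in product(*groups.values())]

# Hypothetical inconsistent relation: "ann" has two conflicting values.
emp = {("ann", 10), ("ann", 20), ("bob", 30)}
reps = key_repairs(emp)
values = {v for _, v in emp}

# Box-exists: in every repair there is some y with Emp(ann, y).
box_exists = all(any(("ann", y) in r for y in values) for r in reps)

# Exists-box: there is some y such that Emp(ann, y) holds in every repair.
exists_box = any(all(("ann", y) in r for r in reps) for y in values)
```

In this sketch every repair contains some tuple for "ann", so the first check succeeds, but no single tuple for "ann" survives in both repairs, so the second check fails.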
This research has been partially supported by FONDECYT
Grants (1971304 & 1980945) and NSF Grant (IRI-9632870).
Part of this research was done when the second author was
on sabbatical at the Technical University of Berlin (CIS Group)
with the financial support from DAAD and DIPUC.
[1] S. Agarwal, A.M. Keller, G. Wiederhold, and
K. Saraswat. Flexible Relation: An Approach for
Integrating Data from Multiple, Possibly Inconsistent
Databases. In IEEE International Conference on Data
Engineering, 1995.
[2] M. Arenas, L. Bertossi, and M. Kifer. APC and Query-
ing Inconsistent Databases. In preparation.
[3] C. Baral, S. Kraus, J. Minker, and V.S. Subrahma-
nian. Combining Knowledge Bases Consisting of First-
Order Theories. Computational Intelligence, 8:45–71,
1992.
[4] F. Bry. Query Answering in Information Systems with
Integrity Constraints. In IFIP WG 11.5 Working Conference
on Integrity and Control in Information Systems.
Chapman & Hall, 1997.
[5] U.S. Chakravarthy, J. Grant, and J. Minker. Logic-
Based Approach to Semantic Query Optimization.
ACM Transactions on Database Systems, 15(2):162–
207, 1990.
[6] S. Chaudhuri and U. Dayal. An Overview of
Data Warehousing and OLAP Technology. SIGMOD
Record, 26, March 1997.
[7] J. Chomicki and G. Saake, editors. Logics for
Databases and Information Systems. Kluwer Aca-
demic Publishers, Boston, 1998.
[8] T. Chou and M. Winslett. A Model-Based Belief Re-
vision System. J. Automated Reasoning, 12:157–208,
1994.
[9] Phan Minh Dung. Integrating Data from Possibly In-
consistent Databases. In International Conference on
Cooperative Information Systems, Brussels, Belgium,
1996.
[10] P. Gärdenfors and H. Rott. Belief Revision. In D. M.
Gabbay, C. J. Hogger, and J. A. Robinson, editors,
Handbook of Logic in Artificial Intelligence and Logic
Programming, volume 4, pages 35–132. Oxford University
Press, 1995.
[11] M. Gertz. Diagnosis and Repair of Constraint Violations
in Database Systems. PhD thesis, Universität
Hannover, 1996.
[12] P. Godfrey, J. Grant, J. Gryz, and J. Minker. Integrity
Constraints: Semantics and Applications. In Chomicki
and Saake [7], chapter 9.
[13] H. Katsuno and A. Mendelzon. Propositional Knowl-
edge Base Revision and Minimal Change. Artificial
Intelligence, 52:263–294, 1991.
[14] K.L. Kwast. A Deontic Approach to Database In-
tegrity. Annals of Mathematics and Artificial Intelli-
gence, 9:205–238, 1993.
[15] J. Lin. A Semantics for Reasoning Consistently in the
Presence of Inconsistency. Artificial Intelligence, 86(1-
2):75–95, 1996.
[16] J. Lin and A. O. Mendelzon. Merging Databases un-
der Constraints. International Journal of Cooperative
Information Systems, 7(1):55–76, 1996.
[17] W.W. McCune. OTTER 3.0 Reference Manual and
Guide. Argonne National Laboratory, Technical Re-
port ANL-94/6, 1994.
[18] J.-J. Meyer, R. Wieringa, and F. Dignum. The Role
of Deontic Logic in the Specification of Information
Systems. In Chomicki and Saake [7], chapter 4.
[19] Jean-Marie Nicolas. Logic for Improving Integrity
Checking in Relational Data Bases. Acta Informatica,
18:227–253, 1982.
[20] J. Ullman. Principles of Database and Knowledge-
Base Systems, Vol. I. Computer Science Press, 1988.
Some technical lemmas are stated without proof. Full proofs
can be found in the file in
.
Lemma 1. If $r \models T_\omega(l(\bar{a}))$, where $l(\bar{a})$ is a ground literal,
then for every repair $r'$ of $r$, it holds that $r' \models l(\bar{a})$.
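Lemma 2. If $r \models T_\omega(\bigwedge_{i=1}^{n} l_i(\bar{a}_i))$, where $l_i(\bar{a}_i)$ is a ground
literal, then for every repair $r'$ of $r$, it holds that $r' \models \bigwedge_{i=1}^{n} l_i(\bar{a}_i)$.
Lemma 3. If $r \models T_\omega(\bigvee_{i=1}^{n} C_i(\bar{a}_i))$, with $C_i(\bar{a}_i)$ a conjunction
of literals, then for every repair $r'$ of $r$, $r' \models \bigvee_{i=1}^{n} C_i(\bar{a}_i)$.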
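Lemma 4. Let $Q(\bar{x})$ be a universal query. If $r \models T_\omega(Q(\bar{t}))$, for
a ground tuple $\bar{t}$, then for every repair $r'$ of $r$, $r' \models Q(\bar{t})$.
Lemma 5. Let $Q(\bar{x})$ be a domain independent query. If
$r \models T_\omega(Q(\bar{t}))$, for a ground tuple $\bar{t}$, then for every repair $r'$ of $r$,
$r' \models Q(\bar{t})$.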
Proof of Theorem 1: Lemmas 4 and 5.
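Proposition 2. Given a set $IC$ of integrity constraints and a
ground clause $\bigvee_{i=1}^{m} l_i(\bar{t}_i)$, if $IC \not\models_{DB} \bigvee_{i=1}^{m} l_i(\bar{t}_i)$ and, for every
repair $r'$ of $r$, $r' \models \bigvee_{i=1}^{m} l_i(\bar{t}_i)$, then $r \models \bigvee_{i=1}^{m} l_i(\bar{t}_i)$.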
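Proof of Proposition 2: Assume that $r \not\models \bigvee_{i=1}^{m} l_i(\bar{t}_i)$. By
hypothesis $IC \not\models_{DB} \bigvee_{i=1}^{m} l_i(\bar{t}_i)$, thus there exists an instance
of the database $r'$ such that $r' \models IC \cup \{\neg \bigvee_{i=1}^{m} l_i(\bar{t}_i)\}$. Let us
consider the set of database instances
$$R = \{\, r^{*} \mid r^{*} \models IC \text{ and } \Delta(r, r^{*}) \subseteq \Delta(r, r') \,\}$$
We know that $\Delta(r, r')$ is finite, therefore there exists $r_0 \in R$
such that $\Delta(r, r_0)$ is minimal. Then, $r_0$ is a repair of $r$.
For every $1 \leq i \leq m$, if $l_i(\bar{t}_i)$ is $p(\bar{t})$ or $\neg p(\bar{t})$, then
$p(\bar{t}) \notin \Delta(r, r')$. Using this fact we conclude that $p(\bar{t}) \notin \Delta(r, r_0)$.
Therefore, $r \models \bigvee_{i=1}^{m} l_i(\bar{t}_i)$ if and only if $r_0 \models \bigvee_{i=1}^{m} l_i(\bar{t}_i)$. But
we assumed that $r \not\models \bigvee_{i=1}^{m} l_i(\bar{t}_i)$, so $r_0 \not\models \bigvee_{i=1}^{m} l_i(\bar{t}_i)$; a
contradiction.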
Proof of Theorem 2: From theorem 3.
Proof of Corollary 2: In this case it holds:
1. For every tuple $\bar{a}$, $IC \not\models_{DB} P_i(\bar{a})$, because the empty
database instance (which has only empty base relations)
satisfies $IC$, but not $P_i(\bar{a})$.
2. For every tuple $\bar{a}$, $IC \not\models_{DB} \neg P_i(\bar{a})$, since the database
instance $r_i^{\bar{a}}$, where the relation $P_i$ contains only the tuple
$\bar{a}$ and the other relations are empty, satisfies $IC$, but
not $\neg P_i(\bar{a})$.
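Proof of Theorem 3: Suppose that $r \models_c l(\bar{t})$. Let $r'$ be a repair
of $r$; we have that $r' \models l(\bar{t})$. By Proposition 1 we have that
$r' \models T_n(l(\bar{t}))$, that is
$$r' \models l(\bar{t}) \wedge \bigwedge_{i=1}^{m} \forall \Bigl( \bigvee_{j=1}^{m_i} l_{ij}(\bar{t}, \bar{x}_{ij}) \vee \psi_i(\bar{t}, \bar{x}_i) \Bigr) \qquad (6)$$
We want to prove that for every $i$ and for every sequence
of ground tuples $\bar{a}_i, \bar{a}_{i1}, \ldots, \bar{a}_{i m_i}$:
$$r \models \bigvee_{j=1}^{m_i} l_{ij}(\bar{t}, \bar{a}_{ij}) \vee \psi_i(\bar{t}, \bar{a}_i) \qquad (7)$$
To do this, first we are going to prove that for every $i \in S$
and for every sequence of ground tuples $\bar{a}_i, \bar{a}_{i1}, \ldots, \bar{a}_{i m_i}$:
$$r \models \bigvee_{j=1}^{m_i} l_{ij}(\bar{t}, \bar{a}_{ij}) \vee \psi_i(\bar{t}, \bar{a}_i) \qquad (8)$$
This is immediately obtained when $r \models \psi_i(\bar{t}, \bar{a}_i)$. Assume
that $r \not\models \psi_i(\bar{t}, \bar{a}_i)$. We know that $\psi_i$ only mentions
built-in predicates, thus for every repair $r'$ of $r$ we have that
$r' \not\models \psi_i(\bar{t}, \bar{a}_i)$. Therefore, by (6) we conclude that for every
repair $r'$ of $r$:
$$r' \models \bigvee_{j=1}^{m_i} l_{ij}(\bar{t}, \bar{a}_{ij}) \vee \psi_i(\bar{t}, \bar{a}_i)$$
By Proposition 2 we conclude (8). Thus we have that
$$r \models l(\bar{t}) \wedge \bigwedge_{i \in S} \forall \Bigl( \bigvee_{j=1}^{m_i} l_{ij}(\bar{t}, \bar{x}_{ij}) \vee \psi_i(\bar{t}, \bar{x}_i) \Bigr)$$
but by the second condition in the hypothesis of the theorem
we conclude that:
$$r \models l(\bar{t}) \wedge \bigwedge_{i=1}^{m} \forall \Bigl( \bigvee_{j=1}^{m_i} l_{ij}(\bar{t}, \bar{x}_{ij}) \vee \psi_i(\bar{t}, \bar{x}_i) \Bigr)$$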
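Proof of Theorem 4: Suppose that $IC$ is acyclic; then
there exists $f$ as in Definition 14. We are going to prove
by induction on $k$ that for every literal name $l$, if $f(l) = k$,
then $T_{k+1}(l(\bar{x})) = T_{k+2}(l(\bar{x}))$.
(I) If $k = 0$. We know that for every literal name $l'$,
$f(l') \geq 0$. Therefore, every integrity constraint containing $\neg l$
is of the form $\neg l(\bar{x}) \vee \psi(\bar{y})$, where $\psi$ only mentions built-in
predicates. This is because if there were any other literal
$l'$ in the integrity constraint, we would have $f(l') < f(l) = 0$.
Then $T_1(l(\bar{x})) = T_2(l(\bar{x}))$.
(II) Suppose that the property is true for every $m < k$. We
know that $T_{k+2}(l(\bar{x}))$ is of the form:
$$l(\bar{x}) \wedge \bigwedge_{i=1}^{m} \bar{Q}_i \Bigl( \bigvee_{j=1}^{m_i} T_{k+1}(l_{ij}(\bar{x}_{ij})) \vee \psi_i(\bar{x}_i) \Bigr)$$
where $\bar{Q}_i$ is a sequence of quantifiers over all the variables
$\bar{x}_{i1}, \ldots, \bar{x}_{i m_i}, \bar{x}_i$ not appearing in $\bar{x}$, and $T_{k+1}(l(\bar{x}))$ is of the
form:
$$l(\bar{x}) \wedge \bigwedge_{i=1}^{m} \bar{Q}_i \Bigl( \bigvee_{j=1}^{m_i} T_{k}(l_{ij}(\bar{x}_{ij})) \vee \psi_i(\bar{x}_i) \Bigr)$$
By definition of $f$, we know that for every literal name $l_{ij}$
in the previous formulas, $f(l_{ij}) < k$. Then by the induction
hypothesis $T_k(l_{ij}(\bar{x}_{ij})) = T_{k+1}(l_{ij}(\bar{x}_{ij}))$ (since if $T_m(l(\bar{x})) = T_{m+1}(l(\bar{x}))$,
then for every $n \geq m$, $T_n(l(\bar{x})) = T_{n+1}(l(\bar{x}))$).
($\Leftarrow$) Suppose that for every literal name $l$, $T_\omega(l(\bar{x}))$ is finite.
Then for every literal name $l$ there exists a first natural
number $k$ such that $T_k(l(\bar{x})) = T_{k+1}(l(\bar{x}))$. Let us define
a function $f$, from the literal names into the natural
numbers, by $f(l) = k$ ($k$ as before). We can show that this
is a well defined function that behaves as in Definition 14:
if $\bigvee_{i=1}^{m} l_i(\bar{x}_i) \vee \psi(\bar{y}) \in IC$, then for every $1 \leq s \leq m$,
$T_{f(\neg l_s)}(\neg l_s(\bar{x}_s))$ is of the form
$$\neg l_s(\bar{x}_s) \wedge \bar{Q} \Bigl( \bigvee_{i=1}^{s-1} T_{f(\neg l_s)-1}(l_i(\bar{x}_i)) \vee \bigvee_{i=s+1}^{m} T_{f(\neg l_s)-1}(l_i(\bar{x}_i)) \vee \psi(\bar{y}) \Bigr) \wedge \theta(\bar{x}_s) \qquad (9)$$
where $\bar{Q}$ is a sequence of quantifiers over all the variables
$\bar{x}_1, \ldots, \bar{x}_m, \bar{y}$ not appearing in $\bar{x}_s$, and $T_{f(\neg l_s)+1}(\neg l_s(\bar{x}_s))$ is
of the form
$$\neg l_s(\bar{x}_s) \wedge \bar{Q} \Bigl( \bigvee_{i=1}^{s-1} T_{f(\neg l_s)}(l_i(\bar{x}_i)) \vee \bigvee_{i=s+1}^{m} T_{f(\neg l_s)}(l_i(\bar{x}_i)) \vee \psi(\bar{y}) \Bigr) \wedge \theta(\bar{x}_s) \qquad (10)$$
By definition of $f$, $T_{f(\neg l_s)}(\neg l_s(\bar{x}_s)) = T_{f(\neg l_s)+1}(\neg l_s(\bar{x}_s))$. Then,
by the form of (9) and (10), we conclude that for every $i \neq s$,
$T_{f(\neg l_s)-1}(l_i(\bar{x}_i)) = T_{f(\neg l_s)}(l_i(\bar{x}_i))$, and then, again by definition
of $f$, $f(l_i) < f(\neg l_s)$.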
Proof of Corollary 3: The following stratification function
from literals to $\mathbb{N}$ can be defined: $f(\neg P_i) = 0$ and $f(P_j) = 1$,
where $P_i$, $P_j$ are relation names.
Proof of Theorem 5: For uniform constraints the residues
do not contain quantifiers. Therefore $T_n(l(\bar{x}))$ for every $n \geq 0$
is quantifier-free and contains only the variables that occur
in $\bar{x}$. There are only finitely many inequivalent formulas with
this property, and thus $T_\omega(l(\bar{x}))$ is finite.
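Lemma 6. If $T_n(l(\bar{x}))$ is of the form:
$$l(\bar{x}) \wedge \bigwedge_{i=1}^{m} \forall \bar{x}_i\, \forall \bar{y}_i \bigl( C_i(\bar{x}, \bar{x}_i) \vee \psi_i(\bar{x}, \bar{y}_i) \bigr)$$
then $T_{n+1}(l(\bar{x}))$ is of the form:
$$l(\bar{x}) \wedge \bigwedge_{i=1}^{m} \forall \bar{x}_i\, \forall \bar{y}_i \bigl( T_1(C_i(\bar{x}, \bar{x}_i)) \vee \psi_i(\bar{x}, \bar{y}_i) \bigr)$$
Lemma 7. If for a ground tuple $\bar{a}$, $T_n(l(\bar{a})) \vdash \bigvee_{j=1}^{k} l_j(\bar{a}, \bar{z}_j)$,
then $T_{n+1}(l(\bar{a})) \vdash \bigvee_{j=1}^{k} T_1(l_j(\bar{a}, \bar{z}_j))$.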
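Proof of Theorem 6: Suppose that for a natural number $n$,
$\forall \bar{x}\,(T_n(l(\bar{x})) \rightarrow T_{n+1}(l(\bar{x})))$ is a valid sentence. We are going
to prove that for every $m \geq n$, $\forall \bar{x}\,(T_m(l(\bar{x})) \rightarrow T_{m+1}(l(\bar{x})))$ is
a valid sentence, by induction on $m$.
(I) If $m = n$, by hypothesis.
(II) Suppose that $\forall \bar{x}\,(T_m(l(\bar{x})) \rightarrow T_{m+1}(l(\bar{x})))$ is a valid sentence.
For every clause $\bigvee_{j=1}^{k} l_j(\bar{x}, \bar{z}_j) \vee \psi(\bar{x}, \bar{z})$ in $T_{m+1}(l(\bar{x}))$
and for every ground tuple $\bar{a}$ we have that
$$T_m(l(\bar{a})) \vdash \bigvee_{j=1}^{k} l_j(\bar{a}, \bar{z}_j) \vee \psi(\bar{a}, \bar{z})$$
By Lemma 7 and considering that $\psi$ only mentions built-in
predicates we have that $T_{m+1}(l(\bar{a})) \vdash \bigvee_{j=1}^{k} T_1(l_j(\bar{a}, \bar{z}_j)) \vee \psi(\bar{a}, \bar{z})$,
and from this and Lemma 6 we can conclude that
$\forall \bar{x}\,(T_{m+1}(l(\bar{x})) \rightarrow T_{m+2}(l(\bar{x})))$ is a valid sentence.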
Proof of Theorem 7: Let $Q(\bar{x})$ be a domain independent
query and $r$ a database instance. Define $A_n = \{\bar{t} \mid r \models T_n(Q(\bar{t}))\}$.
We know that for every $n$: $A_{n+1} \subseteq A_n$, therefore $A = \{A_i \mid i < \omega\}$
is a family of subsets of $A_0$. But $A_0$ is finite because $Q(\bar{x})$
is a domain independent query. Thus, there exists a minimal
element $A_m$ in $A$. For this element, it holds that for every
$k \geq m$: $A_m = A_k$, since $A_k \subseteq A_m$.
... Other approaches, instead, prefer to restore consistency via corrective actions after an illegal update-typically a rollback-but several works compute a repair that changes, adds, or deletes tuples of the database in order to satisfy the integrity constraints again. The generation of repairs is a nontrivial issue; see, e.g., [40][41][42][43][44] for surveys on the topic. ...
... An orthogonal research avenue is that of allowing inconsistencies to occur in databases but to filter the data during query processing so as to provide a consistent query answer [40], i.e., the set of tuples that answer a query in all possible repairs (without, of course, actually having to compute all the repairs). This, however, has repercussions on the complexity of query answering. ...
Article
Full-text available
Data integrity is crucial for ensuring data correctness and quality and is maintained through integrity constraints that must be continuously checked, especially in data-intensive systems like OLTP. While DBMSs handle very simple cases of constraints (such as primary key and foreign key constraints) well, more complex constraints often require ad hoc solutions. Research since the 1980s has focused on automatic and simplified integrity constraint checking, leveraging the assumption that databases are consistent before updates. This paper presents program transformation operators to generate simplified integrity constraints, focusing on complex constraints expressed in denial form. In particular, we target a class of integrity constraints, called extended denials, which are more general than tuple-generating dependencies and equality-generating dependencies. One of the main contributions of this study consists in the automatic treatment of such a general class of constraints, encompassing the all the most useful and common cases of constraints adopted in practice. Another contribution is the applicability of the proposed technique with a “preventive” approach; unlike all other methods for integrity maintenance, we check whether an update will violate the constraints before executing it, so we never have to undo any work, with potentially huge savings in terms of execution overhead. These techniques can be readily applied to standard database practices and can be directly translated into SQL.
... A related yet distinct domain, with its primary focus on addressing inconsistent information, involves repairing knowledge bases and consistent query answering (CQA) (Arenas, Bertossi, and Chomicki 1999a;Chomicki 2007;Leopoldo and Bertossi 2011;Bienvenu and Rosati 2013). The goal there is to identify and repair inconsistencies in the data and obtain a consistent knowledge base Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). ...
Article
The connection between inconsistent databases and Dung’s abstract argumentation framework has recently drawn growing interest. Specifically, an inconsistent database, involving certain types of integrity constraints such as functional and inclusion dependencies, can be viewed as an argumentation framework in Dung’s setting. Nevertheless, no prior work has explored the exact expressive power of Dung’s theory of argumentation when compared to inconsistent databases and integrity constraints. In this paper, we close this gap by arguing that an argumentation framework can also be viewed as an inconsistent database. We first establish a connection between subset-repairs for databases and extensions for AFs considering conflict-free, naive, admissible, and preferred semantics. Further, we define a new family of attribute-based repairs based on the principle of maximal content preservation. The effectiveness of these repairs is then highlighted by connecting them to stable, semi-stable, and stage semantics. Our main contributions include translating an argumentation framework into a database together with integrity constraints. Moreover, this translation can be achieved in polynomial time, which is essential in transferring complexity results between the two formalisms.
... In other words, the reasoning is happening in a repair of the knowledge graph. The notion of repair-based reasoning has been initially introduced by Arenas et al. (1999) for query answering over relational databases (actually, they called their approach consistent query answering). This idea was later adapted by Lembo et al. (2010) for ontological reasoning, leading to the development of the so-called AR semantics for various DLs where AR stands for ABox repair. ...
Preprint
Full-text available
In Knowledge Graphs (KGs), where the schema of the data is usually defined by particular ontologies, reasoning is a necessity to perform a range of tasks, such as retrieval of information, question answering, and the derivation of new knowledge. However, information to populate KGs is often extracted (semi-) automatically from natural language resources, or by integrating datasets that follow different semantic schemas, resulting in KG inconsistency. This, however, hinders the process of reasoning. In this survey, we focus on how to perform reasoning on inconsistent KGs, by analyzing the state of the art towards three complementary directions: a) the detection of the parts of the KG that cause the inconsistency, b) the fixing of an inconsistent KG to render it consistent, and c) the inconsistency-tolerant reasoning. We discuss existing work from a range of relevant fields focusing on how, and in which cases they are related to the above directions. We also highlight persisting challenges and future directions.
Article
This paper studies cost-effective graph cleaning with a single machine. We adopt a rule-based method that may embed machine learning models as predicates in the rules. Graph cleaning with the rules involves rule discovery, error detection and correction. These tasks are both computation-heavy and I/O-intensive as they repeatedly invoke costly graph pattern matching, and produce a large amount of a large volume of intermediate results, among other things. In light of these, no existing single-machine system is able to carry out these tasks even on not-too-large graphs, even using GPUs. Thus we develop MiniClean, a single-machine system for cleaning large graphs. It proposes (1) a workflow that better fits a single machine by pipelining CPU, GPU and I/O operations; (2) memory footprint reduction with bundled processing and data compression; and (3) a multi-mode parallel model for SIMD, pipelined and independent parallelism, and their scheduling to maximize CPU--GPU synergy. Using real-life graphs, we empirically verify that MiniClean outperforms the SOTA single-machine systems by at least 65.34× and multi-machine systems with 32 nodes by at least 8.09×.
Article
This paper studies incremental rule discovery. Given a dataset D, rule discovery is to mine the set of the rules on D such that their supports and confidences are above thresholds 𝜎 and 𝛅 , respectively. We formulate incremental problems in response to updates Δ𝜎 and/or Δ𝛅, to compute rules added and/or removed with respect to 𝜎 + Δ𝜎 and 𝛅 + Δ𝛅. The need for studying the problems is evident since practitioners often want to adjust their support and confidence thresholds during discovery. The objective is to minimize unnecessary recomputation during the adjustments, not to restart the costly discovery process from scratch. As a testbed, we consider entity enhancing rules, which subsume popular data quality rules as special cases. We develop three incremental algorithms, in response to Δ𝜎 , Δ𝜎 and both. We show that relative to a batch discovery algorithm, these algorithms are bounded, i.e., they incur the minimum cost among all incrementalizations of the batch one, and parallelly scalable, i.e., they guarantee to reduce runtime when given more processors. Using real-life data, we empirically verify that the incremental algorithms outperform the batch counterpart by up to 658× when Δ𝜎 and Δ𝜎 are either positive or negative.
Article
We embark on a study of the consistent answers of queries over databases annotated with values from a naturally ordered positive semiring. In this setting, the consistent answers of a query are defined as the minimum of the semiring values that the query takes over all repairs of an inconsistent database. The main focus is on self-join free conjunctive queries and key constraints, which is the most extensively studied case of consistent query answering over standard databases. We introduce a variant of first-order logic with a limited form of negation, define suitable semiring semantics, and then establish the main result of the paper: the consistent query answers of a self-join free conjunctive query under key constraints are rewritable in this logic if and only if the attack graph of the query contains no cycles. This result generalizes an analogous result of Koutris and Wijsen for ordinary databases, but also yields new results for a multitude of semirings, including the bag semiring, the tropical semiring, and the fuzzy semiring. Further, for the bag semiring, we show that computing the consistent answers of any self-join free conjunctive query whose attack graph has a strong cycle is not only NP-hard but also it is NP-hard to even approximate the consistent answers with a constant relative approximation guarantee.
Preprint
Full-text available
We present an approach to computing consistent answers to analytic queries in data warehouses operating under a star schema and possibly containing missing values and inconsistent data. Our approach is based on earlier work concerning consistent query answering for standard, non-analytic queries in multi-table databases. In that work we presented polynomial algorithms for computing either the exact consistent answer to a standard, non analytic query or bounds of the exact answer, depending on whether the query involves a selection condition or not. We extend this approach to computing exact consistent answers of analytic queries over star schemas, provided that the selection condition in the query involves no keys and satisfies the property of independency (i.e., the condition can be expressed as a conjunction of conditions each involving a single attribute). The main contributions of this paper are: (a) a polynomial algorithm for computing the exact consistent answer to a usual projection-selection-join query over a star schema under the above restrictions on the selection condition, and (b) showing that, under the same restrictions the exact consistent answer to an analytic query over a star schema can be computed in time polynomial in the size of the data warehouse.
Article
Denial constraints (DCs) are well-known to express business rules on data. They subsume other integrity constraints (ICs), such as key constraints or functional dependencies. One can use traditional DBMS or specialized algorithms to validate such dependencies on a dataset. However, no known approach exists to detect DC violations incrementally. Data typically changes over time, and recomputing the entire violation set after every update is wasteful. Alerting data practitioners of data quality issues immediately, enables them to take measures earlier and can help prevent follow-up issues. We present Weever, the first incremental approach to detect all violations of a given set of DCs. It uses a novel index structure to process inequality predicates and a new method to plan the execution order of predicates depending on their selectivity, reducing redundant computations when handling multiple DCs. Our evaluation shows that Weever outperforms a DBMS-based baseline by up to two orders of magnitude. And in the same time that a state-of-the-art static approach takes to analyze an entire dataset, Weever processes up to 200 000 insertions.
Article
Given a dataset with incomplete data (e.g., missing values), training a machine learning model over the incomplete data requires two steps. First, it requires a data-effective step that cleans the data in order to improve the data quality (and the model quality on the cleaned data). Second, it requires a data-efficient step that selects a core subset of the data (called coreset) such that the trained models on the entire data and the coreset have similar model quality, in order to save the computational cost of training. The first-data-effective-then-data-efficient methods are too costly, because they are expensive to clean the whole data; while the first-data-efficient-then-data-effective methods have low model quality, because they cannot select high-quality coreset for incomplete data. In this paper, we investigate the problem of coreset selection over incomplete data for data-effective and data-efficient machine learning. The essential challenge is how to model the incomplete data for selecting high-quality coreset. To this end, we propose the GoodCore framework towards selecting a good coreset over incomplete data with low cost. To model the unknown complete data, we utilize the combinations of possible repairs as possible worlds of the incomplete data. Based on possible worlds, GoodCore selects an expected optimal coreset through gradient approximation without training ML models. We formally define the expected optimal coreset selection problem, prove its NP-hardness, and propose a greedy algorithm with an approximation ratio. To make GoodCore more efficient, we propose optimization methods that incorporate human-in-the-loop imputation or automatic imputation method into our framework. Moreover, a group-based strategy is utilized to further accelerate the coreset selection with incomplete data given large datasets. Experimental results show the effectiveness and efficiency of our framework with low cost.
Chapter
Full-text available
This is the penultimate version of a review article. Contents: 1 Introduction - 1.1 The problem of belief revision: An example - 1.2 The methodological problems of belief revision - 1.3 Belief revision in science - 1.4 Different kinds of belief change - 1.5 Two approaches to describing belief revisions - 1.6 Related areas -- 2 Representing belief states - 2.1 Preliminaries - 2.2 Belief sets - 2.3 Belief bases - 2.4 Justifications vs. coherence models - 2.5 Possible-worlds models -- 3 Rationality postulates for belief revisions - 3.1 The AGM postulates for revision - 3.2 The AGM postulates for contraction - 3.3 From contractions to revisions and vice versa - 3.4 Postulates for contraction -- 4 Constructive models and representation theorems - 4.1 Partial meet contraction - 4.2 Epistemic entrenchment - 4.3 Safe contraction - 4.4 Minimal changes of models - 4.5 Ordinal conditional functions -- 5 Base contractions and revisions - 5.1 Full meet contraction - 5.2 Partial meet contraction - 5.3 Maxichoice contraction - 5.4 Safe contraction - 5.5 Epistemic entrenchment for bases - 5.6 Base revisions - 5.7 Computational complexity -- 6 Connections with non-monotonic logic - 6.1 The basic idea - 6.2 Translating postulates for belief revision into non-monotonic logic - 6.3 Translating conditions on non-monotonic inference - 6.4 Comparing models of belief revision and models of non-monotonic logic -- 7 Truth maintenance systems - 7.1 Justification-based truth maintenance systems - 7.2 Other kinds of truth maintenance systems
Book
Full-text available
Chapter
When an “updating” operation occurs on the current state of a data base, one has to ensure the new state obeys the integrity constraints. So, some of them have to be evaluated on this new state. The evaluation of an integrity constraint can be time consuming, but one can improve such an evaluation by taking advantage from the fact that the integrity constraint is satisfied in the current state. Indeed, it is then possible to derive a simplified form of this integrity constraint which is sufficient to evaluate in the new state in order to determine whether the initial constraint is still satisfied in this new state. The purpose of this paper is to present a simplification method yielding such simplified forms for integrity constraints. These simplified forms depend on the nature of the updating operation which is the cause of the state change. The operations of inserting, deleting, updating a tuple in a relation as well as transactions of such operations are considered. The proposed method is based on syntactical criteria and is validated through first order logic. Examples are treated and some aspects of the method application are discussed.
Article
Numerous belief revision and update semantics have been proposed in the literature in the past few years, but until recently, no work in the belief revision literature has focussed on the problem of implementing these semantics, and little attention has been paid to algorithmic questions. In this paper, we present and analyze our update algorithms built in Immortal, a model-based belief revision system. These algorithms can work for a variety of model-based belief revision semantics proposed to date. We also extend previously proposed semantics to handle updates involving the equality predicate and function symbols and incorporate these extensions in our algorithms. As an example, we discuss the use of belief revision semantics to model the action-augmented envisioning problem in qualitative simulation, and we show the experimental results of running an example simulation in Immortal.
Article
The logical theory of database integrity is developed as an application of deontic logic. After a brief introduction to database theory and Kripke semantics, a deontic operator X, denoting what should hold, non-trivially, given a set of constraints, is defined and axiomatized. The theory is applied to updates, to dynamic constraints and to databases extended with nulls.
Article
A great deal of research has been devoted to nontrivial reasoning in the presence of inconsistency. However, previous formalisms on this account do not permit consistent reasoning in the presence of inconsistency—they may conclude a statement on one hand and the negation of the statement on the other. In this paper, we propose a logic that allows an agent to reason consistently, even though there are inconsistencies in the agent's beliefs. We first give the semantics of the logic and then present a simple, sound and complete axiomatization for the logic, thus forming the formal basis of reasoning consistently in the presence of inconsistency.
Article
The semantics of revising knowledge bases represented by sets of propositional sentences is analyzed from a model-theoretic point of view. A characterization of all revision schemes that satisfy the Gärdenfors rationality postulates is given in terms of minimal change with respect to an ordering among interpretations. Revision methods proposed by various authors are surveyed and analyzed in this framework. The correspondences between Gärdenfors-like rationality postulates and minimal changes with respect to other orderings are also investigated.
Conference Paper
Integrity constraints are introduced in a logical framework. Examples are given to illustrate the expressiveness of integrity constraints. Various definitions of the semantics of integrity constraints are given and compared. Additional types of constraints are also mentioned. Techniques of reasoning with integrity constraints, including model elimination and the residue method, are explained. Applications of integrity constraints are considered in detail, including semantic query optimization, cooperative answering, combining databases, and view updates. Additional applications to order optimization, query folding, object-oriented databases, and database security are sketched. The conclusion lists areas of integrity constraints that need to be investigated.