Available via license: CC BY 4.0
arXiv:2412.09125v1 [cs.AI] 12 Dec 2024
Goal-Driven Query Answering over First- and Second-Order
Dependencies with Equality
Efthymia Tsamoura^{a,∗}, Boris Motik^{b}
^a Samsung AI, Cambridge, United Kingdom
^b Department of Computer Science, University of Oxford, Oxford, United Kingdom
Abstract
Dependencies are logical statements that formally describe a domain of interest. In databases, dependencies are used to ex-
press integrity constraints, deal with data incompleteness, or specify transformations between database schemas. In knowledge
representation and reasoning, dependencies are sometimes called existential rules, and they are used to describe background
knowledge about an application domain. Dependencies are usually expressed in fragments of first-order logic, but second-order
dependencies have received considerable attention as well: they support existentially quantified function variables, which are
necessary to express composition of certain classes of schema mappings [39, 7] and can generally simplify domain modelling.
When second-order dependencies are combined with equality reasoning (i.e., the ability to derive equality statements), the re-
sulting language can express composition of very expressive classes of schema mappings [70, 6].
Query answering over data with dependencies plays a central role in most applications of dependencies. The problem is
commonly solved by using a suitable variant of the chase algorithm to compute a universal model of the dependencies and the
data and thus explicate all knowledge implicit in the dependencies. After this preprocessing step, an arbitrary conjunctive query
over the dependencies and the data can be answered by evaluating it over the computed universal model. If, however, the query to be
answered is fixed and known in advance, computing the universal model is often inefficient as many inferences made during this
process can be irrelevant to a given query. In such cases, a goal-driven approach, which avoids drawing unnecessary inferences,
promises to be more efficient and thus preferable in practice.
In this paper we present what we believe to be the first technique for goal-driven query answering over first- and second-
order dependencies with equality reasoning. Our technique transforms the input dependencies so that applying the chase to the
output avoids many inferences that are irrelevant to the query. The transformation proceeds in several steps, which comprise the
following three novel techniques. First, we present a variant of the singularisation technique by Marnette [60] that is applicable
to second-order dependencies and that corrects an incompleteness of a related formulation by ten Cate et al. [74]. Second, we
present a relevance analysis technique that can eliminate from the input those dependencies that provably do not contribute to query
answers. Third, we present a variant of the magic sets algorithm [19] that can handle second-order dependencies with equality
reasoning. We also present the results of an extensive empirical evaluation, which show that goal-driven query answering can be
orders of magnitude faster than computing the full universal model.
1. Introduction
The need to describe a domain of interest using formal statements naturally arises in many areas of databases and knowledge
representation. Such descriptions are usually formulated in a suitable fragment of first- or second-order logic, and are, depending
on one’s perspective and background, called dependencies [20], ∀∃-rules [12], existential rules [69], or Datalog± [28]. For the
sake of consistency, we use the term ‘dependency’ throughout this paper, but our results apply equally in all of these contexts.
1.1. Background: Dependencies
Dependencies are used in databases for many different purposes. They are often used to formulate integrity constraints—
statements that describe valid database states [38, 32]. They can also be used to complete an incomplete database with missing
facts and thus provide richer answers to users’ queries [45]. Finally, they have been used extensively in declarative data integration
[56, 48, 55] and to specify mappings between database schemas [37]—that is, how to transform any database expressed in one
schema to a database in another schema. In knowledge representation, it was shown that ontology languages such as EL++ [10]
and certain languages of the DL-Lite family [31, 9] can be expressed as dependencies [28].
∗Corresponding author
Preprint submitted to Elsevier December 13, 2024
Dependencies were initially expressed using ad hoc languages [38, 32], but Nicolas [71] observed that most dependency
classes can be expressed in first-order logic. Beeri and Vardi [20] generalised this idea and proposed tuple and equality generating
dependencies (TGDs and EGDs), arguably the most widely used dependency classes in the literature. Intuitively, TGDs specify
that the presence of certain tuples in the database implies existence of other tuples, and EGDs specify that the presence of certain
tuples implies uniqueness of certain database values. TGDs and EGDs are fragments of first-order logic with equality, which was
found sufficient for a large number of applications of dependencies. In certain cases, however, the expressive power of first-order
logic is insufficient. Fagin et al. [39] initially observed this in the context of schema mappings using the following example.
Example 1 ([39]). Consider a mapping scenario where the source schema S1 consists of a relation Takes associating student
names with courses. TGD (1) specifies a mapping that copies relation Takes into relation Takes1 of schema S2, and TGD (2)
specifies a mapping where relation Takes is restructured into a relation Student of schema S2 that associates a student name with
a student ID. Finally, TGD (3) specifies a mapping from S2 into a schema S3, where relation Enrollment is analogous to relation
Takes, but it associates student IDs (rather than names) with courses.
∀n∀c.[Takes(n,c)→Takes1(n,c)] (1)
∀n∀c.[Takes(n,c)→ ∃s.Student(n,s)] (2)
∀n∀s∀c.[Student(n,s)∧Takes1(n,c)→Enrollment(s,c)] (3)
A source database over schema S1 can be mapped into a database over schema S3 by first applying TGDs (1) and (2)
to produce a database over schema S2, and then applying TGD (3) to produce the target database. However, for the sake of
efficiency, it is desirable to produce the target database directly from the source database. One might intuitively expect TGD (4)
to express such a transformation, but Fagin et al. [39] pointed out that this is not the case.
∀n∀c.[Takes(n,c)→ ∃s.Enrollment(s,c)] (4)
Intuitively, the problem arises because the student ID in TGD (2) depends on the student name, whereas in TGD (4) it depends
both on the name and the course. Fagin et al. [39] showed that the language of TGDs cannot express the composition of these
two mappings—that is, no finite set of TGDs exists that can, in all cases, produce a target database that is equivalent (in a
well-defined sense) to the result of the two-step process.
Fagin et al. [39] also showed that the second-order (SO) dependency (5) correctly expresses the composition of the two
mappings.
∃f∀n∀c.[Takes(n,c)→Enrollment(f(n),c)] (5)
Instead of existentially quantified individual variables, second-order dependencies support existentially quantified function variables in implication
consequents. In the SO dependency (5), f is a function variable, and term f(n) can be understood as ‘the ID of the student with
name n’. This increased expressive power allows SO dependencies to express compositions of any sets of TGDs, and moreover
SO dependencies are closed under composition too.
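The difference between TGD (4) and SO dependency (5) can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the sample data, the string encoding of fresh student IDs, and the helper names `two_step`, `tgd4`, `so5`, and `courses_per_id` are all hypothetical and do not come from the paper. Fresh IDs are generated Skolem-style, so that TGD (2) yields one ID per name.

```python
# Hypothetical source data: relation Takes(name, course).
takes = {("alice", "db"), ("alice", "ai"), ("bob", "db")}

def two_step(takes):
    """Apply TGDs (1)-(3): one fresh ID per student name, then join."""
    takes1 = set(takes)                                  # TGD (1)
    student = {(n, f"id({n})") for (n, _) in takes}      # TGD (2)
    return {(s, c) for (n, s) in student                 # TGD (3)
            for (n2, c) in takes1 if n == n2}

def tgd4(takes):
    """Apply TGD (4): the fresh ID depends on both name and course."""
    return {(f"id({n},{c})", c) for (n, c) in takes}

def so5(takes):
    """Apply SO dependency (5): the ID is f(n), a function of the name only."""
    return {(f"f({n})", c) for (n, c) in takes}

def courses_per_id(enrollment):
    """Group courses by student ID to expose how IDs are shared."""
    groups = {}
    for s, c in enrollment:
        groups.setdefault(s, set()).add(c)
    return sorted(sorted(cs) for cs in groups.values())

print(courses_per_id(two_step(takes)))  # [['ai', 'db'], ['db']]
print(courses_per_id(tgd4(takes)))      # [['ai'], ['db'], ['db']]
print(courses_per_id(so5(takes)))       # [['ai', 'db'], ['db']]
```

In the two-step result and in the result of (5), both of alice's courses share one ID, whereas TGD (4) assigns a separate ID to each enrolment and thus loses the link between them.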
Following this seminal result, properties of composition have been studied for other kinds of dependencies too. Nash et al.
[70] studied the composition of mappings that do not distinguish the source and the target schema, and they identified cases that
do and do not support composition; some of these involve second-order dependencies. Furthermore, Arenas et al. [6] studied
composition of mappings that distinguish the source from the target schemas, but where the target schema also uses TGDs
and EGDs. They showed that second-order dependencies alone cannot express composition of such mappings, but the latter
can be achieved using dependencies containing equality in the consequents. The resulting formalism is called source-to-target
SO dependencies, and they generalise both TGDs and EGDs. Finally, Arenas et al. [7] introduced plain source-to-target SO
dependencies and showed that this language behaves well under the inversion operator.
We are not aware of any attempts to study second-order dependencies in a knowledge representation context. However, based
on our experience of applying dependencies in practice, function variables often allow for much more intuitive modelling than
existential quantifiers. For example, in TGD (2), using f(n) instead of the existentially quantified variable s is often more natural
and convenient because it allows one to directly refer to the ID of n. Moreover, several dependencies can use f(n) to refer to
the same object: term f(n) always ‘produces’ the same student ID for the same name n. We believe such a modelling style has
the potential to overcome some issues of pure first-order modelling we outlined in our earlier work [64]. Consequently, we see
second-order dependencies as the natural next step in the evolution of logic-based knowledge representation formalisms.
1.2. Background: Query Answering over Dependencies
We have thus far discussed dependency languages only from the point of expressivity; however, another key issue is their
ability to support effective query answering over data with dependencies. Given a set of input dependencies, a dataset (i.e., a
set of relational facts), and a query, the objective is to answer the query while taking into account the facts that are logically
implied by the dataset and the dependencies. This computational problem plays a central role in numerous applications such as
declarative data integration [56, 48, 55], answering queries using views [47], accessing data sources with restrictions [34, 63],
and answering queries over ontologies [28]. Thus, identifying dependency classes that are sufficiently expressive but also support
effective query answering has received considerable attention in the literature.
The standard way to solve the query answering problem is to use a suitable variant of the chase algorithm to extend the
dataset with additional facts in a way that satisfies all dependencies; the resulting set of facts is sometimes called a universal
model. Any conjunctive query can be evaluated in the universal model, and the answers consisting of constants only provide
the solution to the query answering problem. To the best of our knowledge, Johnson and Klug [50] were the first to introduce
the chase algorithm. Numerous variants of this idea have been developed since, such as the restricted [20, 37], oblivious [27],
semioblivious [26], Skolem [60, 73], core [33], parallel [33], and frugal [52] chase. Benedikt et al. [24] discuss at length the
similarities and differences of many chase variants for first-order dependencies. The chase was also extended to handle second-
order dependencies without [39] and with [6, 70] equality in the consequents.
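To make the idea of the chase concrete, the following Python sketch implements a very small Skolem-style chase for TGDs without equality. It is a simplified illustration and not the algorithm of any of the cited works: the encoding of atoms as `(predicate, arguments)` pairs, the `?`-prefix convention for variables, the string encoding of Skolem terms, and the function names `match`, `matches`, and `skolem_chase` are assumptions made for this example only.

```python
def match(atom, fact, subst):
    """Extend `subst` so that `atom` maps onto `fact`; return None on failure."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    subst = dict(subst)
    for a, v in zip(args, fargs):
        if a.startswith("?"):              # variables start with '?'
            if subst.setdefault(a, v) != v:
                return None
        elif a != v:                        # constants must match exactly
            return None
    return subst

def matches(body, facts):
    """Enumerate all substitutions mapping every body atom to some fact."""
    substs = [{}]
    for atom in body:
        substs = [s2 for s in substs for f in facts
                  if (s2 := match(atom, f, s)) is not None]
    return substs

def skolem_chase(facts, rules, max_rounds=100):
    """Apply rules until no new fact is derived (or max_rounds is reached).

    Each rule is (name, body, head, frontier); head variables not bound by
    the body are replaced by a Skolem term over the frontier variables.
    """
    facts = set(facts)
    for _ in range(max_rounds):
        new = set()
        for name, body, head, frontier in rules:
            for s in matches(body, facts):
                def value(t):
                    if not t.startswith("?"):
                        return t
                    if t in s:
                        return s[t]
                    skolem_args = ",".join(s[x] for x in frontier)
                    return f"{name}_{t[1:]}({skolem_args})"
                for pred, args in head:
                    new.add((pred, tuple(value(t) for t in args)))
        if new <= facts:
            return facts
        facts |= new
    raise RuntimeError("chase did not reach a fixpoint within max_rounds")

# TGDs (1)-(3) from Example 1 in the encoding assumed above.
rules = [
    ("r1", [("Takes", ("?n", "?c"))], [("Takes1", ("?n", "?c"))], ()),
    ("r2", [("Takes", ("?n", "?c"))], [("Student", ("?n", "?s"))], ("?n",)),
    ("r3", [("Student", ("?n", "?s")), ("Takes1", ("?n", "?c"))],
           [("Enrollment", ("?s", "?c"))], ()),
]
model = skolem_chase({("Takes", ("alice", "db")), ("Takes", ("alice", "ai"))}, rules)
```

The bound `max_rounds` is a safeguard chosen for the sketch, since termination is not guaranteed for arbitrary dependencies.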
For a chase-based approach to query answering to be feasible in practice, the chase algorithm should terminate and produce
a finite universal model. Termination is not guaranteed in general, and furthermore checking whether the chase terminates on all
data sets is undecidable [41]. However, many sufficient conditions that guarantee chase termination have been proposed, such
as weak [37], superweak [60], joint [54], argument restricted [57], and model-summarising and model-faithful [44] acyclicity, to
name a few. Chase termination is also guaranteed for all classes of second-order dependencies used in the literature to express
compositions of schema mappings [39, 70, 6, 7].
Note that, if the syntactic structure of dependencies is restricted adequately, then the query answering problem can be solved
even if the chase does not terminate. One possibility is to develop a finite representation of the infinite universal model. Such finite
representations can be constructed using chase variants with blocking [28, 22, 49, 11], or one can construct a tree automaton that
accepts such finite representations [43]. Another possibility is to rewrite the query and the dependencies into another formalism
that is simpler to evaluate over the data but that produces the same answer. Depending on the dependency class, the target
formalism can be first-order [31, 12, 29, 18] or Datalog [61, 15, 16, 75] queries. In this paper, however, we focus on query
answering for dependencies where the chase algorithm terminates.
1.3. The Need for Goal-Driven Query Answering
When terminating chase is used to answer queries, the universal model can be computed once and subsequently used to
answer any conjunctive query, which is clearly beneficial in many use cases. However, query answers in practice often rely
only on a relatively small portion of the universal model, and moreover universal models sometimes cannot be computed due
to their size. Hence, if the query workload is known in advance, computing the chase in full can be inefficient in the sense
that many inferences made by the chase algorithm do not contribute to the query answers. This problem is exacerbated if the
data changes frequently, so the universal model needs to be recomputed after each change. When dependencies do not contain
existential quantifiers (and are thus equivalent to Datalog possibly extended with equality), this inefficiency can be mitigated by
using an incremental maintenance algorithm that can efficiently update the universal model [58, 68, 66]; however, to the best of
our knowledge, no such algorithm is known for general first- and second-order dependencies with equalities.
To overcome these drawbacks, in this paper we turn our attention to goal-driven query answering techniques, whose underpinning
idea is to start from the query and work backwards through dependencies to identify the relevant inferences. Some of the
rewriting techniques mentioned in Section 1.2, such as the query rewriting algorithm for DL-Lite [31] or the piece-based back-
ward chaining [12], can be seen as being goal-driven; however, these are applicable only to syntactically restricted dependency
classes that can be insufficiently expressive in certain applications. SLD resolution [53] provides goal-driven query answering
for logic programs. Furthermore, the magic sets technique for logic programs [14, 19, 13] optimises the tuple-at-a-time style of
processing of SLD resolution. The idea behind the magic sets is to analyse the program’s inferences and modify the program so
that applying the chase to the result of the transformation simulates backward chaining. This is achieved by introducing auxiliary
magic predicates to accumulate the bindings that would be produced during backward chaining, and by using these predicates
as guards to restrict the program’s rules to the relevant bindings. This idea has been adapted to many contexts, such as finitely
recursive programs [30], programs with aggregates [3], disjunctive Datalog programs [4], and Shy dependencies [5].
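The effect of magic predicates can be illustrated on a plain Datalog reachability program. The Python sketch below is not a general magic sets transformation: the transformed rules are written by hand for the single query path(a, ?), and the edge data as well as the names `fixpoint`, `paths`, `full_step`, and `magic_step` are hypothetical choices made for this example.

```python
def fixpoint(step, seed=()):
    """Iterate `step` from `seed` until no new fact is produced."""
    facts = set(seed)
    while True:
        new = step(facts) - facts
        if not new:
            return facts
        facts |= new

def paths(facts):
    """Extract (x, y) pairs from the derived path facts."""
    return {(f[1], f[2]) for f in facts if f[0] == "path"}

def full_step(edges):
    # path(x,y) <- edge(x,y).    path(x,z) <- edge(x,y), path(y,z).
    def step(facts):
        out = {("path", x, y) for (x, y) in edges}
        out |= {("path", x, z) for (x, y) in edges
                for (y2, z) in paths(facts) if y2 == y}
        return out
    return step

def magic_step(edges, source):
    # magic(source).             magic(y) <- magic(x), edge(x,y).
    # path(x,y) <- magic(x), edge(x,y).
    # path(x,z) <- magic(x), edge(x,y), path(y,z).
    def step(facts):
        magic = {f[1] for f in facts if f[0] == "magic"}
        out = {("magic", source)}
        out |= {("magic", y) for (x, y) in edges if x in magic}
        out |= {("path", x, y) for (x, y) in edges if x in magic}
        out |= {("path", x, z) for (x, y) in edges if x in magic
                for (y2, z) in paths(facts) if y2 == y}
        return out
    return step

edges = {("a", "b"), ("b", "c"), ("x", "y"), ("y", "z"), ("z", "w")}
full = paths(fixpoint(full_step(edges)))            # all 9 path facts
guided = paths(fixpoint(magic_step(edges, "a")))    # only 3 path facts
```

Both evaluations yield the same answers to path(a, ?), but the guarded program never derives path facts for the component containing x, y, z, and w, because the magic predicate restricts the rules to bindings reachable from the query constant.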
As we argued in Section 1.1, second-order dependencies with equality atoms are necessary to capture many relevant data
management and knowledge representation tasks. However, to the best of our knowledge, none of the goal-driven techniques
we outlined thus far are applicable to this class of dependencies. A naïve approach might be to explicitly axiomatise equality as
an ordinary predicate [40] (see Section 2), and to use the standard magic sets technique for logic programs (possibly containing
function symbols) [19]; however, reasoning with the explicit axiomatisation of equality can be very inefficient in practice [67].
Consequently, efficient and practically successful goal-driven query answering over first- and second-order dependencies with
equality remains an open problem.
1.4. Our Contribution
In this paper, we present what we believe to be the first goal-driven approach to answering queries over first- and second-
order dependencies with equality. Our technique takes as input a dataset, a set of dependencies, and a query, and it modifies the
dependencies so that applying the chase to the dataset and the transformed dependencies avoids many inferences that are irrelevant
to the query. Our technique is inspired by the magic sets variants outlined in Section 1.3, but is considerably more involved. The
key technical problem is due to the fact that equality inferences are prolific (i.e., they can affect any predicate in any dependency)
and are highly redundant (i.e., the same conclusion is often derived in many different ways). Both of these issues make the
analysis of equality inferences very hard. Our solution is based on the following three main contributions.
First, to facilitate a precise and efficient analysis of equality inferences, we use the singularisation technique by Marnette
[60], which axiomatises equality without the congruence axioms [40] and thus avoids a key source of inefficiency in practice [67].
To compensate for the lack of congruence axioms, the dependencies need to be modified too. One can intuitively understand
this as ‘pruning’ redundant inferences from the original dependencies, which in turn allows for an efficient analysis of equality
inferences. Marnette [60] introduced singularisation for first-order dependencies, and ten Cate et al. [74] applied this technique
to second-order dependencies; however, the result is not complete: in Section 3 we present an example where singularisation by
ten Cate et al. [74] does not preserve all query answers. This problem arises because singularisation does not take into account
functional reflexivity of equality, which in turn ensures that function variables behave like functions: if we derive that a and b are
equal, then we should also ensure that f(a) and f(b) are equal as well. Completeness of singularisation can be easily recovered
by axiomatising functional reflexivity, but this prevents chase termination: if f(a) and f(b) are equal, then f(f(a)) and f(f(b))
should be equal as well, and so on ad infinitum. We overcome this by presenting a novel singularisation variant where functional
reflexivity is constrained to derive ‘just the right’ equalities: sufficient to derive all relevant answers, but without necessarily
making the universal model infinite.
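The non-termination caused by unrestricted functional reflexivity can be seen in a few lines of Python. The sketch below only illustrates the problem, not the constrained variant proposed in the paper; the string encoding of terms and the function name `unrestricted_functional_reflexivity` are assumptions made for this example.

```python
def unrestricted_functional_reflexivity(equalities, rounds, symbol="f"):
    """Apply the rule x ≈ y -> f(x) ≈ f(y) for a fixed number of rounds.

    Returns the derived equalities and the number of equalities after
    each round, so that the steady growth is visible.
    """
    eqs = set(equalities)
    sizes = [len(eqs)]
    for _ in range(rounds):
        eqs |= {(f"{symbol}({s})", f"{symbol}({t})") for (s, t) in eqs}
        sizes.append(len(eqs))
    return eqs, sizes

eqs, sizes = unrestricted_functional_reflexivity({("a", "b")}, rounds=3)
print(sizes)  # [1, 2, 3, 4]
```

Every round adds an equality between terms that are one level deeper, so no fixpoint is ever reached; this is why the rule has to be constrained rather than applied exhaustively.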
We stress that singularisation is used only to facilitate our transformation: it is ‘undone’ at the end of our transformation
and equality is treated as ‘true’ equality in the result. In other words, the final chase step (which is likely to be critical to the
performance of query answering) does not suffer from any overheads associated with axiomatising equality.
Second, we present a relevance analysis technique that can identify and eliminate dependencies for which no conclusion
is relevant to the query. Roughly speaking, this technique computes an abstraction of a universal model—that is, a model that
contains a homomorphic image of a universal model computed by the chase on the given dataset. This abstraction is then used
to perform a backward analysis of the inferences and dependencies that contribute to query answers.
Third, we present a modification of the magic sets technique for logic programs that we optimised to handle equality more
efficiently. Roughly speaking, our technique takes into account the reflexivity, symmetry, and transitivity of equality to reduce
the number of rules produced. In particular, a careful handling of reflexivity is essential: an equality of the form t ≈ t can be
derived from any predicate, so the standard magic sets transformation necessarily identifies each dependency in the input. In
contrast, if the input dependencies are safe (which intuitively ensures that the conclusions of the dependencies do not depend on
the interpretation domain), we show that reflexivity does not need to be taken into account during the magic sets transformation,
which usually reduces the number of rules in the output.
Our three techniques are complementary. In particular, singularisation facilitates the use of the relevance analysis and the
magic sets transformation, and neither of the latter two techniques subsumes the other. Thus, all three techniques are required to
facilitate efficient goal-driven query answering.
Our objective was to show that our techniques are practical, but we faced a significant obstacle: whereas we could find suitable
first-order benchmarks (e.g., by Benedikt et al. [23]), we are unaware of any publicly available benchmarks that use second-order
dependencies with equalities. To overcome this problem, in Section 5 we present a new technique that can randomly generate
such benchmarks. Several generators of dependencies and/or datasets have been proposed in practice. For example, iBench [8]
can produce dependencies capturing mapping scenarios; furthermore, ToXgene [17] can generate XML data (which can easily
be converted into relational form), and WatDiv [2] can generate RDF data. However, to the best of our knowledge, no existing
system can generate second-order dependencies. Furthermore, when dependencies and data are generated in isolation (as is
usually the case in practice), it is difficult to guarantee a certain level of ‘reasoning hardness’—that is, there is no guarantee that
the dependencies will produce interesting inferences on the generated datasets. To address these issues, our technique randomly
generates derivation trees of instantiated dependencies—that is, dependencies with variable-free terms of a bounded depth. Then,
it systematically replaces certain terms with variables to obtain standard second-order dependencies, and it produces a dataset
from the tree leaves. In this way, the resulting dependencies are guaranteed to perform at least the inferences that were considered
when the derivation trees were generated, which allows us to control a minimum level of ‘reasoning hardness’.
In Section 6, we present the results of an extensive evaluation of our techniques on a range of existing and new test scenarios
involving first- and second-order dependencies. Our objective was to determine to what extent a query can be answered more
efficiently than by using the chase to compute a universal model. Moreover, to isolate the contributions of different techniques, we
compared answering the query by relevance analysis only, magic sets only, and with both techniques combined. Our results show
that goal-driven query answering can be very effective: on certain scenarios, we were able to answer certain queries orders of
magnitude faster than by computing a universal model. Our relevance analysis technique seems to be the main reason behind the
performance improvements, but magic sets can be of benefit too. Furthermore, both techniques have the potential to considerably
reduce the number of facts derived in practice. This leads us to conclude that our techniques provide an invaluable addition to the
set of techniques for practical query answering over first- and second-order dependencies.
1.5. Summary of Contributions and Paper Structure
The results presented in this paper extend our earlier work published at the 2018 AAAI conference [25], and the main novelty
can be summarised as follows:
• an extension of the singularisation technique by Marnette [60] to second-order dependencies, which corrects the incompleteness
in the work by ten Cate et al. [74];
• an extension of the relevance analysis and magic sets techniques to second-order dependencies;
• a generator of second-order dependencies and data that can control the level of ‘reasoning hardness’ in more detail;
• the first implementation of the chase for second-order dependencies with equality; and
• an extensive empirical evaluation showing that our techniques can be effective in practice.
The rest of our paper is structured as follows. In Section 2 we recapitulate the well-known definitions, terminology, and
notation that we use throughout the paper. In Section 3 we present a running example that illustrates some of the difficulties we
need to overcome in our work, and we present a high-level overview of our approach. In Section 4 we present our approach
in detail and prove its correctness. In Section 6 we discuss the results of our experimental evaluation. Finally, in Section 7 we
recapitulate our main findings and discuss possible avenues for further work.
2. Preliminaries
We use first-order and second-order logic, as well as logic programming rules in the presentation of our results. To avoid
defining each formalism separately, we ground all of them in second-order logic as presented by Enderton [36]. To make this
paper self-contained, in the rest of this section we recapitulate the terminology, the notation, and the basic definitions of first- and
second-order logic. Next, we recapitulate the definitions of second- and first-order dependencies, and we introduce the problem
of query answering. Finally, we recapitulate the notions of logic programming, and we discuss how our definitions relate to the
definitions of dependencies commonly used in the literature.
First- and Second-Order Logic: Syntax. We fix arbitrary, countably infinite, and mutually disjoint sets of constants, individual
variables, function symbols, function variables, and predicates. Each function symbol, function variable, and predicate is
associated with a nonnegative integer arity. A term is inductively defined as a constant, an individual variable, or an expression
of the form f(t_1, ..., t_n) where t_1, ..., t_n are terms and f is an n-ary function symbol or an n-ary function variable. An atom is an
expression of the form R(t_1, ..., t_n) where R is an n-ary predicate and t_1, ..., t_n are terms called the atom’s arguments. We assume that
the set of predicates contains a distinct binary equality predicate ≈. Atoms of the form ≈(t_1, t_2) are typically written as t_1 ≈ t_2
and are called equality atoms (or just equalities); moreover, all atoms with a predicate different from ≈ are called relational.
Formulas of second-order logic are constructed as usual using Boolean connectives ∧, ∨, and ¬, first-order quantifiers ∃x and ∀x
where x is an individual variable, and second-order quantifiers ∀f and ∃f where f is a function variable. As usual, an implication
ϕ → ψ abbreviates ¬ϕ ∨ ψ. A first-order formula is a second-order formula that does not contain function variables. A sentence
is a formula with no free (individual or function) variables. In the context of first-order formulas, individual variables are typically
called just variables. Unless otherwise stated, possibly subscripted letters a, b, c, ... denote constants, s, t, ... denote terms,
x, y, z, ... denote individual variables, and f, g, h, ... denote either function symbols or function variables; in the latter case, the
intended use will always be clear from the context.
First- and Second-Order Logic: Semantics. The notion of an interpretation plays a central role in the definition of the
semantics of first- and second-order logic. An interpretation I = (∆^I, ·^I) consists of a nonempty domain set ∆^I and a function ·^I
that maps each constant a to a domain element a^I ∈ ∆^I, each n-ary function symbol f to a function f^I : (∆^I)^n → ∆^I, and each
n-ary predicate R to a relation R^I ⊆ (∆^I)^n. A valuation π on I maps each individual variable x to a domain element x^π ∈ ∆^I, and
each n-ary function variable f to an n-ary function f^π : (∆^I)^n → ∆^I. Given an interpretation I and a valuation π on I, each term t
is assigned a value t^{I,π} ∈ ∆^I as follows:
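\[
t^{I,\pi} =
\begin{cases}
a^{I} & \text{if $t$ is a constant $a$,}\\
x^{\pi} & \text{if $t$ is an individual variable $x$,}\\
f^{I}(t_1^{I,\pi}, \dots, t_n^{I,\pi}) & \text{if $t = f(t_1, \dots, t_n)$ with $f$ an $n$-ary function symbol, and}\\
f^{\pi}(t_1^{I,\pi}, \dots, t_n^{I,\pi}) & \text{if $t = f(t_1, \dots, t_n)$ with $f$ an $n$-ary function variable.}
\end{cases}
\]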
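The four cases of this definition translate directly into a recursive evaluator. The following Python sketch assumes a hypothetical encoding that is not part of the paper: terms are tagged tuples `("const", a)`, `("var", x)`, and `("app", f, [t_1, ..., t_n])`, an interpretation is a dictionary with entries `constants` and `functions`, and a valuation is a dictionary with entries `individuals` and `functions`.

```python
def evaluate(term, interp, valuation):
    """Compute the value of `term` under interpretation `interp` and `valuation`."""
    kind = term[0]
    if kind == "const":                       # constant a: use a^I
        return interp["constants"][term[1]]
    if kind == "var":                         # individual variable x: use x^pi
        return valuation["individuals"][term[1]]
    _, f, args = term                         # application f(t_1, ..., t_n)
    values = tuple(evaluate(t, interp, valuation) for t in args)
    if f in interp["functions"]:              # function symbol: use f^I
        return interp["functions"][f](*values)
    return valuation["functions"][f](*values) # function variable: use f^pi

# A hypothetical interpretation over the domain {0, 1, 2}.
interp = {"constants": {"a": 1},
          "functions": {"succ": lambda n: (n + 1) % 3}}
valuation = {"individuals": {"x": 0},
             "functions": {"g": lambda n: 0}}

term = ("app", "succ", [("app", "succ", [("const", "a")])])
print(evaluate(term, interp, valuation))  # 0
```

Function symbols are looked up in the interpretation and function variables in the valuation, which mirrors the fact that the sets of function symbols and function variables are disjoint.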
Let ϕ be a first- or second-order formula (possibly containing free first- and/or second-order variables), let I be an interpretation,
and let π be a valuation defined on all free (individual and function) variables of ϕ. We can determine whether ϕ is satisfied
in I and π, written I, π ⊨_≈ ϕ, using the standard definitions of first- and second-order logic [35, 36]; we recapitulate below only
the cases for the existential first- and second-order quantifiers.
I, π ⊨_≈ ∃x.ψ iff there exists a domain element α ∈ ∆^I such that I, π′ ⊨_≈ ψ where π′ is obtained from π by mapping x to α.
I, π ⊨_≈ ∃f.ψ iff there exists a function α : (∆^I)^n → ∆^I such that I, π′ ⊨_≈ ψ where π′ is obtained from π by mapping f to α.
The subscript ≈ in ⊨_≈ stipulates that the equality predicate is interpreted as ‘true equality’; that is, ≈^I = {⟨α, α⟩ | α ∈ ∆^I}, meaning
that two domain elements are equal if and only if they are identical. When ϕ is a sentence, the truth of ϕ in I does not
depend on a valuation, so we simply write I ⊨_≈ ϕ and call I a model of ϕ. Moreover, a sentence ϕ is satisfiable if it has a model; a
sentence ϕ entails a sentence ψ, written ϕ ⊨_≈ ψ, if each model of ϕ is also a model of ψ; finally, sentences ϕ and ψ are equivalent
if they are satisfied in exactly the same models (so ϕ ⊨_≈ ψ and ψ ⊨_≈ ϕ both hold). A model I of a sentence ϕ is universal if, for
each model J = (∆^J, ·^J) of ϕ, there exists a mapping µ : ∆^I → ∆^J such that
• µ(a^I) = a^J for each constant a,
• µ(f^I(α_1, ..., α_n)) = f^J(µ(α_1), ..., µ(α_n)) for each n-ary function symbol f and each n-tuple ⟨α_1, ..., α_n⟩ ∈ (∆^I)^n, and
• ⟨µ(α_1), ..., µ(α_n)⟩ ∈ R^J for each n-ary predicate R and each n-tuple ⟨α_1, ..., α_n⟩ ∈ R^I.
The semantics of first- and second-order logic typically does not assume the unique name assumption (UNA)—that is,
distinct constants are allowed to be interpreted as the same domain element. UNA is also typically not assumed in the knowl-
edge representation literature. In contrast, UNA is commonly used in the database literature: an attempt to equate two con-
stants results in a contradiction. For the sake of generality, we do not assume UNA in this paper. For example, sentence
ϕ = R(a, b) ∧ ∀x, y.[R(x, y) → x ≈ y] is satisfiable and it implies that a and b are the same constant; that is, ϕ ⊨_≈ a ≈ b. If
desired, one can check whether UNA is satisfied by explicitly querying the equality predicate.
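The example sentence above can be mimicked with a union-find structure that records derived equalities. The Python sketch below is only an illustration under assumptions made for this example: the class `UnionFind` and the functions `apply_egd` and `una_violations` are hypothetical names, and only the single dependency ∀x, y.[R(x, y) → x ≈ y] is handled.

```python
class UnionFind:
    """Tracks which constants have been derived to be equal."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def apply_egd(r_facts):
    """Enforce R(x, y) -> x ≈ y by merging the two arguments of each R-fact."""
    uf = UnionFind()
    for x, y in r_facts:
        uf.union(x, y)
    return uf

def una_violations(uf, constants):
    """Pairs of distinct constants that were derived equal."""
    cs = sorted(constants)
    return [(c, d) for i, c in enumerate(cs) for d in cs[i + 1:]
            if uf.find(c) == uf.find(d)]

uf = apply_egd({("a", "b")})
print(una_violations(uf, {"a", "b", "c"}))  # [('a', 'b')]
```

Without UNA, the derived equality of a and b is simply part of the result; a system that wishes to enforce UNA can treat a nonempty list of violations as a contradiction.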
Auxiliary Definitions. A term t_1 is a subterm of a term t_2 if t_1 syntactically occurs inside t_2 (note that this definition allows
t_1 = t_2); moreover, t_1 is a proper subterm of t_2 if t_1 is a subterm of t_2 and t_1 ≠ t_2. The depth dep(t) of a term t is inductively defined
as follows: dep(t) = 0 if t is a variable or a constant, and dep(t) = 1 + max{dep(t_i) | 1 ≤ i ≤ n} if t = f(t_1, ..., t_n). The depth of
an atom is equal to the maximum depth of its arguments. We often abbreviate a tuple t_1, ..., t_n of terms as 𝐭. We sometimes
abuse the notation and treat 𝐭 as a set; for example, we write t ∈ 𝐭 to indicate that term t occurs in vector 𝐭. Analogously, we
often abbreviate a tuple f_1, ..., f_n of function symbols as 𝐟. For α a term, an atom, or a set thereof, vars(α) is the set containing
precisely all variables that occur in α. A term is ground if it does not contain a variable, and an atom is ground if all of its
arguments are ground. A ground atom is often called a fact, and a base fact is a fact that neither uses the ≈ predicate nor contains
a function symbol. An instance is a (possibly infinite) set of facts, and a base instance is a finite set of base facts.
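The auxiliary notions on terms can be sketched in Python as follows. The classes `Const`, `Var`, and `Fn` and the function names are hypothetical and chosen for this illustration only; treating a nullary function term as having depth one is likewise an assumption of the sketch, since the definition above leaves that case implicit.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Const:
    name: str

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Fn:
    symbol: str
    args: Tuple["Term", ...]

Term = Union[Const, Var, Fn]

def depth(t):
    """dep(t): 0 for variables and constants, 1 + max argument depth otherwise."""
    if isinstance(t, Fn):
        return 1 + max((depth(a) for a in t.args), default=0)
    return 0

def subterms(t):
    """All subterms of t, including t itself."""
    result = {t}
    if isinstance(t, Fn):
        for a in t.args:
            result |= subterms(a)
    return result

def is_proper_subterm(t1, t2):
    """t1 occurs inside t2 and differs from t2."""
    return t1 != t2 and t1 in subterms(t2)

def variables(t):
    """vars(t): the set of variables occurring in t."""
    return {s for s in subterms(t) if isinstance(s, Var)}

def is_ground(t):
    """A term is ground if it contains no variable."""
    return not variables(t)

t = Fn("f", (Fn("g", (Var("x"),)), Const("a")))   # the term f(g(x), a)
print(depth(t), is_ground(t))  # 2 False
```

Frozen dataclasses are used so that terms are hashable and can be collected in sets, which matches the set-based reading of vars(α).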
Generalised Second-Order Dependencies. Ageneralised second-order (SO) dependency is a second-order formula of the
form
∃f.h∀x1.ϕ1(x1)→ ∃y1.ψ1(x1,y1)∧ · · · ∧ ∀xn.ϕn(xn)→ ∃yn.ψn(xn,yn)i,(6)
where f is a tuple of function variables and, for each i ∈ {1,...,n},
• xi and yi are tuples of distinct individual variables,
• ϕi(xi) is a conjunction consisting of (i) relational atoms whose arguments are constructed using constants and individual
variables in xi, and (ii) equality atoms whose arguments are terms of depth at most one constructed using constants,
individual variables in xi, and function variables in f,
• ψi(xi,yi) is a conjunction of atoms whose arguments are individual variables in yi, or terms of depth at most one constructed
using constants, individual variables in xi, and function variables in f, and
• each individual variable in xi appears in ϕi(xi) in a relational atom.
The last condition is called safety and is needed to ensure domain independence—that is, that the satisfaction of a formula in an
interpretation does not depend on the choice of the interpretation domain [39]. Since formula (6) can contain an arbitrary number
of conjuncts, it suffices to consider just one generalised second-order dependency (instead of a set of dependencies). However, to
simplify the notation, we often write down a generalised SO dependency as a set of formulas of the form ϕi(xi) → ∃yi.ψi(xi,yi)
where quantifiers ∃f and ∀xi are left implicit. For each conjunct δ = ∀xi.[ϕi(xi) → ∃yi.ψi(xi,yi)], formulas ∃yi.ψi(xi,yi) and ϕi(xi)
are called the head and body of δ, respectively, and are denoted by h(δ) and b(δ).
Unlike most definitions of SO dependencies in the literature, we assume that all terms occurring in a generalised SO dependency
are of depth at most one; that is, terms with nested function variables, such as f(g(x)), are disallowed. This assumption
allows us to simplify our results, and it is without loss of generality: Arenas et al. [6, Theorems 7.1 and 7.2] have shown that,
for each generalised SO dependency Σ, there exists an equivalent generalised SO dependency Σ′ where all terms are of depth
at most one. Apart from this technical assumption, the syntactic form of (6) generalises most notions of dependencies we are
familiar with. In particular, standard tuple-generating and equality-generating dependencies (TGDs and EGDs) [37] are obtained
by disallowing function variables. The existential second-order dependencies (∃SOEDs) by Nash et al. [70] are obtained by
disallowing first-order existential quantifiers (∃yi). The source-to-target SO dependencies by Arenas et al. [6] are obtained by
further disallowing constants in all (relational and equality) atoms and requiring all equality atoms in ψi(xi,yi) to be of the form
x≈x′. The SO-TGDs by Fagin et al. [39] are obtained by further disallowing equality atoms in formulas ψi(xi,yi). Finally, the
plain SO-TGDs by Arenas et al. [7] are obtained by further disallowing equality atoms in ϕi(xi).
Generalised First-Order Dependencies. In our approach, we shall eliminate second-order quantifiers and reduce the query
answering problem to reasoning in first-order logic over finite sets of generalised first-order (FO) dependencies, which are
formulas of the form ∀x.[ϕ(x)→ ∃y.ψ(x,y)] where ϕ(x) and ψ(x,y) satisfy the same conditions as in SO dependencies, but
terms are constructed using function symbols instead of function variables. Thus, the only difference between FO and SO
dependencies is that FO dependencies can contain function symbols, whereas SO dependencies can contain quantified function
variables; we discuss the subtleties of this distinction in Section 4.1. The notions of the head and body of an FO dependency
are analogous to the SO case. Finally, formula ∀x.[ϕ(x)→ ∃y1,y2.(ψ1(x,y1)∧ψ2(x,y2))] is equivalent to the conjunction of
formulas ∀x.[ϕ(x)→ ∃y1.ψ1(x,y1)] and ∀x.[ϕ(x)→ ∃y2.ψ2(x,y2)], so, without loss of generality, we assume that the head of
each generalised first-order dependency is normalised by applying these equivalences as long as possible.
Queries and Query Answers. Conjunctive queries and unions of conjunctive queries are the most common query languages for
dependencies considered in the literature [37]. In our work, it is convenient to ‘absorb’ the notion of a query into dependencies.
Thus, we assume that there exists a distinguished query predicate Q, and that a query is defined as part of the formula (6) using
one or more conjuncts of the form ∀xi.[ϕi(xi) → Q(xi′)] where xi′ ⊆ xi, and where the query predicate Q does not occur in the
body of any conjunct of the formula (6). In other words, atoms with the query predicate cannot occur in a conjunct body, they
can occur only as the sole atom of a conjunct head, and they are not allowed to contain constants, existentially quantified variables,
or function variables. Thus, a conjunctive query of the form ∃y.ϕ(x,y) as defined by Fagin et al. [37] corresponds to a formula
∀x,y.[ϕ(x,y)→Q(x)] in our setting.
For Σ a generalised SO dependency, Q a query defined by Σ, and B a base instance, a tuple of constants a is an answer to Q
over Σ and B, written {Σ} ∪ B |=≈ Q(a), if I |=≈ Q(a) for each interpretation I such that I |=≈ {Σ} ∪ B.¹ Analogously, for Σ a set of
generalised FO dependencies, Q a query defined by Σ, and B a base instance, a tuple of constants a is an answer to Q over Σ and
B, written Σ ∪ B |=≈ Q(a), if I |=≈ Q(a) for each interpretation I such that I |=≈ Σ ∪ B.
Skolemisation. The Skolemisation of a conjunct δ = ∀x.[ϕ(x) → ∃y.ψ(x,y)] of a generalised SO dependency is the formula
δ′ = ∀x.[ϕ(x) → ψ′(x)], where ψ′(x′) is obtained from ψ(x,y) by replacing each existentially quantified individual variable y ∈ y
with a term g(x′) where g is a fresh function variable and x′ contains all variables of x that occur free in ψ(x,y). For all such
δ and δ′, let f be the tuple of function variables occurring in δ, and let g be the tuple of function variables occurring in δ′ but
not δ; then, formulas ∃f.δ and ∃f,g.δ′ are equivalent. Thus, first-order quantification is redundant in second-order dependencies
because first-order quantifiers can always be eliminated via Skolemisation; however, as we discuss in Section 4.3, distinguishing
first- and second-order quantification allows us to optimise our query answering approach.
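The Skolemisation of a conjunct head can be sketched as follows. This is only an illustrative sketch under several assumptions: atoms are encoded as pairs (predicate, arguments), an argument is either a name (a string) or a depth-one term encoded as (function, argument names), and the function name skolemise_head and the naming scheme g1, g2, ... are hypothetical. The caller is assumed to supply names that do not clash with function variables already in use.

```python
from itertools import count


def skolemise_head(xs, ys, head, fresh):
    """Replace each existential variable in `ys` by a term g(x'), where g is
    drawn from the iterator `fresh` and x' lists the variables of `xs` that
    occur in the head."""
    def names(arg):
        # Names occurring in an argument: the argument itself, or the
        # arguments of a depth-one function term.
        return [arg] if isinstance(arg, str) else list(arg[1])

    used = {n for _, args in head for arg in args for n in names(arg)}
    x_prime = tuple(x for x in xs if x in used)
    sub = {y: (next(fresh), x_prime) for y in ys}
    # Existential variables occur only as top-level arguments of head atoms,
    # so function terms are copied unchanged.
    return [(p, tuple(sub.get(a, a) if isinstance(a, str) else a for a in args))
            for p, args in head]


# Head ∃y.[S(x1, y) ∧ T(y, f(x2))] of a conjunct with universal variables x1, x2, x3.
fresh = (f"g{i}" for i in count(1))
head = [("S", ("x1", "y")), ("T", ("y", ("f", ("x2",))))]
skolemised = skolemise_head(("x1", "x2", "x3"), ("y",), head, fresh)
```

In this example, x3 does not occur in the head, so y is replaced by g1(x1, x2) rather than g1(x1, x2, x3).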
Skolemisation can also be applied to generalised FO dependencies, in which case it introduces fresh function symbols (rather
than function variables). These fresh function symbols cannot be quantified (as they are not variables), so a set Σ of generalised
FO dependencies is not necessarily equivalent to its Skolemisation Σ′. However, for each base instance B and each fact of the
form Q(a), we have Σ ∪ B |=≈ Q(a) if and only if Σ′ ∪ B |=≈ Q(a); that is, Skolemisation does not affect the query answers.
Throughout this paper, we call the function symbols (resp. function variables) occurring in the input FO or SO dependencies
true function symbols (resp. function variables), and we call the function symbols (resp. function variables) introduced by
Skolemisation Skolem function symbols (resp. function variables). We discuss the rationale for this distinction in Section 4.3.
Equality as an Ordinary Predicate. Our algorithms will need to analyse inferences that use the equality predicate, which,
as we discuss in detail in Section 3, can be very challenging. We overcome this issue by explicitly axiomatising the properties
of equality and treating ≈as an ordinary predicate. To clearly distinguish the two uses of equality, we shall use the symbol |=
for satisfaction and entailment whenever we assume that ≈is an ordinary predicate without any special meaning. For example,
let ϕ = R(a,b) ∧ S(a,c) ∧ ∀x,x1,x2.[R(x,x1) ∧ S(x,x2) → x1 ≈ x2]. Then, ϕ |=≈ b ≈ c and ϕ |=≈ c ≈ b; the latter holds because |=≈
interprets ≈ as a symmetric predicate. In contrast, ϕ |= b ≈ c, but ϕ ⊭ c ≈ b; the latter holds because |= interprets ≈ as just
another predicate whose semantics is specified only by the formula ϕ. Even when ≈ is interpreted as an ordinary predicate, we
shall still syntactically distinguish relational and equality atoms as outlined earlier.
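The difference between the two readings of ≈ can be observed on the example formula with a small forward-chaining sketch. The encoding is an assumption made for illustration only: facts are pairs (predicate, arguments), variables are strings starting with "?", the equality predicate is written "eq", and the function names match and fixpoint are hypothetical. Since the example is function-free and Horn, the facts entailed under |= coincide with the facts in the computed fixpoint.

```python
def match(atom, fact, subst):
    """Extend `subst` so that `atom` matches `fact`; return None on failure."""
    pred, args = atom
    fpred, fargs = fact
    if pred != fpred or len(args) != len(fargs):
        return None
    s = dict(subst)
    for a, v in zip(args, fargs):
        if a.startswith("?"):
            if s.setdefault(a, v) != v:
                return None
        elif a != v:
            return None
    return s


def fixpoint(facts, rules):
    """Apply the rules (body, head) to the facts until nothing new is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            substs = [{}]
            for atom in body:
                substs = [s2 for s in substs for f in facts
                          if (s2 := match(atom, f, s)) is not None]
            for s in substs:
                new = (head[0], tuple(s.get(a, a) for a in head[1]))
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts


base = {("R", ("a", "b")), ("S", ("a", "c"))}
rule = ([("R", ("?x", "?x1")), ("S", ("?x", "?x2"))], ("eq", ("?x1", "?x2")))
symmetry = ([("eq", ("?u", "?v"))], ("eq", ("?v", "?u")))

plain = fixpoint(base, [rule])                     # ≈ as an ordinary predicate
with_symmetry = fixpoint(base, [rule, symmetry])   # symmetry stated explicitly
```

Here plain contains the fact for b ≈ c but not the one for c ≈ b, whereas with_symmetry contains both; the other properties of equality are not modelled in this sketch.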
Explicit Axiomatisation of Equality. When equality is treated as an ordinary predicate, the properties of equality can be
explicitly axiomatised so that there is no distinction in the query answers. In particular, for Σ a generalised SO dependency and B
a base instance, let EQ(Σ) be a conjunction containing a domain dependency (7) instantiated for each constant c occurring in Σ,
a domain dependency (8) instantiated for each n-ary predicate R occurring in Σ distinct from ≈ and Q and each i ∈ {1,...,n}, the
reflexivity dependency (9), the symmetry dependency (10), the transitivity dependency (11), a functional reflexivity dependency
(12) instantiated for each true n-ary function variable f occurring in Σ, and a congruence dependency (13) instantiated for each
n-ary predicate R distinct from ≈ and each i ∈ {1,...,n}.
¹ Note that, in the case of second-order dependencies, Σ is a single formula and not a set; thus, Σ ∪ B is ill-defined, so we write {Σ} ∪ B instead.
→ D(c)    (7)
R(x1,...,xn) → D(xi)    (8)
D(x) → x ≈ x    (9)
x1 ≈ x2 → x2 ≈ x1    (10)
x1 ≈ x2 ∧ x2 ≈ x3 → x1 ≈ x3    (11)
x1 ≈ x1′ ∧ ··· ∧ xn ≈ xn′ → f(x1,...,xn) ≈ f(x1′,...,xn′)    (12)
R(x1,...,xn) ∧ xi ≈ xi′ → R(x1,...,xi−1, xi′, xi+1,...,xn)    (13)
Intuitively, dependencies (7) and (8) ensure that the D predicate enumerates the domain of {Σ} ∪ EQ(Σ) ∪ B, dependencies (9)–
(11) and (13) axiomatise equality as a congruence relation, and dependencies (12) ensure that the function variables indeed behave
like functions. Note that dependencies (12) need to be instantiated only for true function variables, but not for the
function variables introduced by Skolemisation. In Section 4.3 we show that this important detail allows us to optimise our query
answering approach in certain cases.
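To make the instantiation of dependencies (7)–(13) concrete, the following sketch generates them as plain strings from a signature. The function name eq_axioms and the string syntax are hypothetical; the sketch assumes that the caller passes the predicates of Σ other than ≈ together with their arities, and that only true function variables are listed, in line with the remark above about Skolem function variables.

```python
def eq_axioms(constants, predicates, true_functions, query_pred="Q"):
    """Return the dependencies (7)-(13) as strings.

    constants: iterable of constant names
    predicates: dict mapping predicate names (other than ≈) to arities
    true_functions: dict mapping true function variables to arities
    """
    axioms = [f"-> D({c})" for c in constants]                       # (7)
    for r, n in predicates.items():                                  # (8)
        if r != query_pred:  # (8) is not instantiated for the query predicate
            xs = ", ".join(f"x{j}" for j in range(1, n + 1))
            axioms += [f"{r}({xs}) -> D(x{i})" for i in range(1, n + 1)]
    axioms += ["D(x) -> x ≈ x",                                      # (9)
               "x1 ≈ x2 -> x2 ≈ x1",                                 # (10)
               "x1 ≈ x2 ∧ x2 ≈ x3 -> x1 ≈ x3"]                       # (11)
    for f, n in true_functions.items():                              # (12)
        body = " ∧ ".join(f"x{j} ≈ x{j}'" for j in range(1, n + 1))
        lhs = ", ".join(f"x{j}" for j in range(1, n + 1))
        rhs = ", ".join(f"x{j}'" for j in range(1, n + 1))
        axioms.append(f"{body} -> {f}({lhs}) ≈ {f}({rhs})")
    for r, n in predicates.items():                                  # (13)
        xs = [f"x{j}" for j in range(1, n + 1)]
        for i in range(n):
            ys = xs[:i] + [xs[i] + "'"] + xs[i + 1:]
            axioms.append(
                f"{r}({', '.join(xs)}) ∧ {xs[i]} ≈ {xs[i]}' -> {r}({', '.join(ys)})")
    return axioms


example = eq_axioms(["a"], {"R": 2, "Q": 1}, {"f": 1})
```

For a signature with one constant a, a binary predicate R, the unary query predicate Q, and one true unary function variable f, this yields ten dependencies; no instance of (12) is produced for a Skolem function variable because it is simply not listed in true_functions.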
Note that formula EQ(Σ) has free function variables, and Σis a sentence containing second-order quantifiers over function
variables; hence, combining the two formulas as Σ∧EQ(Σ) is incorrect because the function variables of EQ(Σ) are still free and
not under the scope of the second-order quantifiers of Σ. Thus, let ΣfEQ(Σ) be the generalised SO dependency where second-
order quantifiers cover all conjuncts of Σand EQ(Σ). It is well known that {Σ} ∪ B|=≈Q(a) if and only if {ΣfEQ(Σ)} ∪ B|=Q(a)
[40] for each fact of the form Q(a)—that is, explicit axiomatisation of equality preserves query answers.
For Σa set of generalised first-order dependencies, we define EQ(Σ) analogously as a set of dependencies of the form (7)–
(13); again, dependencies (12) need to be instantiated for the true function symbols only. Since Σand EQ(Σ) are sets with no
second-order quantifiers, union Σ∪EQ(Σ) is correctly defined and, analogously to the second-order case, we have Σ∪B|=≈Q(a)
if and only if Σ∪EQ(Σ)∪B|=Q(a) for each fact of the form Q(a).
While equality axiomatisation may seem convenient, reasoning algorithms that take into account special properties of equal-
ity, such as the one we present next, are typically much more efficient than computing the fixpoint of Σ∪EQ(Σ)∪B[67].
The Chase. A standard way to answer the query Q over a second-order dependency Σ and a base instance B is to compute a
universal model for Σ and B using an appropriate variant of the chase algorithm, and then to simply ‘read off’ the facts that use
the Q predicate and consist of constants only. Numerous chase variants have been proposed [59, 21, 37, 27, 60, 73, 62, 33, 39, 6].
To the best of our knowledge, Arenas et al. [6] presented the first chase variant that is applicable to second-order dependencies
with equality atoms in the heads. We next present a variant that can handle both first- and second-order quantification, and that
distinguishes true and Skolem function variables. Both of these points allow us to optimise reasoning in certain cases.
The algorithm relies on the set of labelled nulls, which are objects whose existence is implied by first- and second-order quantifiers.
In particular, we distinguish a countably infinite set of base labelled nulls disjoint from the set of constants, and a countably
infinite set of functional labelled nulls defined inductively as the smallest set containing a distinct object n_t for each term of the
form t = f(u1,...,un) where f is a (true or Skolem) function variable and u1,...,un are constants or (base or functional) labelled
nulls. Finally, we assume that all constants and labelled nulls are totally ordered using an arbitrary, but fixed ordering ≺ where
all constants precede all labelled nulls.
The chase algorithm for SO dependencies takes as input a generalised second-order dependency Σ of the form (6) and a base
instance B. We allow Σ to contain both true and Skolem function variables. To simplify the presentation, we assume that no
constant occurs in the body of a conjunct of Σ; thus, the bodies of the conjuncts of Σ consist of relational atoms with variable
arguments, as well as of equalities with terms of depth at most one. The algorithm manipulates facts whose arguments are constants
or (base or functional) labelled nulls. Furthermore, for each true n-ary function variable f, the algorithm introduces a distinct
(n+1)-ary predicate Vf. Let I be an instance consisting of facts constructed using the predicates of Σ and B, the equality predicate
≈, and the predicates Vf where f is a true function variable, and furthermore assume that {Vf(u1,...,un,v), Vf(u1,...,un,v′)} ⊆ I
implies v = v′ for all f and u1,...,un. For t a term of depth at most one, we define the value of t in I, written t^I, as follows.
• If t is a constant or a (base or functional) labelled null, then t^I = t.
• If t = f(u1,...,un), then t^I = v if I contains a fact of the form Vf(u1,...,un,v), and t^I = n_t otherwise.
Moreover, let F = R(t1,...,tn) be an equality or a relational ground atom where terms t1,...,tn are all of depth at most one.
Whether F is true in I, written I ⊢ F, is defined as follows.
• If F is an equality t1 ≈ t2, then I ⊢ t1 ≈ t2 if t1^I = t2^I.
• Otherwise, I ⊢ R(t1,...,tn) if R(t1^I,...,tn^I) ∈ I.
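The two definitions above (the value of a term in I and the truth of a ground atom in I) can be sketched in a few lines. The encoding is an assumption made only for this illustration: constants and labelled nulls are strings, a depth-one term is a pair (f, arguments), a fact is a pair (predicate, arguments), the predicate for a true function variable f is named "V_f", and the functional labelled null for a term such as g(a) is written as the string "n[g(a)]". The function names value and holds are hypothetical.

```python
def value(t, I):
    """Value of a term of depth at most one in the instance I."""
    if isinstance(t, str):
        # Constants and (base or functional) labelled nulls denote themselves.
        return t
    f, args = t
    for pred, fargs in I:
        # Use the recorded value if I contains a fact V_f(u1,...,un,v).
        if pred == f"V_{f}" and fargs[:-1] == tuple(args):
            return fargs[-1]
    # Otherwise the value is the functional labelled null for the term.
    return f"n[{f}({','.join(args)})]"


def holds(F, I):
    """Check whether the ground atom F is true in I."""
    pred, args = F
    vals = tuple(value(t, I) for t in args)
    if pred == "≈":
        # An equality is true if both sides have the same value in I.
        return vals[0] == vals[1]
    return (pred, vals) in I


I = {("R", ("a", "n[g(a)]")), ("V_f", ("a", "b"))}
f_a = ("f", ("a",))   # true function variable f with a recorded value
g_a = ("g", ("a",))   # no V_g fact exists, e.g. because g is a Skolem function variable
```

With this instance, f(a) evaluates to b because of the fact V_f(a, b), whereas g(a) evaluates to its functional labelled null, so R(a, g(a)) is true in I.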
Given Σ and B, the chase algorithm constructs a sequence of pairs ⟨I0, µ0⟩, ⟨I1, µ1⟩, .... Each Ii is an instance containing facts
constructed using constants and labelled nulls, and using the predicates of Σ and B, the equality predicate ≈, and the predicates
Vf. Each µi is a mapping from constants to constants, and it will record constant representatives. For α an atom, µi(α) is the
result of replacing each occurrence of a constant c