COLLABORATIVE DATA SHARING WITH MAPPINGS AND PROVENANCE

Todd J. Green

A DISSERTATION

in

Computer and Information Science

Presented to the Faculties of the University of Pennsylvania in Partial

Fulfillment of the Requirements for the Degree of Doctor of Philosophy

2009

Zachary G. Ives

Supervisor of Dissertation

Val Tannen

Supervisor of Dissertation

Jianbo Shi, Associate Professor, Computer and Information Science

Graduate Group Chairperson

Dissertation Committee

Susan B. Davidson, Professor, Computer and Information Science

Sanjeev Khanna, Professor, Computer and Information Science

Jeffrey F. Naughton, Professor, Computer Science, University of Wisconsin-Madison

Benjamin C. Pierce, Professor, Computer and Information Science

COPYRIGHT

Todd J. Green

2009

To Elisabeth

ABSTRACT

COLLABORATIVE DATA SHARING WITH MAPPINGS AND PROVENANCE

Todd J. Green

Supervisors: Zachary G. Ives and Val Tannen

A key challenge in science today involves integrating data from databases managed by differ-

ent collaborating scientists. In this dissertation, we develop the foundations and applications of

collaborative data sharing systems (CDSSs), which address this challenge. A CDSS allows collab-

orators to define loose confederations of heterogeneous databases, relating them through schema

mappings that establish how data should flow from one site to the next. In addition to simply

propagating data along the mappings, it is critical to record data provenance (annotations describ-

ing where and how data originated) and to support policies allowing scientists to specify whose

data they trust, and when. Since a large data sharing confederation is certain to evolve over time,

the CDSS must also efficiently handle incremental changes to data, schemas, and mappings.

We focus in this dissertation on the formal foundations of CDSSs, as well as practical issues

of their implementation in a prototype CDSS called Orchestra. We propose a novel model of data

provenance appropriate for CDSSs, based on a framework of semiring-annotated relations. This

framework elegantly generalizes a number of other important database semantics involving an-

notated relations, including ranked results, prior provenance models, and probabilistic databases.

We describe the design and implementation of the Orchestra prototype, which supports update

propagation across schema mappings while maintaining data provenance and filtering data ac-

cording to trust policies. We investigate fundamental questions of query containment and equiv-

alence in the context of provenance information. We use the results of these investigations to

develop novel approaches to efficiently propagating changes to data and mappings in a CDSS.

Our approaches highlight unexpected connections between the two problems and with the prob-

lem of optimizing queries using materialized views. Finally, we show that semiring annotations

also make sense for XML and nested relational data, paving the way towards a future extension

of CDSS to these richer data models.

Contents

1 Introduction
1.1 Overview of CDSS and Orchestra
1.2 Overview of Technical Contributions
1.3 Roadmap

2 Provenance Semirings
2.1 Queries on Annotated Relations
2.2 Positive Relational Algebra
2.3 Polynomials for Provenance
2.4 A Hierarchy of Provenance
2.5 Datalog on K-Relations
2.6 Formal Power Series for Provenance
2.7 Computing Provenance Series
2.8 Application to Incomplete/Probabilistic Databases
2.9 Provenance-Annotated Queries

3 Update Exchange in Orchestra
3.1 CDSS Update Exchange
3.2 Update Exchange Formalized
3.3 Performing Update Exchange
3.4 Implementation
3.5 Experimental Evaluation

4 Optimizing Queries on Annotated Relations
4.1 Preliminaries
4.2 Containment Mappings
4.3 Bounds from Semiring Homomorphisms

4.4 Main Results
4.5 Datalog

5 Ring-Annotated Relations and Differences
5.1 Applications of Differences
5.2 Z-Relations
5.3 Reformulation Using Views
5.4 Finding Query Rewritings
5.5 Applications to Bag and Set Semantics
5.6 Built-in Predicates
5.7 Z[X]-relations

6 Semiring-Annotated XML
6.1 Semiring Annotations
6.2 Annotated and Unordered XML
6.3 A Security Application
6.4 Incomplete and Probabilistic K-UXML
6.5 Semantics via Complex Values
6.6 Commutation with Homomorphisms
6.7 Semantics via Relations

7 Related Work
7.1 Paradigms for Data Integration
7.2 Provenance and Annotated Data Models
7.3 Update Exchange
7.4 Query Containment and Equivalence
7.5 Ring-Annotated Relations and Updates
7.6 Semiring-Annotated XML
7.7 Further Work on Orchestra

8 Conclusions and Future Directions
8.1 Immediate Next Steps
8.2 Longer-Term Directions

Bibliography

List of Figures

1.1 Mappings among three bioinformatics databases
1.2 Bioinformatics database instances before and after data exchange
1.3 Two views of provenance for bioinformatics example
1.4 Tabular provenance representation using system of equations
1.5 Complexity of containment and equivalence with provenance annotations
1.6 Update propagation and query reformulation
1.7 Mapping evolution and query reformulation
1.8 Representing and computing updates with ring-annotated relations
1.9 Semiring-annotated XML example

2.1 A maybe-table and a query result
2.2 Result of Imielinski-Lipski computation
2.3 Bag semantics example
2.4 Probabilistic example
2.5 Lineage and provenance polynomials
2.6 Comparison of provenance annotations
2.7 Provenance hierarchy
2.8 Datalog with bag semantics
2.9 Datalog example
2.10 Derivation trees for Datalog example
2.11 Probabilistic data integration example

3.1 Mappings among three bioinformatics databases
3.2 Dataflow at a single peer in a CDSS
3.3 Local edit logs and update translation
3.4 Provenance graph for bioinformatics example
3.5 Effects of a peer’s edit log

3.6 Provenance graph for derivability testing example
3.7 Example provenance graph showing relationships between tuples
3.8 Relative performance of deletion propagation algorithms
3.9 Time to join and relation instance sizes
3.10 Scalability of incremental update propagation algorithms
3.11 Scalability for increasing base data sizes
3.12 Effects of mapping fan-in/fan-out on storage overhead and performance

4.1 Complexity of containment and equivalence
4.2 Logical implications of containment and equivalence

5.1 Algebraic identities for the difference operator under Z-semantics

6.1 Simple for Example
6.2 K-UXQuery Syntax
6.3 K-UXQuery Typing Rules
6.4 XPath Example
6.5 Relational (encoded) example
6.6 Extended Annotations Example
6.7 Translation from positive relational algebra to K-UXQuery
6.8 Security Clearance Example
6.9 Syntax and Typing rules for NRC
6.10 Semantic Equations for NRCK + srt
6.11 Compilation of UXQuery
6.12 Compilation of XPath to NRCK + srt

7.1 Data warehouse architecture
7.2 Feature comparison of data integration paradigms
7.3 Virtual data integration architecture
7.4 Peer-to-peer data integration architecture
7.5 Data exchange architecture

Acknowledgments

I would like to thank my advisors, Zack Ives and Val Tannen, for their generous support and

guidance during my studies at Penn. It was a privilege to have been able to work with them.

Most of the work described in this dissertation was performed in collaboration with others,

including my advisors and my fellow PhD students Grigoris Karvounarakis and Nate Foster. I

would like to thank them here, and emphasize that many of the contributions presented in the

present document were shaped by several of us working together. In particular, the material in

Chapter 2 is based on a paper [62] by the author with Grigoris Karvounarakis and Val Tannen,

as part of the larger Orchestra project headed by Zack Ives; Chapter 3 is based on a paper [60]

by the author with Grigoris Karvounarakis, Zack Ives, and Val Tannen; Chapter 5 is based on a

paper [59] by the author with Zack Ives and Val Tannen; and Chapter 6 is based on a paper [48]

by the author with Nate Foster and Val Tannen.

The work described in this dissertation benefited at many points from useful feedback and

suggestions from others. I would like to thank in particular Olivier Biton, James Cheney, Jan

Chomicki, Sarah Cohen-Boulakia, Nilesh Dalvi, Floris Geerts, Giorgio Ghelli, Renée Miller, Tova

Milo, Kristoffer Rose, Jérôme Siméon, Nicholas Taylor, Stijn Vansummeren, and the members of

the Penn database group. I also would like to thank the members of my dissertation committee

for their thoughtful feedback and comments that helped improve this document.

Finally, I would like to thank my wife, Elisabeth Dubin, for her love and support over the past

six years.

Chapter 1

Introduction

Sharing structured data is one of the most basic yet persistently vexing problems of data manage-

ment, and has been studied since the earliest days of database research. In recent years, with the

growth of the Internet and explosion of data on the Web, there has been a renewed and sustained

push to develop solutions for integrating data from heterogeneous sources, translating data be-

tween different schemas, and other forms of data sharing. Much progress has been made in both

establishing solid theoretical foundations and developing practical systems. These efforts range

from pre-loading all data into a single database instance after transforming it to a common for-

mat, then answering queries over this instance (as with data exchange [45]); to transforming data

“on demand” or “on the fly” to answer specific queries (as with virtual data integration [135]).

Commercial tools using these techniques are available in several forms, including data

warehousing systems and enterprise information integration systems.

Despite this progress, the practical impact of data sharing solutions proposed to date has been

surprisingly modest, limited mainly to relatively small-scale efforts centralized inside a single

corporation or controlling entity. At the same time, the need for sharing data across broader

communities is increasing, especially in the sciences, which have become highly data-driven as

they attempt to tackle larger questions. In bioinformatics, for example, there is a plethora of

different databases, each providing a different perspective on a collection of organisms, genes,

proteins, diseases, and so on. The data in these databases is highly interrelated — e.g., there are

links between genes and proteins, and gene homologs between species — but the databases have

proved very difficult to integrate using existing technologies.

One major reason for this is that conventional data integration tools require the development

of a single global schema and assume that the data is globally consistent. In practice, both require-

ments are unrealistic. Designing a single schema for an entire community is arduous, involves

many revisions, and requires a central administrator. Data from different sources can often be

contradictory, reflecting different points of view, while current data integration tools assume there

should be a consensus or “clean” version of the data. Moreover, users of integrated data often

prefer to carefully curate it, manually or semi-automatically, rejecting some items and modifying

others. The goals of curation are to eliminate inconsistencies, to fit data to user assumptions, and

to enforce different degrees of trust in the sources. Tolerating inconsistency and supporting

curation require a measure of decentralization in data sharing.

Another key requirement in many data sharing scenarios, not supported by current tools, is

to track and record where data has come from and how it has come to be where it is, in other

words, its provenance. In scientific data sharing scenarios, this is not optional: data provenance

is an essential part of the scientific record as well as crucial information used in curation. How-

ever, as data sharing typically involves complex transformations across the different schemas of

participants, formulating the “right” notion of data provenance to capture these complexities is a

challenging problem in itself.

Finally, most data sharing scenarios have important dynamic aspects. Data is constantly being

updated – refined, corrected, expanded – and the participants naturally desire the most recent

version of it. As has been observed since the earliest data integration investigations, this creates

a tension between the efficiency of answering queries and the freshness of the obtained answers,

with data warehousing favoring the former and virtual integration the latter. Moreover, it is not

just the data that is dynamic, but also the set of participants, their schemas, and the relationships

among them (as we shall see, the mappings which relate data sources). Especially in the “boot-

strapping” phase of a collaboration — where the complex relationships among participants and

data are only partially understood — such changes can occur as often as changes to the source

data. We need a platform which copes with changes both to data and to the mappings among

data sources.

To address these practical requirements, Ives, Khandelwal, and Kapur [82] described a vision

for a collaborative data sharing system (CDSS), an approach that supports looser, decentralized

confederations of participants (peers) with local curation abilities and tolerates inconsistencies in

the data while still preserving a notion of semantic consistency. The goal of this dissertation is to

realize this vision through a host of novel concepts and algorithms, developing both theoretical

foundations and implementation techniques for supporting data provenance, dealing efficiently

with the dynamic aspects of data and mappings through incrementality and optimization, and

finally, evaluating these ideas in a prototype system called Orchestra.

The main contributions of the dissertation are as follows:

• We develop a powerful formal model of provenance for relational data that fulfills CDSS

requirements and captures as special cases many other provenance models proposed in

earlier work.

• We implement these ideas in the Orchestra prototype CDSS, solving key engineering chal-

lenges such as the representation of provenance and efficient propagation of data updates in

an off-the-shelf DBMS, and we also validate experimentally the feasibility of our approach.

• We make a systematic study of the fundamental questions of query containment and equiv-

alence in the presence of provenance information, and obtain positive decidability results

and complexity characterizations for five different models of provenance.

• We use these results to develop algorithms for efficient propagation of updates to mappings

in a CDSS. The algorithms highlight unexpected connections with the problem of data up-

date translation, and make fundamental use of provenance information. We also implement

and evaluate these algorithms in the Orchestra prototype.

• We show that our provenance model extends to richer data models such as XML and nested

relations, giving further evidence of the robustness of our approach and paving the way

towards a future extension of CDSS to these data models.

The development of Orchestra and CDSS has itself been a collaborative effort, and we em-

phasize that most of the contributions described here were developed in collaboration with others,

including Nate Foster, Zack Ives, Grigoris Karvounarakis, Val Tannen, and the members of the

Penn database group. We provide a more detailed attribution of the shared contributions in

Section 1.2.

In the remainder of this chapter, we will give an informal overview of the main

functionalities of the Orchestra CDSS, touching on the key issues mentioned above (Section 1.1), then

elaborate on the technical contributions of the dissertation (Section 1.2). We finish with an outline

of the rest of this dissertation (Section 1.3).

1.1 Overview of CDSS and Orchestra

The CDSS model builds upon the fundamentals of data integration, peer data management sys-

tems [73] (PDMS), and data exchange [45], while incorporating major new functionalities. As in

a PDMS, the CDSS contains a set of peers, each representing an autonomous domain of control.

Each peer’s administrator has full control over a local DBMS, its schema, and the conditions un-

der which the peer trusts data from other peers. In most of this dissertation we assume relational

schemas, but our model extends naturally to other data models such as XML or nested relations

(see Chapter 6). In this section, we first give an overview of how the peers’ data is related by

schema mappings, as used in PDMS and data exchange. Then we discuss the semantics of query

answers, where Orchestra follows the principles of data exchange [45]. Most importantly, we

describe Orchestra’s fundamentally novel features regarding update exchange, data provenance,

and dynamic update strategies through incrementality and optimization.

The topology of data sharing: specifying how peers are related

Earlier work on PDMS pioneered the notion of supporting multiple mediated schemas, thereby

relaxing some aspects of administration, and we adopt this perspective as a starting point in

CDSS. As in PDMS, a CDSS collaboration begins with a set of peer databases, each with its own

schema and local data instance.

Example 1.1. Consider (see Figure 1.1) a bioinformatics collaboration scenario based on databases

of interest to affiliates of the Penn Center for Bioinformatics. In general, GUS, the Genomics

Unified Schema¹, covers gene expression, protein, and taxon (organism) information; BioSQL,

affiliated with the BioPerl project², covers very similar concepts; and a third schema, uBio³,

establishes synonyms among taxa. Instances of these databases contain taxon information that is

autonomously maintained but of mutual interest to the others. For the purposes of our exam-

ple we show only one relational table in each of the schemas, as follows. Peer GUS associates

taxon identifiers, scientific names, and what it considers canonical scientific names via relation

G(gid,nam,can); peer BioSQL associates its own taxon identifiers with scientific names via rela-

tion B(bid,nam); and peer uBio records synonyms of scientific names via relation U(nam1,nam2).

Again as in PDMS, the participants of the CDSS collaboration specify the relationships among

their databases using schema mappings.

Example 1.2. Continuing with Example 1.1, suppose it is agreed in this collaboration that certain

data in GUS should also be in BioSQL. This is represented in Figure 1.1 by the arc labeled m1. The

specification G(g,n,c) → ∃b B(b,n) associated with m1 is read as follows: if (g,n,c) is in table G,

the value n must also be in some tuple (b,n) of table B, although the value b in such a tuple is

not determined. The specification just says that there must be such a b, and this is represented by

the existential quantification ∃b. Here m1 is an example of a schema mapping. Two other mappings

appear in Figure 1.1. Peer uBio should also have some of GUS’s data, as specified by m2. Mapping

m3 is quite interesting: it stipulates data in BioSQL based on data in uBio but also on data already

in BioSQL. As seen in m3, relations from multiple peers may occur on either side. We also see that

individual mappings can be “recursive” and that cycles are allowed in the graph of mappings.

Peer      Relation
GUS       G(gid,nam,can)
BioSQL    B(bid,nam)
uBio      U(nam1,nam2)

(m1)  G(g,n,c) → ∃b B(b,n)
(m2)  G(g,n,c) → U(n,c)
(m3)  B(b,n1) ∧ U(n1,n2) → B(b,n2)

Figure 1.1: Mappings among three bioinformatics databases

¹ http://www.gusdb.org

² http://bioperl.org

³ http://www.ubio.org

Schema mappings are logical assertions that the data instances at various peers are expected

to jointly satisfy. We shall see (Section 1.2) that they correspond to the well-known formalism of

tuple-generating dependencies (tgds) [7]. We shall also see (Chapter 3) that, as in data exchange [45],

a large class of mapping graph cycles can be handled safely, while certain complex examples

cause problems.

Thus, every CDSS specification begins with a collection of peers/participants, each with its

own relational schema, and a collection of schema mappings between some of these peers. Like

the schemas, the mappings are designed by the participants’ administrators. By joining Orches-

tra, the participants agree to share the data from their local databases. The sharing can be further

modulated through the mappings, which should therefore be subject to agreement between par-

ticipants.

The semantics of query answers

Given a CDSS configuration of peers and schema mappings, the question arises of how data

should be propagated using the mappings, and what should be the answer to a query asked by

one of the peers. The whole point of data integration is for such an answer to use data from

all the peers. In CDSS, we wish to accomplish this by materializing at every peer an instance

containing not only the peer’s locally contributed data, but also additional facts that must be true,

given the data at the other peers along with the constraints specified by the mappings. Queries

at a peer will be answered using this local materialized instance. However, while the mappings

relating the peers tell us which peer instances are together considered “acceptable,” they do not

fully specify the complete peer instances to materialize.

In CDSS we follow established practice in data integration, data exchange and incomplete

information databases [1, 45] and use certain answers semantics: a tuple is “certain” if it appears

in the query answer no matter what data instances (satisfying the mappings) we apply the query

to. In virtual data integration, the certain answers to a query are computed by reformulating the

query across all peer instances using the mappings, and combining the answers together from the

local results computed at each peer. In CDSS, as in data exchange, we materialize special local

instances that can be used to compute the certain answers. This makes query evaluation a fast,

local process.

We illustrate this with our running example.

Example 1.3. Suppose the contents of G, U, and B are as shown in Figure 1.2(a). Note that the

mappings of Figure 1.1 are not satisfied: for example, G contains a tuple

(828917,“Oscinella frit”,“Drosophila melanogaster”)

but B does not contain any tuple with “Oscinella frit”, which is a violation of m1. Let us patch

this by adding to B a tuple (⊥1,“Oscinella frit”) where ⊥1 represents the unknown value specified

by ∃b in the mapping m1. We call ⊥1 a labeled null. Adding just enough patches to eliminate

all violations results in the data in Figure 1.2(b). (Note that sometimes patching one violation

may introduce a new violation, which in turn must be patched; hence this process is generally

an iterative one; however, under certain constraints it always terminates.) Observe that we can

replace ⊥1 and the other labeled nulls with combinations of arbitrary values to obtain an entire

class of data instances that satisfy the mappings.

Now, consider two relational algebra queries:

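\[ Q_1 \stackrel{\text{def}}{=} \sigma_{\mathit{nam} = \text{“Oscinella frit”}}(B) \]

\[ Q_2 \stackrel{\text{def}}{=} \pi_{\mathit{nam}}\bigl(\sigma_{\mathit{nam} = \text{“Oscinella frit”}}(B)\bigr) \]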

and let us see which answers are certain when we apply these queries to the data instances ob-

tained by replacing the labeled nulls with various values. There is no tuple in common among the

various answers Q1 produces, hence its set of certain answers is empty. However, all answers

produced by Q2 have the tuple (“Oscinella frit”) in common. This is a certain answer for Q2.
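
The patching procedure of Example 1.3 can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions, not the Orchestra implementation: the names Null, MAPPINGS, matches, patch_violations, and figure_1_2a are hypothetical, every term in a mapping is treated as a variable, and termination is not guaranteed for arbitrary mappings (it does terminate for m1, m2, and m3).

```python
from itertools import count


class Null:
    """A labeled null: a placeholder for an unknown value (⊥1, ⊥2, ...)."""

    def __init__(self, label):
        self.label = label

    def __repr__(self):
        return f"⊥{self.label}"


# Each mapping is (body atoms, head atoms, existential variables).
# An atom is (relation name, tuple of variables); every term is a variable.
MAPPINGS = {
    "m1": ([("G", ("g", "n", "c"))], [("B", ("b", "n"))], ["b"]),
    "m2": ([("G", ("g", "n", "c"))], [("U", ("n", "c"))], []),
    "m3": ([("B", ("b", "n1")), ("U", ("n1", "n2"))], [("B", ("b", "n2"))], []),
}


def matches(atoms, instance, binding):
    """Yield each extension of `binding` that maps all atoms to stored tuples."""
    if not atoms:
        yield binding
        return
    (rel, variables), rest = atoms[0], atoms[1:]
    for row in list(instance[rel]):
        extended = dict(binding)
        if all(extended.setdefault(v, x) == x for v, x in zip(variables, row)):
            yield from matches(rest, instance, extended)


def patch_violations(instance, mappings):
    """Add tuples (with fresh labeled nulls) until no mapping is violated."""
    fresh = count(1)
    changed = True
    while changed:
        changed = False
        for body, head, existential in mappings.values():
            for binding in list(matches(body, instance, {})):
                if next(matches(head, instance, binding), None) is not None:
                    continue  # the head is already satisfied for this match
                for var in existential:
                    binding[var] = Null(next(fresh))
                for rel, variables in head:
                    instance[rel].append(tuple(binding[v] for v in variables))
                changed = True
    return instance


def figure_1_2a():
    """The instance of Figure 1.2(a), before data exchange."""
    return {
        "G": [(828917, "Oscinella frit", "Drosophila melanogaster"),
              (2616529, "Musca domestica", "Musca domestica")],
        "B": [(4472, "Periplaneta americana")],
        "U": [],
    }


if __name__ == "__main__":
    patched = patch_violations(figure_1_2a(), MAPPINGS)
    for relation, rows in patched.items():
        print(relation, rows)
```

Relations are kept as lists so that tuples are visited in insertion order, which makes the numbering of the labeled nulls deterministic in this sketch.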

The procedure we used in the example to resolve mapping violations by patching instances is

intuitive but it is not clear that (1) it always works, and (2) the data instances obtained by replacing

labeled nulls with arbitrary values are representative of all instances satisfying the mappings and

therefore give us the certain answers. In fact, the theory of data exchange [45] has resolved

both these problems (see Section 1.2 and Chapter 3). Moreover, it has established the following

G :
gid       nam               can
828917    Oscinella frit    Drosophila melanogaster
2616529   Musca domestica   Musca domestica

U :
nam1      nam2
(no tuples)

B :
bid       nam
4472      Periplaneta americana

(a) Tables from GUS, uBio, and BioSQL before data exchange (U is empty)

U :
nam1              nam2
Oscinella frit    Drosophila melanogaster
Musca domestica   Musca domestica

B :
bid       nam
4472      Periplaneta americana
⊥1        Oscinella frit
⊥1        Drosophila melanogaster
⊥2        Musca domestica

(b) Updated tables after data exchange. Shaded rows indicate newly-inserted tuples. (G is unchanged.)

Figure 1.2: Bioinformatics database instances before and after data exchange

convenient query answering algorithm: to obtain the certain answers to a query it suffices to

evaluate the query over a specifically computed data instance with labeled nulls (as if the labeled

nulls were ordinary values) and then to discard any tuples in the result containing labeled nulls.

As a consequence, Orchestra works with peer instances in which tuples may contain labeled

nulls (actually, in the slightly more complicated form of Skolem terms; see Section 1.2).
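
The query answering algorithm just described (evaluate the query as if labeled nulls were ordinary values, then discard tuples containing nulls) can be illustrated on relation B of Figure 1.2(b). The following Python sketch is an assumption-laden illustration rather than Orchestra code: the names Null, q1, q2, and certain are hypothetical, and labeled nulls are modeled as plain objects instead of Skolem terms.

```python
class Null:
    """A labeled null; two nulls are equal only if they are the same object."""

    def __init__(self, label):
        self.label = label

    def __repr__(self):
        return f"⊥{self.label}"


# Relation B from Figure 1.2(b), with labeled nulls ⊥1 and ⊥2.
N1, N2 = Null(1), Null(2)
B = [
    (4472, "Periplaneta americana"),
    (N1, "Oscinella frit"),
    (N1, "Drosophila melanogaster"),
    (N2, "Musca domestica"),
]


def q1(b):
    """Q1: select the tuples of B whose nam is "Oscinella frit"."""
    return {row for row in b if row[1] == "Oscinella frit"}


def q2(b):
    """Q2: project the result of Q1 onto the nam attribute."""
    return {(nam,) for (_bid, nam) in q1(b)}


def certain(answers):
    """Keep only the answer tuples that contain no labeled null."""
    return {row for row in answers if not any(isinstance(v, Null) for v in row)}


if __name__ == "__main__":
    print("certain answers to Q1:", certain(q1(B)))
    print("certain answers to Q2:", certain(q2(B)))
```

In this sketch, the only tuple returned by q1 carries the null ⊥1 and is therefore discarded, whereas q2 projects the null away before the filter is applied.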

Dynamic operation and update exchange

We have seen that Orchestra performs query answering on locally stored instances rather than

on-the-fly. This raises the question of data freshness in the peer instances built via data exchange.

Data exchange is essentially a static procedure: it focuses on the one-time computation of entire

peer instances satisfying the mappings. Across an entire CDSS this can be disruptive, as not all

peers may wish to refresh their data at the same time.

In contrast, the CDSS design sees data sharing as a fundamentally dynamic process, with

frequent “refreshing” data updates that need to be propagated efficiently. This process of dynamic

update propagation is called update exchange. It is closely related to the classical view maintenance

problem, and we shall discuss these connections at length in later chapters.

Operationally, CDSS functions in a manner reminiscent of revision control systems, but with

peer-centric conflict resolution strategies. The users located at a peer P query and update the

local instance in an “offline” fashion. Their updates are recorded in a local edit log. Periodically,

upon the initiative of P’s administrator, P requests that the CDSS perform an update exchange

operation. This publishes P’s local edit log — making it globally available via central or distributed

storage [131]. This also subjects P to the effects of the updates that the other peers have published


(since the last time P participated in an update exchange). To determine these effects, the CDSS

performs incremental update translation using the schema mappings to compute corresponding

updates over P’s local instance: the translation finds matches of incoming tuples to the LHS of

the mapping, and applies these matchings to the RHS to produce outgoing tuples (recall the

“patching” process in Example 1.3).

Doing this only on updates means performing data exchange incrementally, with the goal of

maintaining peer instances satisfying the mappings. Therefore, in CDSS the mappings are more

than just static specifications. They can be seen as underlying the dynamic process of propagation

of updates.

Example 1.4. Refer again to Figure 1.2(b), and suppose now that the curator of uBio updates her

database by adding another synonym for the fruit fly: U(“Oscinella frit”,“Oscinella frit Linnaeus”).

When this update is published, it introduces a violation of mapping m3 that must again be

“patched.” The update translation process therefore inserts a corresponding tuple into BioSQL:

B(⊥1,“Oscinella frit Linnaeus”).

Local curation and trust policies

The CDSS paradigm also differs from traditional data exchange in its incorporation of mechanisms

allowing local curation of database instances. This allows CDSS users to “override” the system

and modify or delete any data in their local database, even data that has been imported from

elsewhere via schema mappings. In this way, CDSS users retain full control over the contents

of their local database. The technical obstacle in supporting such a feature is how to preserve

a notion of semantic consistency with respect to the mappings, when such modifications may

introduce violations of the mappings.

Example 1.5. Refer again to Figure 1.2(b), and suppose that the curator of BioSQL decides that

she is not interested in house flies, and wishes to delete tuple B(⊥2,“Musca domestica”) from her

local instance. Since as we have seen, the presence of this tuple was forced by the mappings, the

deletion of the tuple leads to a violation of the mappings.

To allow such local curation, CDSS stores deleted tuples in rejection tables, and converts user-

specified mappings into internal mappings that take the rejection tables explicitly into account.

The notion of “solution” adopted in CDSS is with respect to these converted mappings. Details

are discussed in Chapter 3. Note that, sometimes, local curation corrects mistakes in the imported

data, and it is desirable for the corrections to be propagated back to the sources from which

the data came. To support this, the Orchestra prototype incorporates facilities for bidirectional


mappings. We do not discuss bidirectional mappings in this dissertation, but the interested reader

can refer to [87, 61, 86] for details.

CDSS incorporates a second mechanism for local control of database instances in the form of

provenance-based trust policies. These are essentially selection predicates, but of a special kind that

operates on data provenance. They allow CDSS administrators to specify which data is “trusted,”

depending on its provenance (Orchestra’s mechanism for managing data provenance is dis-

cussed in Section 1.2).

Example 1.6. Some possible trust policies in our bioinformatics example:

• Peer BioSQL distrusts any tuple B(b,n) if the data came from GUS and n = “Musca domestica”,

and trusts any tuple from uBio.

• Peer BioSQL distrusts any tuple B(b,n) that came from mapping m2.

Once specified, the trust policies are incorporated in the update exchange process: when the

updates are being translated into a peer P’s schema they are accepted or rejected based on P’s

trust conditions.

The example above illustrates Boolean trust policies, which specify a black-or-white classification

of tuples as either (completely) trusted or (completely) untrusted, depending on their provenance

and contents. In fact, CDSS also allows richer forms of ranked trust policies in which trust scores are

computed for tuples indicating various “degrees of trust.” When a conflict is detected among

data from multiple sources (for example, by a primary key violation), these scores can be used

to resolve the conflict by selecting the tuple with the highest trust score and discarding those

with which it conflicts. We focus mainly on Boolean trust policies in this dissertation, but briefly

discuss ranked trust policies in Chapter 3.

Provenance

A major novel feature of CDSS, and a central focus of this dissertation, is the tracking of data

provenance. This is the information which records “how data was propagated” during the update

exchange process. Data provenance is needed for several purposes in CDSS: as the foundation of

provenance-based trust policies, for update exchange itself, as a means of guiding efficient incremen-

tal algorithms, and as additional information that can be displayed and searched by CDSS users

as they perform data curation and other manual tasks. Chapter 2 develops the formal machinery

of the provenance model used in CDSS; here we just give an informal overview highlighting the

main features.


The most intuitive way to picture CDSS provenance is as a graph having two kinds of nodes:

tuple nodes, one for each tuple in the system, and mapping nodes, where several such nodes can be

labeled by the same mapping name. Edges in the graph represent derivations of tuples from other

tuples using mappings. Special mapping nodes labeled “+” are used to identify original source

tuples.

Example 1.7. Refer to Figure 1.3, which depicts the provenance graph corresponding to the bioin-

formatics example from Figure 1.2 (we have abbreviated the data values to save space). Ob-

serve there is a “+” mapping node pointing to G(26,Musc.,Musc.); this indicates that it is one

of the source tuples from Figure 1.2(a), present before data exchange was performed. Next, ob-

serve that G(26,Musc.,Musc.) is connected to U(Musc.,Musc.) via a mapping node labeled m2.

This indicates that U(Musc.,Musc.) was derived using G(26,Musc.,Musc.) with m2. Also, notice

that B(⊥2,Musc.) has a derivation from G(26,Musc.,Musc.) via mapping m1. Finally, note that

B(⊥2,Musc.) also has a second derivation, from itself and U(Musc.,Musc.) via mapping m3. The

graph is thus cyclic, with tuples involved in their own derivations. In general, when mappings

are recursive, the provenance graph may have cycles.

The graphical model of provenance is convenient for several purposes in CDSS, including the

system implementation that we shall describe in Chapter 3. However, most of the theoretical

development in this dissertation will focus on another, equivalent perspective on CDSS prove-

nance based on a powerful framework of semiring-annotated relations. As we will explain in

Section 1.2 (and Chapter 2), semirings are algebraic structures that arise naturally with database

provenance. This framework has the virtue of uniformly capturing as special cases many other

provenance models that have been proposed in the literature (such as the why-provenance of [17],

the data warehousing lineage of [35], the Trio lineage of [8], and the event tables used in probabilistic

databases [50, 138]), as well as the traditional set and bag semantics. By putting all these models

on equal footing, we are able to make precise comparisons of various kinds — we can compare

their relative informativeness, for example, or study their various interactions with the query opti-

mization process. In particular, it allows us to explain precisely why the Orchestra provenance

is strictly more informative than any of these. Provenance annotations also turn out to be useful

in formulating efficient algorithms for update exchange and mapping evolution (as overviewed

in Section 1.2).

Briefly, the main idea in the semiring-annotations framework is as follows. Tuples in source

relations are annotated with elements from some domain of annotations K. During query processing,

when tuples are joined together to produce another tuple, their annotations are combined using

an abstract ⊗ operation. When output tuples are produced by alternative combinations (such

as via union or projection operators), the alternatives are recorded using an abstract ⊕ operation.

(a) Graphical provenance representation

[Provenance graph not reproduced: tuple nodes for the tuples of G, U, and B, connected through mapping nodes labeled “+”, m1, m2, and m3.]

(b) Tabular provenance representation

G:  gid | nam   | can   | annotation
    82  | Osci. | Dros. | g1
    26  | Musc. | Musc. | g2

U:  nam1  | nam2  | annotation
    Osci. | Dros. | m2 ⊗ g1
    Musc. | Musc. | m2 ⊗ g2

B:  bid | nam   | annotation
    44  | Peri. | b1
    ⊥1  | Osci. | m1 ⊗ g1
    ⊥1  | Dros. | m3 ⊗ (m1 ⊗ g1) ⊗ (m2 ⊗ g1)
    ⊥2  | Musc. | (m1 ⊗ g2) ⊕ (m3 ⊗ (m1 ⊗ g2) ⊗ (m2 ⊗ g2)) ⊕ ···

Figure 1.3: Two views of provenance for bioinformatics example

Example 1.8. Refer to Figure 1.3(b), which shows the tables corresponding to Figure 1.3(a) with

their semiring provenance annotations. Observe that the annotation of G(26,Musc.,Musc.) is g2,

indicating that this is a source tuple (cf. the “+” in Figure 1.3(a)). Next observe that the annotation

of U(Musc.,Musc.) is m2 ⊗ g2, indicating that the tuple was derived from G(26,Musc.,Musc.)

using mapping m2. Finally, observe that the annotation of B(⊥2,Musc.) is an infinite sum (using

⊕), one term for each of the (infinitely many) derivations of that tuple (cf. the cycle in the

provenance graph for that tuple). The first term in the sum, m1 ⊗ g2, indicates its derivation

from G(26,Musc.,Musc.) using mapping m1. The rest of the terms in the sum correspond to the

derivations using the tuple itself and U(Musc.,Musc.) repeatedly with mapping m3.

Mapping evolution

One of the most difficult tasks for users of a CDSS is to formulate the mappings that relate the

peers. In practice, mappings are not static, but are subject to frequent revisions, corrections,

additions, etc. which occur as peers join and leave the system, as the schemas of peers evolve over


time, and as the understanding of relationships among peers becomes clarified over time. This

phenomenon has been referred to as mapping evolution in [136], and handling mapping evolution

efficiently (i.e., updating instances in an incremental fashion as mappings change) is another key

requirement in CDSS. In early phases of a collaboration’s development, changes to mappings can

occur at least as often as changes to data in the peers, while in later phases, changes are less

frequent but potentially even more disruptive (we would prefer to avoid recomputing instances

from scratch every time, e.g., a new peer joins a collaboration).

Example 1.9. Consider the bioinformatics example from Figure 1.1, and suppose that the administrator

of BioSQL notices that mapping m1 is incorrect, because the “id” field of GUS needs to be

translated into the style used in BioSQL. To accomplish this, she introduces a correspondence table⁴

T(gid,bid) and modifies mapping m1 to translate ids using T:

(m1′) G(i,n,c) ∧ T(i,b) → B(b,n)

In order to process this update, note that some – but not all! – of the data in B will need to be

updated (only those tuples which came from G), and since the mapping from B to U drops the id

field, the data in U should be wholly unaffected.

Handling mapping evolution efficiently in a CDSS is closely related to the so-called view adap-

tation problem [65], in which a materialized view must be updated after the view definition

changes.

1.2 Overview of Technical Contributions

Having surveyed the main functions of Orchestra and CDSS in the previous section, let us

now examine some of the details of how these functionalities are accomplished. These details

constitute the novel technical contributions of this dissertation. For the contributions resulting

from joint work, we give paper references in each section below.

Before covering the technical contributions, we first need to recall some necessary background

concepts from previous work on data exchange.

Background: data exchange, chase, and universal solutions

We saw in Example 1.3 that a basic task in data exchange and CDSS involves “patching” database

instances related by schema mappings, in order to satisfy the mappings. We begin this section by

explaining more about the kind of schema mappings used in CDSS, and how this fundamental

task of “patching” is accomplished.

⁴ In practice, correspondence tables are often implemented using user-defined functions; but conceptually a

user-defined function may be viewed as a relation.

Tuple-generating dependencies (tgds) are a widely-used means of specifying constraints and

mappings [45, 38] in data integration and data exchange.⁵ A tgd is a first-order logical assertion

of the form

∀x̄ ∀ȳ (ϕ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄)),

where the left hand side (LHS) of the implication, ϕ, is a conjunction of relational atoms over

variables x̄ and ȳ, and the right hand side (RHS) of the implication, ψ, is a conjunction of relational

atoms over variables x̄ and z̄. For readability, we will generally omit the universal quantifiers and

simply write

ϕ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄),

as in the mappings in Figure 1.1. The tgd expresses a constraint about the existence of tuples in

the relations on the RHS, given a particular combination of tuples satisfying the constraint of the

LHS.

Tgds by themselves merely describe the database instances which together are considered

“acceptable.” To put the mappings to work in data sharing, CDSS builds upon earlier work

in data exchange where the well-known chase procedure can be used to propagate data among

databases, in accordance with the schema mappings, until the mappings are satisfied. Essentially,

this procedure involves repeatedly looking for violations of mappings, and adding new tuples

to the database instances to repair the violations. If mappings contain existentially quantified

variables, the new tuples will contain special placeholder labeled null values. The addition of new

tuples may in turn introduce new violations, so the procedure must be repeated until all the

mappings are satisfied. This process is guaranteed to terminate in time polynomial in the size of

the input database instances, so long as the topology of the mappings is weakly acyclic (explained

in Chapter 3).
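
The chase loop described above can be sketched as follows for the three mappings m1, m2, and m3 of the running example, applied to the instances of Figure 1.2(a). This is a minimal illustration and not the Orchestra implementation: the function name chase is hypothetical, labeled nulls are modeled as the strings "⊥1", "⊥2", ..., and the mappings are hard-coded rather than read from a specification.

```python
from itertools import count

def chase(G, B, U):
    """Repair violations of m1, m2, m3 until every mapping is satisfied."""
    B, U = set(B), set(U)
    fresh = count(1)  # supplies labels for new labeled nulls
    changed = True
    while changed:
        changed = False
        for (_gid, nam, can) in G:
            # m1: G(g,n,c) -> exists b. B(b,n); violated if no B tuple has name n
            if not any(n == nam for (_b, n) in B):
                B.add((f"⊥{next(fresh)}", nam))
                changed = True
            # m2: G(g,n,c) -> U(n,c)
            if (nam, can) not in U:
                U.add((nam, can))
                changed = True
        # m3: B(b,n1) and U(n1,n2) -> B(b,n2)
        for (b, n1) in list(B):
            for (u1, n2) in list(U):
                if n1 == u1 and (b, n2) not in B:
                    B.add((b, n2))
                    changed = True
    return B, U

G = [(828917, "Oscinella frit", "Drosophila melanogaster"),
     (2616529, "Musca domestica", "Musca domestica")]
B0 = {(4472, "Periplaneta americana")}
B1, U1 = chase(G, B0, set())
```

On this input the loop stops after two passes and reproduces the tables of Figure 1.2(b). Termination here relies on the example being small and well behaved; the sketch makes no attempt to check weak acyclicity.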

The peer database instances computed by the chase together constitute a solution: a (joint)

database instance satisfying the mappings which is obtained from the starting (joint) database

instance by adding tuples to it.⁶ In general, for a given starting joint instance and set of mappings,

there are infinitely many possible solutions. The one that is computed by the chase, however, has

a special property: it is a universal solution, essentially, a solution that can be homomorphically

embedded in any other solution. Because of this, a universal solution can be used to obtain the

certain answers to positive relational algebra queries (Section 1.1).

⁵ Tgds are related to so-called global-local-as-view or GLAV mappings [49, 73], which in turn generalize the earlier

global-as-view (GAV) and local-as-view (LAV) mapping formulations [100].

⁶ CDSS does not draw the sharp distinction between source and target instances that is usually made in data exchange.

We justify this perspective in Chapter 3.

Contribution 1: mappings as Datalog programs

(with Karvounarakis, Ives, and Tannen [60])

Although so far we have focused on mappings as logical constraints, another point of view proves

useful in developing much of the theoretical machinery for CDSS, as well as its practical imple-

mentation. This is the perspective where we interpret a set of tgds as a Datalog program. Recall

the tgds of Figure 1.1(b):

(m1) G(g,n,c) → ∃b B(b,n)

(m2) G(g,n,c) → U(n,c)

(m3) B(b,n1) ∧ U(n1,n2) → B(b,n2)

We can interpret these tgds as Datalog rules (possibly with Skolem functions), where the RHS of the

implication becomes the head of the rule, and the LHS becomes the body. Thus the set of mappings

above corresponds to the Datalog program:

(r1) B(f(n),n) :- G(g,n,c)

(r2) U(n,c) :- G(g,n,c)

(r3) B(b,n2) :- B(b,n1),U(n1,n2)

Note that rule r1 contains a Skolem function f. This corresponds to the existential quantifier in

m1. Intuitively, the Skolem function introduces a unique labeled null f(n) whose value depends

on f and n.

Although it was not illustrated in the above example, some tgds may have multiple relational

atoms in the RHS. For example, we could have written a mapping like

(m4) G(g,n,c) → ∃b B(b,n),C(b,c)

where C is also a relation of peer BioSQL, used to store canonical names. In this case, m4 would

be interpreted as a pair of Datalog rules

(r4) B(g(n,c),n) :- G(g,n,c)

(r5) C(g(n,c),c) :- G(g,n,c)

where g is a Skolem function. Note that both rules use the same Skolem function g, with the same

parameters; this ensures that the association between scientific names and canonical names in B

and C is recorded and can be recovered later by a query joining B with C.

The key fact supporting this Datalog-based perspective on mappings is that, given a set of

tgds, evaluating the corresponding Datalog program also produces a universal solution. This


enables us to move freely between the two perspectives. Most of this dissertation will adopt the

Datalog-based perspective.

Contribution 2: semiring-annotated relations and provenance

(with Karvounarakis and Tannen [62])

Provenance information plays a central role in many of the basic tasks of collaborative data shar-

ing, including the specification of trust policies and algorithms for efficient propagation of updates

to data and mappings. Here we outline some more of the details of the novel semiring-based

model used in Orchestra.

As already mentioned in Section 1.1, the main idea in the semiring-annotations framework is

to decorate tuples in source relations with annotations from a domain K and combine these anno-

tations during relational algebra query processing using abstract ⊗ (joint use) and ⊕ (alternative

use) operators. Additionally, to model selection predicates, the ⊗ operation is used together with

distinguished elements 0 and 1 from K. Finally, we observe that certain identities on relational

algebra queries (needed for compatibility with standard DBMS query optimizers) hold exactly

when the structure (K,⊕,⊗,0,1) is a commutative semiring.

By plugging in the Boolean semiring (B,∨,∧,false,true) we obtain the usual set semantics

(where tuples “in” a relation are annotated true, and tuples “not in” a relation are annotated

false). By plugging in the semiring of natural numbers (N,+,·,0,1), we obtain the usual bag

semantics (where a relation with duplicate entries is modeled as a set relation with annotations

representing tuple multiplicities). Still other semirings can be used to capture the various prove-

nance models listed earlier. The “most informative” kind of provenance annotations in this frame-

work is the semiring of provenance polynomials: this is the commutative semiring (N[X],+,·,0,1)

of polynomials with natural number coefficients over a set of uninterpreted variables X.
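
To make the role of the semiring parameter concrete, the following sketch evaluates one fixed query, Q(x,y) :- R(x,z), R(z,y), over the same four-edge relation under the Boolean, natural-number, and polynomial semirings. Everything named here (Semiring, var, paths2, the edge data, and the encoding of a polynomial as a Counter from monomials to coefficients) is an assumption made for illustration and is not taken from the Orchestra code base.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    plus: Callable   # abstract ⊕ (alternative use)
    times: Callable  # abstract ⊗ (joint use)

BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

def poly_times(p, q):
    """Multiply two polynomials stored as Counter: monomial (sorted tuple of variables) -> coefficient."""
    out = Counter()
    for mono1, c1 in p.items():
        for mono2, c2 in q.items():
            out[tuple(sorted(mono1 + mono2))] += c1 * c2
    return out

POLY = Semiring(Counter(), Counter({(): 1}), lambda p, q: p + q, poly_times)

def var(name):
    """The polynomial consisting of the single variable `name`."""
    return Counter({(name,): 1})

def paths2(R, K):
    """Q(x,y) :- R(x,z), R(z,y) over a K-annotated relation R (dict: tuple -> annotation)."""
    out = {}
    for (x, z), a in R.items():
        for (z2, y), b in R.items():
            if z == z2:
                out[(x, y)] = K.plus(out.get((x, y), K.zero), K.times(a, b))
    return out

edges = [("a", "b", "p", 2), ("a", "c", "q", 1), ("b", "d", "r", 3), ("c", "d", "s", 1)]
R_bool = {(x, y): True for (x, y, _v, _m) in edges}
R_nat = {(x, y): m for (x, y, _v, m) in edges}
R_poly = {(x, y): var(v) for (x, y, v, _m) in edges}

q_bool = paths2(R_bool, BOOL)
q_nat = paths2(R_nat, NAT)
q_poly = paths2(R_poly, POLY)
```

The only output tuple is (a,d), which has two derivations. Its annotation is true under set semantics, the multiplicity 2·3 + 1·1 under bag semantics, and the polynomial p·r + q·s under N[X]; the last records which source tuples were used jointly and which combinations are alternatives.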

For recursive queries, such as Datalog programs corresponding to cyclic sets of tgds, a technical

difficulty arises because provenance annotations may become infinite. Indeed, we have already

seen this phenomenon illustrated in Example 1.8. The annotations in that example (cf. Figure 1.3)

are not polynomials; rather they are formal power series.

As a practical matter, working directly with infinite formal power series is infeasible. However,

as we shall see in Chapter 2, it turns out that the formal power series which arise in provenance

computations have a natural, finite representation as a system of algebraic equations. The main idea

is fairly intuitive. For each tuple t in the output of a recursive Datalog query, we associate a

variable x and write an equation x = ... whose RHS represents the alternative ways in which

t can be produced as an immediate consequence of other tuples. By doing this for all tuples in

the output, we obtain a system of algebraic equations whose least fixpoint is exactly the (possibly

infinite) formal power series annotations.

G:  gid | nam   | can   | equation
    82  | Osci. | Dros. | g1 = g1
    26  | Musc. | Musc. | g2 = g2

U:  nam1  | nam2  | equation
    Osci. | Dros. | u1 = m2 ⊗ g1
    Musc. | Musc. | u2 = m2 ⊗ g2

B:  bid | nam   | equation
    44  | Peri. | b1 = b1
    ⊥1  | Osci. | b2 = m1 ⊗ g1
    ⊥1  | Dros. | b3 = m3 ⊗ b2 ⊗ u1
    ⊥2  | Musc. | b4 = (m1 ⊗ g2) ⊕ (m3 ⊗ b4 ⊗ u2)

Figure 1.4: Tabular provenance representation using system of equations

Example 1.10. Refer to Figure 1.4, which shows the same provenance-annotated tables of Fig-

ure 1.3(b), but this time represented using a system of equations. Note that the annotation for the

source tuple G(82,Osci.,Dros.) is now g1, which is trivially equal to g1. Tuple U(Osci.,Dros.) was

obtained as an immediate consequence of G(82,Osci.,Dros.) using mapping m2, so the equation

for its annotation is u1 = m2 ⊗ g1; by plugging in g1 = g1 we obtain the annotation m2 ⊗ g1, as

in Figure 1.3(b). The interesting case in the example is the annotation b4 for B(⊥2,Musc.), whose

equation involves the tuple itself: b4 = (m1 ⊗ g2) ⊕ (m3 ⊗ b4 ⊗ u2) (representing the two alternative

ways of deriving the tuple as an immediate consequence). The solution for b4 is the infinite

formal power series

b4 = (m1 ⊗ g2) ⊕ (m3 ⊗ (m1 ⊗ g2) ⊗ (m2 ⊗ g2)) ⊕ ···

shown in Figure 1.3(b).

The representation of provenance annotations as a system of algebraic equations is equivalent

to the graphical representation we saw earlier in this section.
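
A system of equations of this kind can be solved by straightforward iteration once a concrete semiring with a finite domain is chosen. The sketch below is an illustration only: it interprets ⊗ as "and" and ⊕ as "or" (the Boolean semiring), the names least_fixpoint and boolean_system are hypothetical, and the recursive term of b4 joins the tuple with the annotation u2 of U(Musc.,Musc.), as rule r3 prescribes. Over the formal power series themselves, this naive iteration would not terminate for b4.

```python
def least_fixpoint(equations, zero):
    """Kleene iteration: start every variable at zero and re-evaluate
    all right-hand sides until no value changes."""
    values = {x: zero for x in equations}
    while True:
        updated = {x: rhs(values) for x, rhs in equations.items()}
        if updated == values:
            return values
        values = updated

def boolean_system(g1, g2, m1=True, m2=True, m3=True):
    """Equations for the derived tuples of the running example, read in the Boolean semiring."""
    return {
        "u1": lambda v: m2 and g1,
        "u2": lambda v: m2 and g2,
        "b2": lambda v: m1 and g1,
        "b3": lambda v: m3 and v["b2"] and v["u1"],
        # Two alternatives: directly from G via m1, or from itself and U via m3.
        "b4": lambda v: (m1 and g2) or (m3 and v["b4"] and v["u2"]),
    }

all_present = least_fixpoint(boolean_system(g1=True, g2=True), False)
without_g2 = least_fixpoint(boolean_system(g1=True, g2=False), False)
```

With the source token g2 set to false, b4 evaluates to false even though the tuple appears in its own equation: the least fixpoint never lets a tuple support itself through the cycle.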

Note that in the examples, variables representing mapping names themselves participate in the

provenance calculations. To support this, the semiring framework includes a scalar multiplication

construct that can be used freely in mappings and queries. Thus, for the tgd

(m2) G(g,n,c) → U(n,c)

from the example above, we would employ the corresponding Datalog rule

(r2) m2 ⊗ U(n,c) :- G(g,n,c)

which multiplies its output in the head by the uninterpreted variable m2. This effectively “tags”

the output by the mapping name.


Contribution 3: Orchestra system implementation

(with Karvounarakis, Ives, and Tannen [60])

The Orchestra prototype described in this dissertation consists of two main components: a client

layer, installed at each peer, which performs functions including update translation, provenance

recordkeeping, and transaction log crawling (in order to extract edit logs); and a distributed

storage component, used to keep catalog-level information such as mapping definitions, as well

as archives of published updates.

Orchestra uses a modified distributed hash table implementation for its distributed storage

component [131, 132], hence most of the complexity of the implementation is in the client compo-

nent. This is a middleware layer implemented in around 100,000 lines of Java code that sits atop

each peer’s local DBMS. Chapter 3 describes this layer in detail, but we overview here how one

of the key implementation issues is addressed, namely the encoding and storage of provenance

information.

A basic philosophy in the Orchestra implementation is to push as much work as possible

down into the underlying DBMS, with the Java layer acting mainly as a coordinator for its activ-

ities. This allows the system to benefit from the existing DBMS facilities for storage and scalable

query processing over large data sets. At the same time, Orchestra attempts to remain rela-

tively agnostic as to the particular DBMS implementation underneath — in particular, it avoids

as much as possible the use of user-defined functions, which are generally not portable across

implementations.

In keeping with this philosophy, Orchestra uses a relational encoding scheme for provenance

that allows the information to be maintained using only SQL queries (without user-defined

functions). This is done by adopting the graph-based perspective on provenance (cf. Figure 1.3(a)),

and encoding the graph using provenance tables. For each mapping mi, we use a separate provenance

table Pi recording the edges through nodes labeled mi.

For example, to store the graph in Figure 1.3(a), Orchestra uses three edge tables P1, P2, and

P3. P3 contains one tuple for each occurrence of m3 in the graph:

P3:  b1.bid | b1.nam | u.nam1 | u.nam2 | b2.bid | b2.nam
     ⊥1     | Osci.  | Osci.  | Dros.  | ⊥1     | Dros.
     ⊥2     | Musc.  | Musc.  | Musc.  | ⊥2     | Musc.

Note that the table above contains a number of redundancies, due to the fact that the mapping

forces equivalences among some of the columns. In particular, the values in columns b1.bid and

b2.bid are always the same, as are those in b1.nam and u.nam1, and those in u.nam2 and b2.nam.

Orchestra actually stores a compressed version of the edge table with such redundancies eliminated,

producing (in the case of P3 above):

     b1.bid | b1.nam | u.nam2
     ⊥1     | Osci.  | Dros.
     ⊥2     | Musc.  | Musc.

Additional information such as key constraints can be used to compress these tables even further.

A major benefit of this encoding scheme is that the mapping tables can be constructed and

maintained using ordinary queries which are derived from the mapping definitions. Continuing

with the example, recall that mapping m3 corresponds to the Datalog rule:

(r3) B(b,n2) :- B(b,n1),U(n1,n2)

From this, we derive two rules:

(r3′) P3(b,n1,n2) :- B(b,n1),U(n1,n2)

(r3″) B(b,n2) :- P3(b,n1,n2)

The first rule joins tuples from B and U, putting the result in P3 (the compressed version); the

second takes tuples from P3 and projects the relevant columns, putting the result in B.

The translated rules for all the mappings together form a Datalog program that can be executed

to populate the peer instances and their mapping tables. Since mainstream commercial database

systems support only limited forms of recursive queries, the Orchestra client layer also includes

a Datalog engine, which uses the ordinary DBMS query and update facilities in combination with

Java control logic for iteration to fixpoint.
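
The division of labor described here (SQL statements for each rule, host-language control logic for iteration to fixpoint) can be sketched with Python and SQLite standing in for Java and the peer's DBMS. This is an assumption-laden illustration, not the Orchestra client layer: the table layouts, the use of INSERT OR IGNORE with primary keys to suppress duplicates, and the helper size are choices made for this example, and B is assumed to already hold the tuples produced by mapping m1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE B  (bid TEXT, nam TEXT, PRIMARY KEY (bid, nam));
    CREATE TABLE U  (nam1 TEXT, nam2 TEXT, PRIMARY KEY (nam1, nam2));
    CREATE TABLE P3 (bid TEXT, nam1 TEXT, nam2 TEXT, PRIMARY KEY (bid, nam1, nam2));
""")
conn.executemany("INSERT INTO B VALUES (?, ?)",
                 [("4472", "Peri."), ("⊥1", "Osci."), ("⊥2", "Musc.")])
conn.executemany("INSERT INTO U VALUES (?, ?)",
                 [("Osci.", "Dros."), ("Musc.", "Musc.")])

def size():
    """Total number of rows in B and P3, used to detect that a fixpoint was reached."""
    query = "SELECT (SELECT COUNT(*) FROM B) + (SELECT COUNT(*) FROM P3)"
    return conn.execute(query).fetchone()[0]

while True:
    before = size()
    # First derived rule: P3(b,n1,n2) :- B(b,n1), U(n1,n2)
    conn.execute("""INSERT OR IGNORE INTO P3
                    SELECT B.bid, B.nam, U.nam2 FROM B JOIN U ON B.nam = U.nam1""")
    # Second derived rule: B(b,n2) :- P3(b,n1,n2)
    conn.execute("INSERT OR IGNORE INTO B SELECT bid, nam2 FROM P3")
    if size() == before:  # no new tuples in this round
        break

p3_rows = sorted(conn.execute("SELECT * FROM P3"))
b_rows = sorted(conn.execute("SELECT * FROM B"))
```

After the loop, P3 holds the two rows of the compressed provenance table shown above, and B has gained the tuple (⊥1, Dros.).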

Contribution 4: algorithms for incremental update exchange

(with Karvounarakis, Ives, and Tannen [60])

A major technical challenge in CDSS is how to implement update exchange incrementally. In-

tuitively, the CDSS operating mode resembles deferred view maintenance across a set of views;

and indeed, as we shall see, classical techniques from view maintenance can be applied usefully

in update exchange. However, the presence of provenance information also provides a kind of

“index” that can be exploited to improve on the classical techniques.

Classical view maintenance techniques are essentially based on query reformulation. The pre-

mier examples of this approach in view maintenance are the count and DReD algorithms of [66],

each of which is based on the so-called delta rules for a Datalog program. These are another Datalog

program, obtained from the original, that can be used to compute sets of insertions or

deletions to a materialized view, given sets of insertions or deletions to base relations. We shall


discuss delta rules further in the next section (and later in Chapters 3 and 5); for now, we illustrate

with a simple example that gives the flavor of the technique.

Example 1.11. Let Q be the conjunctive query

Q(x,y) :- R(x,z),S(z,y)

which returns paths of length 2 through R and S. Suppose that R and S are modified by inserting

some tuples, stored in relations R+ and S+. The delta rules query for Q is the following union of

conjunctive queries, which computes the set Q+ of corresponding insertions in Q:

Q+(x,y) :- R+(x,z),S(z,y)

Q+(x,y) :- R(x,z),S+(z,y)

Q+(x,y) :- R+(x,z),S+(z,y)

These rules compute all combinations of joins of tuples in R and S with newly-inserted tuples

from R+ and S+, and also joins of newly-inserted tuples from R+ and S+.
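
The delta rules of Example 1.11 can be checked on a small instance under set semantics. The sketch below uses hypothetical names (join, delta_insert) and made-up data; it only illustrates that the old view together with Q+ coincides with the view recomputed from scratch.

```python
def join(R, S):
    """Q(x,y) :- R(x,z), S(z,y) under set semantics."""
    return {(x, y) for (x, z) in R for (z2, y) in S if z == z2}

def delta_insert(R, S, R_plus, S_plus):
    """Union of the three delta rules for insertions."""
    return join(R_plus, S) | join(R, S_plus) | join(R_plus, S_plus)

R, S = {(1, 2)}, {(2, 3)}
R_plus, S_plus = {(4, 2), (5, 6)}, {(2, 7), (6, 8)}

old_view = join(R, S)
q_plus = delta_insert(R, S, R_plus, S_plus)
recomputed = join(R | R_plus, S | S_plus)
```

Only the delta query needs to be evaluated when the insertions arrive; the old view is reused unchanged.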

As we show in Chapter 3, count and DReD can be adapted to perform CDSS update exchange

as well. (A priori, this fact is not obvious, because in CDSS we are not just maintaining materialized

views, but also their accompanying provenance information.)

Classical techniques such as DReD do not perform very well, however, in one important case:

incremental propagation of tuple deletions with recursive mappings. For DReD, this is because

the approach involves computing a loose over-estimate of the corresponding deletions in mate-

rialized views (using delta rules), then attempting to re-derive those tuples from other existing

tuples (hence its name: DReD stands for “delete and re-derive”). This over-estimate can be large,

leading to poor performance as many tuples may need to be re-derived.

To overcome this limitation, we observe that provenance annotations can be exploited as the

basis of a much more efficient algorithm for propagating tuple deletions in the presence of re-

cursive mappings. Intuitively, the idea is quite simple: the provenance annotations explain how

output tuples were derived from source tuples; hence the effects of a source tuple deletion can be

computed by examining the provenance annotations of output tuples. Roughly speaking, given a

deletion of a source tuple with provenance token p, we replace all occurrences of p by 0 in provenance

expressions of derived tuples, and then simplify the expressions. Derived tuples whose

provenance expressions simplify to 0 are then discarded.

Example 1.12. Suppose R(a,b) is a derived output tuple with provenance u ⊗ v, and we delete

the source tuple corresponding to u. To propagate the deletion, we evaluate 0 ⊗ v = 0 and find

that R(a,b) should be deleted as well. On the other hand, suppose R(a,c) is a derived tuple with


provenance (u ⊗ v) ⊕ r. After the deletion, its provenance expression becomes (0 ⊗ v) ⊕ r = r,

hence we see that the tuple should not be deleted because it has another derivation.
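
The substitution step of Example 1.12 can be sketched as follows for the non-recursive case. The representation is an assumption made for illustration: a provenance expression is stored in expanded form as a dict from monomials (tuples of tokens) to coefficients, and the function names are hypothetical. The extra step needed for cyclic provenance is not covered.

```python
def delete_token(provenance, token):
    """Substitute 0 for `token`: every monomial mentioning it becomes 0 and drops out of the sum."""
    return {mono: coeff for mono, coeff in provenance.items() if token not in mono}

def propagate_deletion(annotated, token):
    """Return the derived tuples that survive the deletion, with simplified provenance."""
    survivors = {}
    for tup, provenance in annotated.items():
        simplified = delete_token(provenance, token)
        if simplified:  # an empty sum equals 0: no derivation remains
            survivors[tup] = simplified
    return survivors

R = {
    ("a", "b"): {("u", "v"): 1},              # u ⊗ v
    ("a", "c"): {("u", "v"): 1, ("r",): 1},   # (u ⊗ v) ⊕ r
}
after = propagate_deletion(R, "u")
```

Here R(a,b) disappears because its only derivation used u, while R(a,c) is kept with the remaining derivation r.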

For recursive views, the procedure also involves an extra step needed to detect and delete

tuples whose provenance expressions do not seem to simplify to 0 only because of cycles in

the provenance graph. This extra step carries some cost; still, the provenance-based algorithm

turns out to be the method of choice for propagating tuple deletions in CDSS, as validated by

experimental results in Chapter 3.

Contribution 5: query containment and equivalence

Semiring-annotated relations are compatible with the standard optimizations used in DBMS query

optimizers, as we show in Chapter 2. However, for more advanced optimizations based on elim-

inating redundant joins from conjunctive queries [22] (or redundant conjunctive queries from unions

of conjunctive queries [123]), this is not necessarily the case. Indeed, this phenomenon was al-

ready observed for the case of bag semantics, where eliminating a “redundant” self-join from a

query actually changes the query’s meaning [24]. As a simple example, consider the following

two conjunctive queries:

Q1(x,y) :- R(x,y),R(x,z)

Q2(x,y) :- R(x,y)

Under set semantics, these two queries are equivalent; hence Q1 can be optimized by eliminating
the redundant join predicate, producing Q2. However, under bag semantics, the queries are not
equivalent: the “redundant” join predicate in Q1 has the effect of increasing the multiplicities of
output tuples.
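A small sketch makes the difference concrete. The two-tuple relation R below is a made-up instance, and the helper names q1, q2, and distinct are hypothetical; bags are encoded as Counter objects mapping tuples to multiplicities.

```python
from collections import Counter

def q1(r):
    """Q1(x,y) :- R(x,y), R(x,z) under bag semantics: multiply the
    multiplicities of joined tuples and sum over the projected-away z."""
    out = Counter()
    for (x, y), m1 in r.items():
        for (x2, _z), m2 in r.items():
            if x == x2:
                out[(x, y)] += m1 * m2
    return out

def q2(r):
    """Q2(x,y) :- R(x,y) under bag semantics."""
    return Counter(r)

def distinct(bag):
    """Collapse a bag to the set of tuples with nonzero multiplicity."""
    return {t for t, m in bag.items() if m > 0}

# A hypothetical bag instance: two tuples sharing the same x-value.
R = Counter({("a", "b"): 1, ("a", "c"): 1})
bag_q1, bag_q2 = q1(R), q2(R)
print(dict(bag_q1), dict(bag_q2))
```

On this instance both queries return the same set of tuples, but Q1 reports each tuple with multiplicity 2 while Q2 reports multiplicity 1.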

Motivated by these observations, we make a thorough study of the interplay between prove-

nance annotations and the fundamental problems of checking query containment and equivalence,

which underlie advanced query optimizations (Chapter 4). We study these problems for conjunctive
queries and unions of conjunctive queries, for five different kinds of provenance information

that can be captured using semiring annotations (including the provenance polynomials used

in CDSS), leading to four different hierarchies among these semantics. We show that, for each

form of provenance we consider, all four problems turn out to be decidable, and we identify the

complexity in each case (see Figure 1.5). Several of these results are surprising because, for exam-

ple, provenance polynomials are closely related to bag semantics, and containment of unions of

conjunctive queries under bag semantics is known to be undecidable [81]. We also identify a fun-

damental tradeoff between informativeness of provenance information and query optimization:

the richer the provenance information, the fewer optimizations possible. However, even for the


[Table for Figure 1.5: rows give containment (cont) and equivalence (equiv) for CQs and for UCQs; columns are the semirings B, PosBool(X), Lin(X), Why(X), Trio(X), B[X], N[X], and N. Cell entries are drawn from np-c, gi-c, “in pspace”, “undec”, and “? (Π^p_2-hard)”; the assignment of entries to cells is not recoverable from the extracted text.]

In the table, non-shaded boxes indicate contributions of this dissertation. np-c is short for np-complete.

gi-c is short for gi-complete (i.e., complete for the class of problems polynomial time reducible to graph

isomorphism).

Figure 1.5: Complexity of containment and equivalence with provenance annotations

most informative model possible in the framework, provenance polynomials, we show that equiv-

alence of unions of conjunctive queries with provenance polynomial annotations holds exactly

when bag equivalence holds, hence any query reformulation which is sound for bag semantics

(and therefore might be employed by a commercial DBMS optimizer) is sound with respect to all

forms of semiring annotations. As a corollary, we also obtain a new proof of the decidability of

bag equivalence of unions of conjunctive queries.

The theoretical results on the decidability of query equivalence with provenance annotations

play a crucial role in the subsequent development of techniques for handling CDSS evolution in

Chapter 5.

Contribution 6: ring-annotated relations for updates

(with Ives and Tannen [59])

We have already seen that a major focus of CDSS is on the efficient handling of the dynamic

aspects of data sharing, in particular, handling updates to data (update exchange) and to mappings

(mapping evolution). And we have already mentioned that for update exchange, Orchestra adapts

techniques based on delta rules [66], and improves upon them by exploiting

the presence of provenance annotations.

Mapping evolution seems on the face of it a very different problem than update exchange.

Somewhat surprisingly, then, it turns out that both problems (along with their classical cousins,

view maintenance and view adaptation) can be attacked uniformly by the same approach, where

we view each as a special case of a more general problem of optimizing queries using materialized

views [101, 23]. This is the problem which asks, given a query Q and a set V of materialized views,

to find a reformulation of Q using the materialized views which can be executed more efficiently.

Example 1.13. To illustrate update exchange / view maintenance, consider a source relation R


R′(x,y)   :-  R(x,y)
R′(x,y)   :-  R∆(x,y)
V(x,y)    :-  R(x,z),R(z,y)
(a) Materialized view definitions

V′(x,y)   :-  R′(x,z),R′(z,y)
V∆(x,y)   :-  V′(x,y)
−V∆(x,y)  :-  V(x,y)
(b) New view definition, and difference with old

V∆(x,y)   :-  R∆(x,z),R(z,y)
V∆(x,y)   :-  R(x,z),R∆(z,y)
V∆(x,y)   :-  R∆(x,z),R∆(z,y)
(c) A delta-rules style reformulation

Figure 1.6: Update propagation and query reformulation

Q(x,y)    :-  R(x,u),R(u,v),R(v,y)
Q(x,y)    :-  R(x,z),R(z,y)
(a) Old mappings

Q′(x,y)   :-  R(x,u),R(u,v),R(v,y)
Q′(x,y)   :-  R(x,y)
(b) New mappings

Q∆(x,y)   :-  Q′(x,y)
−Q∆(x,y)  :-  Q(x,y)
(c) Their difference

Q∆(x,y)   :-  R(x,y)
−Q∆(x,y)  :-  R(x,z),R(z,y)
(d) A reformulated plan

Figure 1.7: Mapping evolution and query reformulation

and a collection R∆ of updates to R (consisting of tuple insertions and deletions), which when
applied to R yields the relation R′. R′ can be thought of as a materialized view over R and R∆
whose definition is given in Figure 1.6(a). Now, consider the materialized view V over R whose
definition is also given in Figure 1.6(a). To reflect the updates made to R, we need to compute the
new version V′ of V, shown in Figure 1.6(b). Moreover, to perform this computation incrementally,
we would like to compute just the difference V∆ between V′ and V, also defined in Figure 1.6(b),
where the leading “−” indicates a difference operation. This could then be applied to V to produce
V′. Note that the definition of V∆ involves first computing V′, then subtracting V from it, a rather
inefficient approach! A better plan is the delta rules reformulation of V∆ shown in Figure 1.6(c),
which we can imagine as having been produced via a general process of reformulating V∆ using
the materialized views V and R′.

Example 1.14. To illustrate mapping evolution / view adaptation, Figure 1.7(a) shows a materialized
view definition Q which is modified (by changing its second rule) to produce the view
definition Q′ shown in Figure 1.7(b). To compute Q′ incrementally, we might prefer to compute
the difference Q∆ between Q and Q′, shown in Figure 1.7(c). As with the previous example, the
definition does not produce a very efficient plan! Instead, we can reformulate Q∆ using materialized
view Q, producing the plan shown in Figure 1.7(d).

These two examples suggest certain requirements that need to be addressed. First, they both


   R :           R∆ :           R′ :           V :           V∆ :           V′ :
a b   2       a b  −2       a d   1       a c   3       a c  −2       a c   1
b c   1       b c  −1       d c   1       a e   3       d f   3       a e   3
a d   1       e f   1       d e   3                                   d f   3
d c   1                     e f   1
d e   3
(a) Delta rules example with Z-relations

   R :           Q :                Q′ :               Q∆ :
a b   p       a d   p · r · s    a d   p · r · s    a c   −p · r
b c   r       a c   p · r        a b   p            b d   −r · s
c d   s       b d   r · s        b c   r            a b   p
                                 c d   s            b c   r
                                                    c d   s
(b) Mapping evolution example with Z[X]-relations

Figure 1.8: Representing and computing updates with ring-annotated relations

make essential use of the difference operator, not present in the framework of semiring-annotated

relations, and not usually considered in previous work on optimizing queries using material-

ized views. Second, both involve updates (tuple insertions and deletions) being represented and

manipulated as ordinary relations.

To handle these requirements, we propose the use of ring-annotated relations to represent data

and updates uniformly, and with the difference operator defined in the natural way using inverses

in the ring. We study two concrete instantiations of this framework. The first is Z-relations, where

tuples are annotated with (positive or negative) integers. This is the ring-annotated analog of

bag semantics, with positive multiplicities representing insertions and negative ones representing

deletions. The second is Z[X]-relations, the ring-annotated analog of the provenance polynomials,

where we allow positive or negative integer coefficients.

Example 1.15. Figure 1.8(a) shows an example of Z-annotated relations corresponding to Example 1.13.
Note that R∆ has tuples with positive and negative multiplicities, e.g., the annotation of
(a,b) is −2, indicating a deletion of two copies of that tuple. V∆ corresponds to the result of
evaluating the query from Figure 1.6(c) over R, R′, and V under the Z-semantics; note that
V∆ also has tuples with positive and negative multiplicities. By applying V∆ to V via a union
operation (which sums tuple multiplicities), we obtain V′.
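The arithmetic behind Figure 1.8(a) can be replayed with a short sketch. It assumes the view is the self-join V(x,y) :- R(x,z), R(z,y), which is the reading consistent with the delta rules of Figure 1.6(c) and with the annotations in Figure 1.8(a); the helper names join and union and the dictionary encoding of Z-relations are choices made for this illustration.

```python
from collections import Counter

def join(r, s):
    """(x,y) :- r(x,z), s(z,y) on Z-relations: multiply the annotations of
    joined tuples and sum over the shared variable z."""
    out = Counter()
    for (x, z), m in r.items():
        for (z2, y), n in s.items():
            if z == z2:
                out[(x, y)] += m * n
    return out

def union(*rels):
    """Union sums annotations; tuples whose annotation becomes 0 are dropped."""
    out = Counter()
    for rel in rels:
        for t, m in rel.items():
            out[t] += m
    return {t: m for t, m in out.items() if m != 0}

R = {("a", "b"): 2, ("b", "c"): 1, ("a", "d"): 1, ("d", "c"): 1, ("d", "e"): 3}
R_delta = {("a", "b"): -2, ("b", "c"): -1, ("e", "f"): 1}

R_new = union(R, R_delta)   # the updated source relation
V = union(join(R, R))       # the old materialized view

# The three delta rules, combined by union.
V_delta = union(join(R_delta, R), join(R, R_delta), join(R_delta, R_delta))

V_new = union(V, V_delta)   # apply the delta to the old view
print(V_delta, V_new)
```

Applying the delta to V gives the same relation as recomputing the view over the updated source from scratch, without ever materializing that recomputation.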

Example 1.16. Figure 1.8(b) shows an example of Z[X]-annotated relations corresponding to Example 1.14.
Note that Q∆ shows the result of evaluating the plan in Figure 1.7(d) over the Z[X]-relations
Q and R, and by applying Q∆ to Q (by union, which again sums tuple annotations), we
obtain the table Q′ shown in the figure.
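The same computation can be sketched for Z[X]-relations. Polynomials are encoded here as dictionaries from monomials (sorted tuples of variable names) to integer coefficients; this encoding and all helper names are assumptions of the sketch. The plan evaluated is the one read off Figure 1.7(d): add R and subtract the two-step join of R with itself.

```python
from collections import Counter

def p_add(*polys):
    """Sum of polynomials; monomials with coefficient 0 are dropped."""
    out = Counter()
    for p in polys:
        for mono, c in p.items():
            out[mono] += c
    return {m: c for m, c in out.items() if c != 0}

def p_mul(p, q):
    """Product of polynomials with integer coefficients."""
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return {m: c for m, c in out.items() if c != 0}

def var(name):
    return {(name,): 1}

def compose(r, s):
    """(x,y) :- r(x,z), s(z,y) on Z[X]-relations."""
    out = {}
    for (x, z), a in r.items():
        for (z2, y), b in s.items():
            if z == z2:
                out[(x, y)] = p_add(out.get((x, y), {}), p_mul(a, b))
    return out

def union(*rels):
    """Union sums annotations; tuples annotated with the zero polynomial vanish."""
    out = {}
    for rel in rels:
        for t, a in rel.items():
            out[t] = p_add(out.get(t, {}), a)
    return {t: a for t, a in out.items() if a}

def negate(rel):
    return {t: {m: -c for m, c in a.items()} for t, a in rel.items()}

R = {("a", "b"): var("p"), ("b", "c"): var("r"), ("c", "d"): var("s")}
Q = union(compose(compose(R, R), R), compose(R, R))  # old mappings
Q_delta = union(R, negate(compose(R, R)))            # reformulated plan
Q_new = union(Q, Q_delta)                            # apply the delta to Q
print(Q_new)
```

The tuples (a,c) and (b,d) cancel because their annotations p·r and r·s meet the inverses −p·r and −r·s, which is exactly what the ring structure adds over a semiring.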

The extension to ring-annotated relations and relational queries using difference gives us a

rich framework for representing and computing updates, but we still need a general procedure

for optimizing queries using materialized views in this new context. Herein lies the big surprise:


we are able to give a sound and complete procedure for enumerating query reformulations under

the Z and Z[X] annotated semantics, for relational algebra queries and views. This is in sharp

contrast to traditional set and bag semantics, where it can be shown that no such procedure exists.

The enumeration procedure is based on a simple term rewrite system, the details of which are

described in Chapter 5. We also show that equivalence of relational algebra queries under these

new semantics is decidable (in pspace). This also stands in contrast to the situation for set and

bag semantics, where allowing the difference operator in queries leads to undecidability of equivalence.
Thus, the ring-annotated relations framework not only gives us benefits in expressiveness (for
representing and computing updates), but is also surprisingly tractable.

Contribution 7: provenance for XML and nested relations

(with Foster and Tannen [48])

The Orchestra prototype in its current incarnation handles only relational data. However, other

kinds of data, such as XML and nested relational data, are also used commonly in the sciences

and other collaborative domains. Thus an important longer-term goal of CDSS involves extending

the paradigm to work with these richer data models as well. As a key first step towards this

goal, we develop an extension of the theoretical framework of semiring annotations to unordered

XML (UXML) and nested relational data. For UXML, this involves decorating XML elements in

a tree with semiring annotations, and extending the semantics of a positive fragment of XML’s

premier query language, XQuery, to operate on semiring-annotated, unordered trees. For nested

relations, we decorate both tuples and elements of nested collections with semiring annotations,

and extend the semantics of the positive nested relational calculus to work with these annotated

nested relations.

Example 1.17. To illustrate the flavor of this model for XML, Figure 1.9(a) shows an example of an
XML tree, some of whose elements are annotated with provenance polynomials. Now consider
the XPath query Q ≝ $T//c, which picks out the set of subtrees of elements of $T whose label is c.
The result of evaluating Q on the source tree in Figure 1.9(a) is shown in Figure 1.9(b). Note that
the annotation of the root element of the first tree in the output is x₁·y₃ + y₁·y₂. This indicates
that the tree was produced in two different ways: one by traversing a path through elements in
the input tree with annotations x₁ and y₃, and the other via a path through elements annotated
y₁ and y₂.

We show that this semantics is conservative with respect to the semantics of positive relational

algebra on semiring-annotated relations, and that key properties underlying the relational version

of the framework continue to hold for annotated XML and nested relations. Hence, we give strong


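[Figure 1.9(a): drawing of the source tree $T, whose elements carry the labels a, b, c, and d and the annotations x₁, x₂, y₁, y₂, and y₃; the tree structure is not recoverable from the extracted text.]
(a) A provenance-annotated XML tree

[Figure 1.9(b): drawing of the set of trees returned by the query; the root of the first tree is labeled c and annotated x₁·y₃ + y₁·y₂.]
(b) XPath query result

Figure 1.9: Semiring-annotated XML example.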

evidence that provenance annotations for XML and nested relations fundamentally “make sense,”

paving the way towards a future extension of CDSS implementations to these data models. We

also propose novel applications of independent interest, including incomplete XML and XML with

access control policies, which are enabled by the framework.

1.3 Roadmap

The remainder of the thesis is structured as follows:

• Chapter 2 introduces semiring-annotated relations and develops the formal foundations of

provenance needed for CDSS. We also discuss connections with bag (multiset) semantics,

incomplete and probabilistic databases, and other models of provenance such as lineage [34]

and why-provenance [17] which can be captured using semiring annotations.

• Chapter 3 describes in more detail the formal model and prototype implementation of the

Orchestra CDSS, including the implementation of provenance storage, incremental update

exchange, and trust policies.

• Chapter 4 examines the fundamental issues of query containment and equivalence for

provenance-annotated relations. These problems turn out to be highly sensitive to the

presence of provenance information. We give positive decidability results for containment/equivalence
of conjunctive queries and unions of conjunctive queries for several forms

of provenance information. We also highlight connections with the same problems for set

and bag semantics.

• Chapter 5 explains how incremental view maintenance, view adaptation, and mapping evo-

lution can be seen as special cases of a more general problem of answering queries using

materialized views. We argue for signed ring-annotated relations as a natural representation

of both data and updates to data, and for the incorporation of the difference operator to en-

able more rewritings. We show that, surprisingly, equivalence of relational algebra queries is


decidable for two instantiations of this framework relevant to CDSS and traditional database

systems. We also present a sound and complete procedure for reformulating queries with

materialized views under these semantics.

• Chapter 6 extends the semiring provenance model of Chapter 2 to work with unordered

XML and nested relational data. We show that the key properties of Datalog queries on

semiring-annotated relations carry over to (positive) XQuery queries on semiring-annotated

unordered XML, and to nested relational calculus queries on semiring-annotated complex

values. This paves the way towards a future extension of CDSS to XML and nested relational

data.

• We discuss related work in Chapter 7, and we conclude in Chapter 8.


Chapter 2

Provenance Semirings

A major component of the vision for CDSS [82] is the recording of data provenance as data and

updates are transformed according to schema mappings. We begin, therefore, by developing a

suitable formal model of data provenance for CDSS. The particular provenance model we use

emerges as a by-product of the development of a more general theory of queries on annotated

relations.

Several forms of annotated relations have appeared in various contexts in the literature. Query

answering in these settings involves generalizing the relational algebra (RA) to perform corre-

sponding operations on the annotations. The seminal paper in incomplete databases [79], for ex-

ample, generalized RA to conditional tables (c-tables), where relations are annotated with Boolean

formulas. In probabilistic databases, [50] and [138] generalized RA to event tables, also a form

of annotated relations. In data warehousing, [34] and [35] compute lineages for tuples in the

output of queries, in effect generalizing RA to computations on relations annotated with sets of

contributing tuples. Finally, RA on bag semantics can be viewed as a generalization to annotated

relations, where a tuple’s annotation is a number representing its multiplicity. We shall see examples

of all of these kinds of annotated relations in Section 2.1.

We observe that in all of these cases, the calculations with annotations are strikingly similar.

This suggests looking for an algebraic structure on annotations that captures the above as particu-

lar cases. We propose using commutative semirings for this purpose. In fact, we can show that the

laws of commutative semirings are forced by certain expected identities in RA. Having identified

commutative semirings as the right algebraic structure, we argue that a symbolic representation

of semiring calculations is just what is needed to record, document, and track RA querying from

input to output for applications such as CDSS which require rich provenance information. It is

a standard philosophy in algebra that such symbolic representations form the most general such


structure. In the case of commutative semirings, just as for rings, the symbolic representation is

that of polynomials. We therefore propose to use polynomials to capture provenance. Next we

look to extend our approach to recursive Datalog queries (needed in CDSS when mappings have

cycles). To achieve this we combine semirings with fixed point theory.

The outline of this chapter is as follows:

• We introduce K-relations, in which tuples are annotated (tagged) with elements from K. We

define a generalized positive algebra on K-relations and argue that K must be a commutative

semiring (Section 2.2).

• For recording data provenance, we propose polynomials with integer coefficients, and we

show that positive algebra semantics for any commutative semiring factors through the

provenance semantics (Section 2.3).

• We discuss other forms of provenance information, previously proposed in the literature,

which can be captured using semirings, and show that they can be organized in a prove-

nance hierarchy (Section 2.4).

• We extend these results to Datalog queries by considering semirings with fixed points (Sec-

tion 2.5).

• For the (possibly infinite) provenance in Datalog query answers we propose semirings of

formal power series that are shown to be generated by finite algebraic systems of fixed

point equations (Section 2.6).

• We give algorithms for deciding the finiteness of these formal power series, for computing

them when finite, and for computing the coefficient of an arbitrary monomial otherwise

(Section 2.7).

• We show how to specialize our algorithms for computing full Datalog answers when K is

a finite distributive lattice, in particular for incomplete and probabilistic databases (Sec-

tion 2.8).

2.1 Queries on Annotated Relations

We motivate our study by considering three important examples of query answering on annotated

relations and highlighting the similarities between them.

The first example comes from the study of incomplete databases, where a simple representation

system is the maybe-table [124, 63], in which tuples annotated with a ‘?’ may or may not be in


a  b  c               a  b  c
----------            ----------
a  b  c   ?           a  b  c   b₁
d  b  e   ?           d  b  e   b₂
f  g  e   ?           f  g  e   b₃
   (a)                   (b)

{ ∅,  {(a,c)},  {(d,e)},  {(f,e)},  {(a,c),(a,e),(d,c),(d,e)},  {(d,e),(f,e)},  {(a,c),(f,e)},  {(a,c),(a,e),(d,c),(d,e),(f,e)} }
   (c)  (each world is a relation over the attributes a, c)

Figure 2.1: A maybe-table and a query result

a  c
-------
a  c   (b₁ ∧ b₁) ∨ (b₁ ∧ b₁)
a  e   b₁ ∧ b₂
d  c   b₁ ∧ b₂
d  e   (b₂ ∧ b₂) ∨ (b₂ ∧ b₂) ∨ (b₂ ∧ b₃)
f  e   (b₃ ∧ b₃) ∨ (b₃ ∧ b₃) ∨ (b₂ ∧ b₃)
   (a)

a  c
-------
a  c   b₁
a  e   b₁ ∧ b₂
d  c   b₁ ∧ b₂
d  e   b₂
f  e   b₃
   (b)

Figure 2.2: Result of Imielinski-Lipski computation

the instance, independently from each other. An example is given in Figure 2.1(a). Such a table

represents a set of possible worlds, and the answer to a query over such a table is the set of instances

obtained by evaluating the query over each possible world. Thus, given a query like

\[ Q \;\stackrel{\mathrm{def}}{=}\; \pi_{ac}\bigl(\pi_{ab}R \bowtie \pi_{bc}R \;\cup\; \pi_{ac}R \bowtie \pi_{bc}R\bigr) \]

the query result is the set of possible worlds shown in Figure 2.1(c). The second table in the query

result, for example, is obtained by evaluating Q over R with optional tuples (d,b,e) and (f, g,e)

omitted.

Unfortunately, this set of possible worlds cannot itself be represented by a maybe-table, in-

tuitively because whenever the tuples (a,e) and (d,c) appear, then so do (a,c) and (d,e), and

maybe-tables cannot represent such a dependency.

To overcome such limitations, Imielinski and Lipski [79] introduced c-tables, where tuples are

annotated with Boolean formulas called conditions. A maybe-table is a simple kind of c-table,

where the annotations are distinct Boolean variables, as shown in Figure 2.1(b). In contrast to

weaker representation systems, c-tables are expressive enough to be closed under RA queries, and

the main result of [79] is an algorithm for answering RA queries on c-tables, producing another

c-table as a result. On our example, this algorithm produces the c-table shown in Figure 2.2(a),

which can be simplified to the c-table shown in Figure 2.2(b); this c-table represents exactly the

set of possible worlds shown in Figure 2.1(c).
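The claim that the simplified c-table represents exactly these possible worlds can be checked by brute force. In the sketch below the conditions of Figure 2.2(b) are written as Python functions over a valuation of the variables b1, b2, b3; this encoding and the name possible_worlds are assumptions of the illustration, not the Imielinski-Lipski algorithm itself.

```python
from itertools import product

# Conditions of the simplified c-table, one per output tuple. Each condition
# takes a valuation (a dict from variable names to Booleans).
c_table = {
    ("a", "c"): lambda v: v["b1"],
    ("a", "e"): lambda v: v["b1"] and v["b2"],
    ("d", "c"): lambda v: v["b1"] and v["b2"],
    ("d", "e"): lambda v: v["b2"],
    ("f", "e"): lambda v: v["b3"],
}

def possible_worlds(table, variables):
    """Enumerate all valuations and collect the instance each one selects."""
    worlds = set()
    for bits in product([False, True], repeat=len(variables)):
        valuation = dict(zip(variables, bits))
        world = frozenset(t for t, cond in table.items() if cond(valuation))
        worlds.add(world)
    return worlds

worlds = possible_worlds(c_table, ["b1", "b2", "b3"])
for world in sorted(worlds, key=lambda w: (len(w), sorted(w))):
    print(sorted(world))
```

The enumeration yields eight worlds, and in every world containing (a,e) the tuples (a,c) and (d,e) are present as well, which is the dependency a maybe-table cannot express.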

Another kind of table with annotations is a multiset or bag. In this case, the annotations are


a  b  c               a  c
----------            -------
a  b  c   2           a  c   2·2 + 2·2 = 8
d  b  e   5           a  e   2·5 = 10
f  g  e   1           d  c   2·5 = 10
                      d  e   5·5 + 5·5 + 5·1 = 55
                      f  e   1·1 + 1·1 + 5·1 = 7
   (a)                   (b)

Figure 2.3: Bag semantics example

a  b  c               E    Pr             a  c
----------            ---------           -------
a  b  c   x           x    0.6            a  c   x
d  b  e   y           y    0.5            a  e   x ∩ y
f  g  e   z           z    0.1            d  c   x ∩ y
                                          d  e   y
                                          f  e   z
          (a)                                (b)

Figure 2.4: Probabilistic example

natural numbers which represent the multiplicity of the tuple in the multiset. (A tuple not listed

in the table has multiplicity 0.) Query answering on such tables involves calculating not just the

tuples in the output, but also their multiplicities.

For example, consider the multiset R shown in Figure 2.3(a). Then Q applied to R, where Q is

the same query from before, produces the multiset shown in Figure 2.3(b). Note that for projection

and union we add multiplicities while for join we multiply them. There is a striking similarity

between the arithmetic calculations we do here for multisets, and the Boolean calculations for the

c-table.
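To make the parallel explicit, the two calculations for the output tuple (d,e) can be placed side by side. The terms below are read off Figures 2.2(a) and 2.3(b): each disjunct (respectively summand) corresponds to one way of deriving (d,e), and each conjunction (respectively product) to one join.

```latex
\begin{align*}
\text{c-table:} &\quad (b_2 \wedge b_2) \vee (b_2 \wedge b_2) \vee (b_2 \wedge b_3) \\
\text{bag:}     &\quad 5 \cdot 5 \;+\; 5 \cdot 5 \;+\; 5 \cdot 1 \;=\; 55
\end{align*}
```

Replacing ∨ by + and ∧ by ·, and the condition of each source tuple by its multiplicity, turns the first line into the second.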

A third example comes from the study of probabilistic databases, where tuples are associated

with values from [0,1] which represent the probability that the tuple is present in the database.

Answering queries over probabilistic tables requires computing the correct probabilities for tuples

in the output. To do this, Fuhr and Rölleke [50] and Zimányi [138] introduced event tables, where
tuples are annotated with probabilistic events, and they gave a query answering algorithm for
computing the events associated with tuples in the query output.¹

Figure 2.4(a) shows an example of an event table with associated event probabilities (e.g., x

represents the event that (a,b,c) appears in the instance, and x,y,z are assumed independent).

Considering again the same query Q as above, the Fuhr-Rölleke-Zimányi query answering algorithm
produces the event table shown in Figure 2.4(b). Note again the similarity between this

table and the example earlier with c-tables. The probabilities of tuples in the output of the query

can be computed from this table using the independence of x and y.
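As an illustration, the output probabilities can be obtained from the event expressions by summing over all combinations of the independent events. The brute-force enumeration below is exponential in the number of events and is meant only as a sketch; the encoding of events as Python functions and the name probability are assumptions, and this is not the cited query answering algorithm.

```python
from itertools import product

# Probabilities of the independent base events.
prob = {"x": 0.6, "y": 0.5, "z": 0.1}

# Event expressions annotating the output tuples, as functions of a world
# (a dict saying which base events occur).
events = {
    ("a", "c"): lambda w: w["x"],
    ("a", "e"): lambda w: w["x"] and w["y"],
    ("d", "c"): lambda w: w["x"] and w["y"],
    ("d", "e"): lambda w: w["y"],
    ("f", "e"): lambda w: w["z"],
}

def probability(event, prob):
    """Sum the weights of all worlds in which `event` holds, using
    independence of the base events to weight each world."""
    names = sorted(prob)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        if event(world):
            weight = 1.0
            for name in names:
                weight *= prob[name] if world[name] else 1.0 - prob[name]
            total += weight
    return total

output_probs = {t: probability(e, prob) for t, e in events.items()}
for t, p in output_probs.items():
    print(t, round(p, 3))
```

For instance, (a,e) is annotated with x ∩ y, so its probability is 0.6 · 0.5 = 0.3.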

2.2 Positive Relational Algebra

In this section we attempt to unify the examples above by considering generalized relations in

which the tuples are annotated (tagged) with information of various kinds. Then, we will define

¹ The Fuhr-Rölleke-Zimányi algorithm is a general-purpose intensional algorithm. Dalvi and Suciu [36] give a sound
and complete algorithm which returns a “safe” query plan, if one exists, which may be used to answer the query correctly
via a more efficient extensional algorithm. For instance, our example does not admit a safe plan.


a generalization of the positive relational algebra (RA+) to such tagged-tuple relations. The

examples in Section 2.1 will turn out to be particular cases.

We use here the named perspective [2] of the relational model in which tuples are functions

t : U → D with U a finite set of attributes and D a domain of values. We fix the domain D for

the time being and we denote the set of all such U-tuples by U-Tup. (Usual) relations over U are

finite subsets of U-Tup.

A notationally convenient way of working with tagged-tuple relations is to model tagging by

a function on all possible tuples, with those tuples not considered to be “in” the relation tagged

with a special value. For example, the usual set-theoretic relations correspond to functions that

map U-Tup to B = {true,false} with the tuples in the relation tagged by true and those not in the

relation tagged by false.

Definition 2.1. Let K be a set of annotations (or tags) containing a distinguished element 0. A
K-relation over a finite set of attributes U is a function R : U-Tup → K such that its support,
defined by supp(R) ≝ {t | R(t) ≠ 0}, is finite. A K-instance I is a mapping which associates
each predicate symbol R with a K-relation R^I.

In generalizing RA+ we will need to assume more structure on the set of tags. To deal with
selection we assume that the set K contains two distinct values 0 ≠ 1 which denote “out of”
and “in” the relation, respectively. To deal with union and projection and therefore to combine
different tags of the same tuple into one tag we assume that K is equipped with a binary operation
“⊕.” To deal with natural join (hence intersection and selection) and therefore to combine the tags
of joinable tuples we assume that K is equipped with another binary operation “⊗.”

Definition 2.2. Let (K,⊕,⊗,0,1) be an algebraic structure with two binary operations and two
distinguished elements. The denotation ⟦Q⟧^I of a positive relational algebra query Q on a
K-instance I, a K-relation of finite support, is defined inductively as follows:

empty relation: For any set of attributes U, there is ∅ : U-Tup → K such that ⟦∅⟧^I(t) ≝ 0.

predicate: If R is a predicate symbol with attributes U then ⟦R⟧^I : U-Tup → K is simply R^I.

union: If ⟦Q₁⟧^I, ⟦Q₂⟧^I : U-Tup → K then ⟦Q₁ ∪ Q₂⟧^I : U-Tup → K is defined by

\[ [\![Q_1 \cup Q_2]\!]^{I}(t) \;\stackrel{\mathrm{def}}{=}\; [\![Q_1]\!]^{I}(t) \oplus [\![Q_2]\!]^{I}(t) \]

Notice that if ⟦Q₁⟧^I and ⟦Q₂⟧^I have finite support then so has ⟦Q₁ ∪ Q₂⟧^I, provided we have
0 ⊕ 0 = 0.

projection: If ⟦Q₁⟧^I : U-Tup → K and V ⊆ U then ⟦π_V Q₁⟧^I : V-Tup → K is defined by

\[ [\![\pi_V Q_1]\!]^{I}(t) \;\stackrel{\mathrm{def}}{=}\; \sum_{t = t' \text{ on } V \text{ and } [\![Q_1]\!]^{I}(t') \neq 0} [\![Q_1]\!]^{I}(t') \]

Here t = t′ on V means t′ is a U-tuple whose restriction to V is the same as the V-tuple t.
Note also that the sum is finite, assuming ⟦Q₁⟧^I has finite support; and ⟦π_V Q₁⟧^I also has
finite support, again provided 0 ⊕ 0 = 0.

selection: If ⟦Q₁⟧^I : U-Tup → K and the selection predicate P maps each U-tuple to either 0 or 1
then ⟦σ_P Q₁⟧^I : U-Tup → K is defined by

\[ [\![\sigma_P Q_1]\!]^{I}(t) \;\stackrel{\mathrm{def}}{=}\; [\![Q_1]\!]^{I}(t) \otimes P(t) \]

Which {0,1}-valued functions are used as selection predicates is left unspecified, except that
we assume that false (the constantly 0 predicate) and true (the constantly 1 predicate) are
always available. Note that if ⟦Q₁⟧^I has finite support, then so does ⟦σ_P Q₁⟧^I, provided that
0 ⊗ 0 = 0 and 0 ⊗ 1 = 0.

natural join: If ⟦Qᵢ⟧^I : Uᵢ-Tup → K for i = 1,2 then ⟦Q₁ ⋈ Q₂⟧^I is the K-relation over U₁ ∪ U₂
defined by

\[ [\![Q_1 \bowtie Q_2]\!]^{I}(t) \;\stackrel{\mathrm{def}}{=}\; [\![Q_1]\!]^{I}(t_1) \otimes [\![Q_2]\!]^{I}(t_2) \]

where t₁ = t on U₁ and t₂ = t on U₂ (recall that t is a U₁ ∪ U₂-tuple). Notice that if ⟦Q₁⟧^I
and ⟦Q₂⟧^I have finite support, then so does ⟦Q₁ ⋈ Q₂⟧^I, provided a ⊗ 0 = 0 ⊗ a = 0, for all
a ∈ K.

renaming: If ⟦Q₁⟧^I : U-Tup → K and β : U → U′ is a bijection then ⟦ρ_β Q₁⟧^I is a K-relation over U′
defined by

\[ [\![\rho_\beta Q_1]\!]^{I}(t) \;\stackrel{\mathrm{def}}{=}\; [\![Q_1]\!]^{I}(t \circ \beta) \]

and notice that if ⟦Q₁⟧^I has finite support then so does ⟦ρ_β Q₁⟧^I.

This definition generalizes the definitions of RA+ for the motivating examples we saw. Indeed,
for (B,∨,∧,false,true) we obtain the usual RA+ with set semantics. For (N,+,·,0,1) it is RA+
with bag semantics.
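The definition translates almost directly into code. The sketch below covers only union, projection, and natural join (selection and renaming are omitted for brevity) and evaluates the query Q of Section 2.1 under two semirings. The class KRel, the Semiring tuple, and the function names are hypothetical choices for this illustration.

```python
from collections import namedtuple

# A commutative semiring packaged as (plus, times, zero, one).
Semiring = namedtuple("Semiring", "plus times zero one")
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)

class KRel:
    """A K-relation: attribute names plus a dict from value tuples to
    annotations. Tuples absent from the dict are annotated with zero."""
    def __init__(self, attrs, rows, k):
        self.attrs = tuple(attrs)
        self.k = k
        self.rows = {t: a for t, a in rows.items() if a != k.zero}

def project(r, attrs):
    """Combine with plus the annotations of tuples that agree on `attrs`."""
    idx = [r.attrs.index(a) for a in attrs]
    rows = {}
    for t, a in r.rows.items():
        key = tuple(t[i] for i in idx)
        rows[key] = r.k.plus(rows[key], a) if key in rows else a
    return KRel(attrs, rows, r.k)

def union(r, s):
    """Combine with plus the annotations of the same tuple in r and s."""
    s = project(s, r.attrs)  # only reorders columns; no tuples are merged
    rows = dict(r.rows)
    for t, a in s.rows.items():
        rows[t] = r.k.plus(rows[t], a) if t in rows else a
    return KRel(r.attrs, rows, r.k)

def join(r, s):
    """Combine with times the annotations of joinable tuples."""
    attrs = r.attrs + tuple(a for a in s.attrs if a not in r.attrs)
    shared = [a for a in s.attrs if a in r.attrs]
    rows = {}
    for t1, a1 in r.rows.items():
        d1 = dict(zip(r.attrs, t1))
        for t2, a2 in s.rows.items():
            d2 = dict(zip(s.attrs, t2))
            if all(d1[a] == d2[a] for a in shared):
                merged = {**d1, **d2}
                rows[tuple(merged[a] for a in attrs)] = r.k.times(a1, a2)
    return KRel(attrs, rows, r.k)

def query(r):
    """The query Q of Section 2.1."""
    left = join(project(r, ("a", "b")), project(r, ("b", "c")))
    right = join(project(r, ("a", "c")), project(r, ("b", "c")))
    return project(union(left, right), ("a", "c"))

TUPLES = [("a", "b", "c"), ("d", "b", "e"), ("f", "g", "e")]
bag = KRel("abc", dict(zip(TUPLES, [2, 5, 1])), NAT)
plain = KRel("abc", dict(zip(TUPLES, [True, True, True])), BOOL)
bag_result = query(bag).rows
set_result = query(plain).rows
print(bag_result)
```

Only the semiring argument changes between the two evaluations; the query code is shared, which is the point of the generalization.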

For the Imielinski-Lipski algebra on c-tables we consider the set of Boolean expressions over

some set B of variables which are positive, i.e., they involve only disjunction, conjunction, and

constants for true and false. Then we identify those expressions that yield the same truth-value

for all Boolean assignments of the variables in B.² Denoting by PosBool(B) the result and applying
Definition 2.2 to the structure (PosBool(B),∨,∧,false,true) produces exactly the Imielinski-Lipski
algebra. Finally, for (P(Ω),∪,∩,∅,Ω) we obtain the Fuhr-Rölleke-Zimányi RA+ on event tables.

² In order to permit simplifications; it turns out that this is the same as transforming using the axioms of distributive
lattices [33].

These four structures are examples of commutative semirings.

Definition 2.3. An algebraic structure (K,⊕,⊗,0,1) is called a commutative semiring if (K,⊕,0)

and (K,⊗,1) are commutative monoids, ⊗ is distributive over ⊕, and ∀a ∈ K, 0⊗ a = a ⊗ 0 = 0.

Further evidence for requiring K to form such a semiring is given by

Proposition 2.4. The following RA identities:

• union is associative, commutative and has identity ∅;

• join is associative, commutative and distributive over union;

• projections and selections commute with each other as well as with unions and joins (when applicable);

• σfalse(Q1) = ∅ and σtrue(Q1) = Q1.

hold for the positive algebra on K-relations if and only if (K,⊕,⊗,0,1) is a commutative semiring.

Glaringly absent from the list of relational identities are the idempotence of union and of

(self-)join. Indeed, these fail for the bag semantics, an important particular case of our general

treatment.
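A quick sanity check illustrates both points: the Boolean and natural-number structures satisfy the semiring laws, but only the Boolean one is idempotent. The check below runs over a finite sample of elements, so for the natural numbers it is evidence rather than a proof; the function names are hypothetical.

```python
from itertools import product

def is_commutative_semiring(plus, times, zero, one, samples):
    """Check the laws of Definition 2.3 on a finite sample of elements."""
    for a, b, c in product(samples, repeat=3):
        if plus(plus(a, b), c) != plus(a, plus(b, c)):
            return False
        if times(times(a, b), c) != times(a, times(b, c)):
            return False
        if times(a, plus(b, c)) != plus(times(a, b), times(a, c)):
            return False
    for a, b in product(samples, repeat=2):
        if plus(a, b) != plus(b, a) or times(a, b) != times(b, a):
            return False
    for a in samples:
        if plus(a, zero) != a or times(a, one) != a or times(a, zero) != zero:
            return False
    return True

def is_idempotent(op, samples):
    """Does a op a = a hold for every sampled element?"""
    return all(op(a, a) == a for a in samples)

def b_or(a, b):
    return a or b

def b_and(a, b):
    return a and b

def n_add(a, b):
    return a + b

def n_mul(a, b):
    return a * b

bools = [False, True]
nats = list(range(6))

print(is_commutative_semiring(b_or, b_and, False, True, bools))  # set semantics
print(is_commutative_semiring(n_add, n_mul, 0, 1, nats))         # bag semantics
print(is_idempotent(b_or, bools), is_idempotent(n_add, nats))    # union
print(is_idempotent(b_and, bools), is_idempotent(n_mul, nats))   # self-join
```

Under bag semantics, 1 + 1 = 2 already breaks idempotence of union, and 2 · 2 = 4 breaks idempotence of self-join.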

Any function h : K₁ → K₂ which maps 0_{K₁} to 0_{K₂} can be used to transform K₁-relations
to K₂-relations simply by applying h to each tag (note that the support may shrink but never
increase). Abusing the notation a bit we denote the resulting transformations from K₁-relations
to K₂-relations and from K₁-instances to K₂-instances also by h. The RA+ operations we have
defined work nicely with semiring structures:

Proposition 2.5. Let h : K₁ → K₂ and assume that K₁, K₂ are commutative semirings. The transformation
given by h from K₁-relations to K₂-relations commutes with any RA+ query Q and K₁-instance I,
⟦Q⟧^{h(I)} = h(⟦Q⟧^I), if and only if h is a semiring homomorphism.

Example 2.6. If B is a set of Boolean variables, any Boolean valuation ν : B → 𝔹 extends uniquely
to a semiring homomorphism Eval_ν : PosBool(B) → 𝔹 such that Eval_ν(b) = ν(b). The possible
worlds of a Boolean c-table T (a PosBool(B)-annotated relation) can be formally defined as follows:

\[ \mathrm{Rep}(T) \;\stackrel{\mathrm{def}}{=}\; \{ \mathrm{Eval}_\nu(T) \mid \nu \text{ is a Boolean valuation } \nu : B \to \mathbb{B} \} \]

Using Proposition 2.5, we see that for any positive relational query Q ∈ RA+, Boolean c-table T,
and Boolean valuation ν, we have

\[ [\![Q]\!]^{\mathrm{Eval}_\nu(T)} = \mathrm{Eval}_\nu([\![Q]\!]^{T}) \]


a  b  c               a  c                  a  c
----------            -------               -------
a  b  c   p           a  c   {p}            a  c   2p²
d  b  e   r           a  e   {p,r}          a  e   pr
f  g  e   s           d  c   {p,r}          d  c   pr
                      d  e   {r,s}          d  e   2r² + rs
                      f  e   {r,s}          f  e   2s² + rs
   (a)                   (b)                   (c)

Figure 2.5: Lineage and provenance polynomials

The correctness of the Imielinski-Lipski algebra on c-tables with respect to the possible worlds

semantics follows as a corollary.

Example 2.7. Commercial relational database systems typically implement bag semantics, but in-

clude a DISTINCT operator which may be used to eliminate duplicates from a query result. This

operator can be viewed as the semiring homomorphism δ : N → B which maps 0 to false and

everything else to true. Using Proposition 2.5, we see that for any bag instance I and positive

relational query Q ∈ RA+, we have

$$\llbracket Q \rrbracket^{\delta(I)} = \delta(\llbracket Q \rrbracket^{I})$$

Hence eliminating duplicates as a last step yields (not surprisingly) the same answer that would

have been obtained by first eliminating duplicates from source tables, then evaluating the query

under set semantics.
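The commutation in Example 2.7 can be checked on a small case. The following Python sketch is illustrative only: the bag instance `bag`, the choice of query (a projection of a self-join), and all function names are assumptions made for this example and do not come from the text.

```python
def path2(rel, plus, times):
    """Evaluate Q(x, y) :- R(x, z), R(z, y) on a K-relation.

    `rel` maps tuples to non-zero annotations; `plus` and `times` are the
    semiring operations used to combine them.
    """
    out = {}
    for (x, z1), k1 in rel.items():
        for (z2, y), k2 in rel.items():
            if z1 == z2:
                k = times(k1, k2)
                out[(x, y)] = plus(out[(x, y)], k) if (x, y) in out else k
    return out


def delta(rel):
    """Duplicate elimination: multiplicity 0 becomes false, the rest true."""
    return {t: True for t, n in rel.items() if n != 0}


# Hypothetical bag instance: tuple -> multiplicity.
bag = {("a", "b"): 2, ("b", "c"): 3, ("b", "d"): 1, ("a", "c"): 5, ("c", "d"): 1}

# Evaluate under bag semantics, then eliminate duplicates ...
bag_answer = path2(bag, lambda m, n: m + n, lambda m, n: m * n)
distinct_last = delta(bag_answer)

# ... versus eliminate duplicates first, then evaluate under set semantics.
distinct_first = path2(delta(bag), lambda a, b: a or b, lambda a, b: a and b)

print(bag_answer)
print(distinct_last == distinct_first)
```

Both orders give the same set of answers, as Proposition 2.5 predicts for the homomorphism δ.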

2.3 Polynomials for Provenance

Lineage was defined in [34, 35] as a way of relating the tuples in a query output to the tuples in

the query input that “contribute” to them. The lineage of a tuple t in a query output is in fact the

set of all contributing input tuples.

Computing the lineage for queries in RA+ turns out to be exactly Definition 2.2 for the semiring [13] (Lin(X), +, ·, ⊥, ∅), where X consists of the ids of the tuples in the input instance, Lin(X) = P(X) ∪ {⊥}, ⊥ + S = S + ⊥ = S, ⊥ · S = S · ⊥ = ⊥, and S + T = S · T = S ∪ T if S, T ≠ ⊥.

For example, we consider the same tuples as in relation R used in the examples of Section 2.1 but

now we tag them with their own ids p,r,s, as shown in Figure 2.5(a). The resulting R can be seen

as a Lin(p,r,s)-relation by replacing p with {p}, etc. Applying the query Q from Section 2.1 to R

we obtain according to Definition 2.2 the Lin(p,r,s)-relation shown in Figure 2.5(b).

For example, in the query result in Figure 2.5(b) (f,e) and (d,e) have the same lineage, the

input tuples with id r and s. However, the query can also calculate (f,e) from s alone and (d,e)


from r alone. In a provenance application in which one of r or s is perhaps less trusted or less usable than the other, the effect can be different on (f,e) than on (d,e), and this cannot be detected by lineage. This example illustrates the limitations of lineage (also recognized in [25]). It seems that we need to know not just which input tuples contribute but also how they contribute.

On the other hand, by using the different operations of the semiring, Definition 2.2 appears to

fully “document” how an output tuple is produced. To record the documentation as tuple tags

we need to use a semiring of symbolic expressions. In the case of semirings, as in ring theory,

these are the polynomials.

Definition 2.8. Let X be a set of indeterminates, which can be thought of as tuple identifiers.

The positive algebra provenance semiring for X is the semiring of polynomials with variables

(a.k.a. indeterminates) from X and coefficients from N, with the operations defined as usual³: (N[X], +, ·, 0, 1). We refer to the elements of N[X] as provenance polynomials.

Example 2.9. Start again from the relation R in Figure 2.5(a) in which tuples are tagged with their

own id. R can be seen as an N[p,r,s]-relation. Applying to R the query Q from Section 2.1

and doing the calculations in the provenance semiring we obtain the N[p,r,s]-relation shown

in Figure 2.5(c). The provenance of (f,e) is 2s² + rs, which can be “read” as follows: (f,e) is

computed by Q in three different ways; two of them use just the input tuple s, but twice; the third

uses input tuples r and s. We also see that the provenance of (d,e) is different from that of (f,e)

and we see how it is different!
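The contrast between lineage and provenance polynomials can be reproduced with a short Python sketch. Several things here are assumptions: the query Q is not restated in this section, so the code assumes it has the form π_ac((π_ab R ⋈ π_bc R) ∪ (π_ac R ⋈ π_bc R)), which does reproduce the annotations of Figure 2.5(b) and (c); the dictionary encoding of polynomials and all function names are likewise ours.

```python
from collections import Counter


def project(rel, idxs, plus):
    """Project a K-relation onto the given column positions, adding tags."""
    out = {}
    for t, k in rel.items():
        key = tuple(t[i] for i in idxs)
        out[key] = plus(out[key], k) if key in out else k
    return out


def query(R, plus, times):
    """Assumed Q: pi_ac((pi_ab R join pi_bc R) union (pi_ac R join pi_bc R))."""
    ab = project(R, (0, 1), plus)
    bc = project(R, (1, 2), plus)
    ac = project(R, (0, 2), plus)
    joined = {}  # tuples over (a, b, c); union is pointwise plus

    def add(t, k):
        joined[t] = plus(joined[t], k) if t in joined else k

    for (a, b), k1 in ab.items():
        for (b2, c), k2 in bc.items():
            if b == b2:
                add((a, b, c), times(k1, k2))
    for (a, c), k1 in ac.items():
        for (b, c2), k2 in bc.items():
            if c == c2:
                add((a, b, c), times(k1, k2))
    return project(joined, (0, 2), plus)


def lin_union(s, t):
    """Lineage: + and . are both union (no tag equals ⊥ in this example)."""
    return s | t


# N[X]: a polynomial is a dict {monomial: coefficient}; a monomial is a
# sorted tuple of (variable, exponent) pairs.
def poly_var(x):
    return {((x, 1),): 1}


def poly_add(p, q):
    out = dict(p)
    for m, c in q.items():
        out[m] = out.get(m, 0) + c
    return out


def poly_mul(p, q):
    out = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            exps = Counter(dict(m1))
            exps.update(dict(m2))  # add exponents variable by variable
            m = tuple(sorted(exps.items()))
            out[m] = out.get(m, 0) + c1 * c2
    return out


ids = {("a", "b", "c"): "p", ("d", "b", "e"): "r", ("f", "g", "e"): "s"}
lineage = query({t: frozenset({i}) for t, i in ids.items()}, lin_union, lin_union)
provenance = query({t: poly_var(i) for t, i in ids.items()}, poly_add, poly_mul)

print(lineage[("d", "e")] == lineage[("f", "e")])        # same lineage
print(provenance[("d", "e")] == provenance[("f", "e")])  # different polynomials
```

Under this encoding the polynomial 2r² + rs is the dictionary `{(("r", 2),): 2, (("r", 1), ("s", 1)): 1}`.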

The following standard property of polynomials captures the intuition that N[X] is as “gen-

eral” as any semiring:

Proposition 2.10. Let K be a commutative semiring and X a set of variables. For any valuation v : X → K there exists a unique homomorphism of semirings Eval_v : N[X] → K such that for the one-variable monomials we have Eval_v(x) = v(x).

As the notation suggests, Eval_v(P) evaluates the polynomial P in K given a valuation for its variables. In calculations with the integer coefficients, na (where n ∈ N and a ∈ K) is the sum in K of n copies of a. Note that N is embedded in K by mapping n to the sum of n copies of 1_K.

Using the Eval notation, for any P ∈ N[x1,...,xn] and any K the polynomial function f_P : K^n → K is given by:

$$f_P(a_1, \ldots, a_n) \overset{\mathrm{def}}{=} \mathrm{Eval}_v(P), \qquad v(x_i) = a_i,\ i = 1..n$$

³ These are polynomials in commutative variables so their operations are the same as in middle-school algebra, except that subtraction is not allowed.


Putting together Propositions 2.5 and 2.10 we obtain Theorem 2.11 below, a conceptually important fact that says, informally, that the semantics of RA+ on K-instances for any semiring K factors through the semantics of the same in provenance semirings.

Indeed, let K be a commutative semiring, let I be a K-instance, and let X be the set of tuple ids

of the tuples in supp(I). There is an obvious valuation v : X → K that associates to a tuple id the

tag of that tuple in I.

We associate to I an “abstractly tagged” version, denoted ab(I), which is an X ∪ {0}-relation.

ab(I) is such that supp(ab(I)) = supp(I) and the tuples in supp(ab(I)) are tagged by their own

tuple id. For example, in Figure 2.9(d) we show an abstractly-tagged version of the relation in

Figure 2.9(b). Note that as an X ∪ {0}-relation, ab(I) is a particular kind of N[X]-relation.

Theorem 2.11 (factoring). For any RA+ query Q and K-instance I we have $\llbracket Q \rrbracket^{I} = \mathrm{Eval}_v \circ \llbracket Q \rrbracket^{\mathrm{ab}(I)}$.

To illustrate an instance of this theorem, consider the provenance polynomial 2r² + rs of the tuple (d,e) in Figure 2.5(c). Evaluating it in N for p = 2, r = 5, s = 1 we get 2 · 5² + 5 · 1 = 55, which is indeed the multiplicity of (d,e) in Figure 2.3(a).
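The evaluation step can be written as a small generic routine. This is a minimal sketch of Eval_v under our own encoding (a polynomial as a dictionary from monomials to natural-number coefficients, a monomial as a tuple of (variable, exponent) pairs); the function name `eval_poly` and its parameters are illustrative assumptions.

```python
def eval_poly(poly, v, plus, times, zero, one):
    """Evaluate a polynomial from N[X] in a semiring K under valuation v.

    A coefficient n is interpreted as the sum in K of n copies of the term,
    and an exponent e as the product of e copies of the variable's value.
    """
    total = zero
    for monomial, coeff in poly.items():
        term = one
        for x, e in monomial:
            for _ in range(e):
                term = times(term, v[x])
        for _ in range(coeff):
            total = plus(total, term)
    return total


# 2r^2 + rs, the provenance of (d,e) in Figure 2.5(c).
poly = {(("r", 2),): 2, (("r", 1), ("s", 1)): 1}

# Bag semantics: K = N with the multiplicities p = 2, r = 5, s = 1.
in_bags = eval_poly(poly, {"p": 2, "r": 5, "s": 1},
                    lambda a, b: a + b, lambda a, b: a * b, 0, 1)

# Set semantics: K = B, e.g. asking whether (d,e) survives when s is absent.
in_sets = eval_poly(poly, {"p": True, "r": True, "s": False},
                    lambda a, b: a or b, lambda a, b: a and b, False, True)

print(in_bags)  # 55
print(in_sets)  # True
```

The same stored polynomial is reused for both semirings; only the valuation and the operations change.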

The factoring theorem thus allows pre-computing of provenance polynomial annotations, then

evaluation of them under a particular semantics in a specific K for various applications. As we

shall see in Chapter 3, this is precisely the strategy employed by Orchestra.

2.4 A Hierarchy of Provenance

Besides the provenance models discussed so far, several other forms of provenance information

proposed in the literature can also be captured using semiring annotations. As we shall see,

the various models also turn out to be related in a precise manner, namely by surjective semiring

homomorphisms, as summarized in Figure 2.7.

One such provenance model is obtained from the provenance polynomials by replacing natural

number coefficients with Boolean coefficients:

Definition 2.12 (Boolean Provenance Polynomials). The Boolean provenance polynomials semiring for

X is the semiring of polynomials over variables X with Boolean coefficients: (B[X],+,·,0,1).

Considering the same positive relational algebra query Q as in Section 2.1, Figure 2.6(c) shows

the result of applying Q to R, where R is interpreted as a B[X]-relation. Note that the annotations

in Figure 2.6(c) can be obtained from those in Figure 2.6(b) by simply dropping the numeric

coefficients. In fact, one can check that the operation f : N[X] → B[X] which “drops coefficients”

(i.e., by replacing non-zero coefficients with true) is a surjective semiring homomorphism.


(a) Source R

a  b  c
a  b  c   p
d  b  e   r
f  g  e   s

tuple   (b) N[X]    (c) B[X]   (d) Trio(X)   (e) Why(X)     (f) PosBool(X)   (g) Lin(X)
a  c    2p²         p²         2p            {{p}}          p                {p}
a  e    pr          pr         pr            {{p,r}}        p ∧ r            {p,r}
d  c    pr          pr         pr            {{p,r}}        p ∧ r            {p,r}
d  e    2r² + rs    r² + rs    2r + rs       {{r},{r,s}}    r                {r,s}
f  e    2s² + rs    s² + rs    2s + rs       {{s},{r,s}}    s                {r,s}

Figure 2.6: Comparison of provenance annotations

The Trio system [8] has recently proposed a form of annotated relation suitable as a repre-

sentation system for incomplete bag databases (i.e., where the possible worlds are bag instances

rather than set instances). Trio-style tables can also be captured using annotations from a semiring

we shall denote Trio(X). Like B[X], this semiring can be viewed as being obtained from N[X], but

instead of “dropping coefficients,” this time we “drop exponents.” This is most conveniently formalized using the notion of quotient semirings (see Appendix for definition). Let f : N[X] → N[X] be the mapping that “drops exponents,” e.g., f maps 2x²y + 3xy + 2z³ + 1 to 5xy + 2z + 1. Denote by ≈_f the equivalence relation on N[X] defined by $a \approx_f b \overset{\mathrm{def}}{\iff} f(a) = f(b)$. One can check that ≈_f is a congruence relation (see Appendix for definition). This justifies the following:

Definition 2.13 (Trio Semiring). The Trio semiring for X is the quotient semiring of N[X] by ≈_f, denoted Trio(X).

As an example, considering again the same query Q, Figure 2.6(d) shows the result of applying Q to R, where R is interpreted as a Trio(X)-relation, and an annotation a is understood to represent its equivalence class a/≈_f under ≈_f. Note that the mapping h : N[X] → Trio(X) defined by h(a) = a/≈_f is a surjective semiring homomorphism.

Still another provenance model that can be captured using semirings is the why-provenance

of [17]. The why-provenance of a tuple is the set of sets of “contributing” source tuples, which is

called the proof witness basis in [17]. This can be captured using a semiring [13] (called the proof

why-provenance semiring in [13]):

Definition 2.14 (Why-Provenance). The why-provenance semiring for X is (Why(X), ∪, ⋓, ∅, {∅}) where $\mathrm{Why}(X) \overset{\mathrm{def}}{=} \mathcal{P}(\mathcal{P}(X))$ and ⋓ denotes pairwise union:

$$A \Cup B \overset{\mathrm{def}}{=} \{ a \cup b : a \in A,\ b \in B \}$$

Considering again the same query Q, we can interpret the source relation in Figure 2.6(a) as

a why-provenance relation by doubly-nesting the variables (e.g., p becomes {{p}}). Figure 2.6(e)

shows the query output and the resulting why-provenance annotations. Note that these annota-

tions can be obtained from the B[X]-annotations by dropping exponents (and writing the result

as a set of sets rather than sum of monomials). One can check that the corresponding operation

g : B[X] → Why(X) which “drops exponents” is in fact a surjective semiring homomorphism.

Note also that the annotations can be obtained from the Trio(X)-annotations by dropping coeffi-

cients, and it is easy to verify that the corresponding operation h : Trio(X) → Why(X) which does

this is also a surjective semiring homomorphism.
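The two routes from N[X] to Why(X) described above can be compared directly. In this sketch the encodings are our own assumptions: a polynomial in N[X] is a dictionary from monomials (tuples of (variable, exponent) pairs) to coefficients, an element of B[X] is a set of monomials, an element of Trio(X) is represented by its exponent-free representative, and an element of Why(X) is a set of sets of variables. The function names are illustrative.

```python
def drop_coefficients(poly):
    """N[X] -> B[X]: keep the monomials whose coefficient is non-zero."""
    return frozenset(m for m, c in poly.items() if c != 0)


def drop_exponents(poly):
    """N[X] -> Trio(X): forget exponents, adding coefficients that collide."""
    out = {}
    for m, c in poly.items():
        variables = frozenset(x for x, _ in m)
        out[variables] = out.get(variables, 0) + c
    return out


def bool_to_why(bpoly):
    """B[X] -> Why(X): drop exponents from each monomial."""
    return frozenset(frozenset(x for x, _ in m) for m in bpoly)


def trio_to_why(tpoly):
    """Trio(X) -> Why(X): drop coefficients."""
    return frozenset(m for m, c in tpoly.items() if c != 0)


# 2r^2 + rs, the N[X]-annotation of (d,e) in Figure 2.6(b).
poly = {(("r", 2),): 2, (("r", 1), ("s", 1)): 1}

via_bool = bool_to_why(drop_coefficients(poly))
via_trio = trio_to_why(drop_exponents(poly))
print(via_bool == via_trio)

# The example from the text: 2x^2y + 3xy + 2z^3 + 1 becomes 5xy + 2z + 1.
example = {(("x", 2), ("y", 1)): 2, (("x", 1), ("y", 1)): 3, (("z", 3),): 2, (): 1}
print(drop_exponents(example))
```

Both routes yield {{r},{r,s}}, the Why(X)-annotation of (d,e) in Figure 2.6(e).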

Note that the lineage for an output tuple can be obtained from the why-provenance of the tuple by flattening the set of sets, i.e., applying the function h : Why(X) → Lin(X) defined by $h(I) = \bigcup_{S \in I} S$. Once again, we can show that h is a surjective semiring homomorphism.

Finally, an interesting variation on the why-provenance semiring is obtained by requiring that the witness basis for an output tuple be minimal. Here the domain is irr(P(X)), the set of irredundant subsets of P(X), i.e., W is in irr(P(X)) if for any distinct A, B in W neither is a subset of the other. We can associate with any W ⊆ P(X) a unique irredundant subset irr(W) by repeatedly looking for distinct elements A, B such that A ⊆ B and deleting B from W. Then we define a semiring (irr(P(X)), +, ·, 0, 1) as follows:

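$$I + J \overset{\mathrm{def}}{=} \mathrm{irr}(I \cup J) \qquad I \cdot J \overset{\mathrm{def}}{=} \mathrm{irr}(I \Cup J)$$

$$0 \overset{\mathrm{def}}{=} \emptyset \qquad 1 \overset{\mathrm{def}}{=} \{\emptyset\}$$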

This is the semiring in which we compute the minimal witness basis [17]. It is a well-known

semiring: the construction above is the construction for the free distributive lattice generated by

the set X. Moreover, it is isomorphic to a semiring we have already seen, PosBool(X).
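The why-provenance operations, the irr operator, and the flattening to lineage are easy to experiment with. The following is a minimal sketch with assumed names, representing witnesses as frozensets. One detail is our own reading rather than something spelled out in the text: the empty set of witnesses (the zero of Why(X)) is sent to ⊥, represented here as `None`, since a semiring homomorphism must map 0 to 0.

```python
def why_plus(A, B):
    """Addition in Why(X): union of the two sets of witnesses."""
    return A | B


def why_times(A, B):
    """Multiplication in Why(X): pairwise union of witnesses."""
    return frozenset(a | b for a in A for b in B)


def irr(W):
    """Keep only the minimal witnesses (those with no proper subset in W)."""
    return frozenset(a for a in W if not any(b < a for b in W))


def to_lineage(W):
    """Flatten a set of witnesses into one set of tuple ids.

    Assumption: the empty set of witnesses is mapped to ⊥ (None).
    """
    return None if not W else frozenset().union(*W)


r = frozenset({frozenset({"r"})})
s = frozenset({frozenset({"s"})})

# Why-provenance of (d,e): r.r + r.r + r.s, as in Figure 2.6(e).
why_de = why_plus(why_plus(why_times(r, r), why_times(r, r)), why_times(r, s))

print(why_de)              # the witnesses {r} and {r, s}
print(irr(why_de))         # only the minimal witness {r}
print(to_lineage(why_de))  # the lineage {r, s}
```

The minimal witness basis {{r}} corresponds to the PosBool(X) annotation r of (d,e) in Figure 2.6(f).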

These various models of provenance can be neatly arranged in the diagram of Figure 2.7,

where a path downwards from K1 to K2 indicates that there exists a surjective semiring homomorphism h : K1 → K2. Coupled with Proposition 2.5, this gives a clear picture of the relative “informativeness” of the various provenance models, since provenance computations for models lower in

the hierarchy can always be factored through computations involving models above them in the

hierarchy. The existence of such homomorphisms also turns out to be useful in Chapter 4 where

we shall use them to derive some easy bounds on the relative behavior of the provenance models

with respect to query containment.


[Diagram of the provenance semirings, ordered from most informative to least informative: N[X]; B[X], Trio(X); Why(X); Lin(X), PosBool(X); B]

Figure 2.7: Provenance hierarchy

(a)  Q(x,y) :- R(x,z), R(z,y)

(b)                   (c)
     a  a   2              a  a   2 · 2 = 4
     a  b   3              a  b   2 · 3 + 3 · 4 = 18
     b  b   4              b  b   4 · 4 = 16

Figure 2.8: Datalog with bag semantics

2.5 Datalog on K-Relations

We now seek to give semantics on K-relations to Datalog queries. It is more convenient to use the

unnamed perspective here. We also consider only “pure” Datalog rules in which all subgoals are

relational atoms. Also, to simplify the presentation, we shall focus on Datalog programs with a

single idb-and-output predicate, and we shall identify a Datalog program by its output predicate

name.

First observe that for conjunctive queries over K-instances the semantics in Definition 2.2 sim-

plifies to computing tags as sums of products, each product corresponding to a valuation of the

query variables that makes the query body hold. For example, consider the conjunctive query

and N-relation shown in Figure 2.8(a) and (b), respectively.

There are two valuations that produce the answer Q(a,b): {x ↦ a, z ↦ a, y ↦ b} yields the body R(a,a),R(a,b) while {x ↦ a, z ↦ b, y ↦ b} yields the body R(a,b),R(b,b). The sum of products of tags is 2 · 3 + 3 · 4 which is exactly what the equivalent RA+ query yields according to Definition 2.2. If we think of this conjunctive query as a Datalog program, the two valuations above correspond to the two derivation trees of the tuple Q(a,b).
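The sum-of-products reading can be made explicit by enumerating the valuations. The sketch below uses the N-relation of Figure 2.8(b) and the conjunctive query of Figure 2.8(a); the function name `derivations` and the representation of a valuation as a dictionary are assumptions made for illustration.

```python
# The N-relation R of Figure 2.8(b): tuple -> multiplicity.
R = {("a", "a"): 2, ("a", "b"): 3, ("b", "b"): 4}
domain = sorted({value for t in R for value in t})


def derivations(x, y):
    """List each valuation of Q(x,y) :- R(x,z), R(z,y) that makes the body
    hold, together with the product of the tags of the two body atoms."""
    found = []
    for z in domain:
        if (x, z) in R and (z, y) in R:
            valuation = {"x": x, "z": z, "y": y}
            found.append((valuation, R[(x, z)] * R[(z, y)]))
    return found


for valuation, product in derivations("a", "b"):
    print(valuation, product)

# The tag of Q(a,b) is the sum over all valuations.
print(sum(product for _, product in derivations("a", "b")))  # 18
```

Each entry returned by `derivations` plays the role of one derivation tree, with the product taken over its two leaves.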

This suggests the following generalized proof-theoretic semantics for Datalog on K-relations:

the tag of an answer tuple is the sum over all its derivation trees of the product of the tags of the

leaves of each tree, stated formally in Definition 2.15. Indeed, this generalizes the bag semantics

of Datalog considered in [113, 114] when the number of derivation trees is finite. In general, a tuple


can have infinitely many derivation trees (an algorithm for detecting this appears in [115]) hence

we need to work with semirings in which infinite sums are defined.

Closed semirings [133] have infinite sums but their “⊕” is idempotent which rules out the bag

and provenance semantics. We will adopt the approach used in formal languages [95] and later

show further connections with how semirings and formal power series are used for context-free

languages. By assuming that D is countable, it will suffice to define countable sums.

Let (K,⊕,⊗,0,1) be a semiring and define $a \leq b \overset{\mathrm{def}}{\iff} \exists x\; a \oplus x = b$. When ≤ is a partial order we say that K is naturally ordered. B, N, N[X] and the other semiring examples we gave so far are all naturally ordered.

We say that K is an ω-complete semiring if it is naturally ordered and ≤ is such that ω-chains

have least upper bounds. In such semirings we can define countable sums:

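$$\bigoplus_{n \in \mathbb{N}} a_n \overset{\mathrm{def}}{=} \sup_{m \in \mathbb{N}} \Big( \bigoplus_{i=0}^{m} a_i \Big)$$

Note that if ∃N s.t. ∀n > N, a_n = 0 then $\bigoplus_{n \in \mathbb{N}} a_n = \bigoplus_{i=0}^{N} a_i$. All the semiring examples we gave so far are ω-complete with the exception of N and N[X]. We show below how to “complete” them.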

An ω-continuous semiring is an ω-complete semiring in which the operations ⊕ and ⊗ are

ω-continuous in each argument. It follows that countable sums are associative and commutative,

that ⊗ distributes over countable sums and that countable sums are monotone in each addend.

Examples of commutative ω-continuous semirings:

• (B,∨,∧,false,true)

• (N∞,+,·,0,1) where we add ∞ to the natural numbers and define ∞ + n = n + ∞ = ∞ and ∞·n = n·∞ = ∞ except for ∞·0 = 0·∞ = 0. We can think of N∞ as the ω-continuous “completion” of N.

• (PosBool(B),∨,∧,false,true) with B finite. This commutative semiring is in fact a distributive

lattice [33] and the natural order is the lattice order. Since we identify those expressions that

yield the same truth-value for all boolean assignments for the variables, B finite makes

PosBool(B) finite, hence ω-continuous.

• (P(Ω),∪,∩,∅,Ω), used for event tables which is also an example of distributive lattice.

• (N∞,min,+,∞,0), the tropical semiring [95].

• ([0,1],max,min,0,1) is related to fuzzy sets [137] so we will call it the fuzzy semiring.
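Several of the semirings listed above can be written down concretely and used to evaluate the same sum of products. This Python sketch is illustrative: the `Semiring` record, the constant names, and `sum_of_products` are assumptions, and PosBool(B) and P(Ω) are omitted for brevity. Note that floating-point arithmetic gives ∞ · 0 = NaN, so the multiplication of N∞ needs the special case stated in the text.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any


def nat_inf_times(a, b):
    # inf * 0 is NaN in IEEE arithmetic; the text defines it to be 0.
    return 0 if a == 0 or b == 0 else a * b


BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
NAT_INF = Semiring(lambda a, b: a + b, nat_inf_times, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)
FUZZY = Semiring(max, min, 0.0, 1.0)


def sum_of_products(K, products):
    """Combine tags as a K-sum of K-products, one product per derivation."""
    total = K.zero
    for factors in products:
        term = K.one
        for a in factors:
            term = K.times(term, a)
        total = K.plus(total, term)
    return total


# The two derivations of Q(a,b) in Figure 2.8 use tags (2, 3) and (3, 4).
print(sum_of_products(NAT_INF, [(2, 3), (3, 4)]))   # 18
print(sum_of_products(TROPICAL, [(2, 3), (3, 4)]))  # 5
print(NAT_INF.times(math.inf, 0))                   # 0
```

In the tropical semiring the same expression computes the cheapest derivation (2 + 3 = 5) rather than the number of derivations.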


Definition 2.15. Let (K,⊕,⊗,0,1) be a commutative ω-continuous semiring, and let Q be a Datalog query. For any K-instance I, define

$$\llbracket Q \rrbracket^{I}(t) = \bigoplus_{\tau \text{ yields } t} \Big( \bigotimes_{R(t') \in \mathrm{leaves}(\tau)} R^{I}(t') \Big)$$

where τ ranges over all Q-derivation trees for t and R(t′) ranges over all the leaves of τ.

Proposition 2.16. For any K-instance I, $\llbracket Q \rrbracket^{I}$ has finite support and is therefore a K-relation. Hence, Definition 2.15 gives us a semantics for Datalog queries on K-instances.

Proof. (sketch) Let J be the set instance corresponding to the support of I, and let t′ be an output tuple such that $\llbracket Q \rrbracket^{I}(t') \neq 0$. By Definition 2.15, this implies that there is a derivation tree τ for t′ (such that the tags of the tuples in its leaves are non-zero and correspond to this product). But then $t' \in \llbracket Q \rrbracket^{J}$. Since $\llbracket Q \rrbracket^{J}$ is finite, $\llbracket Q \rrbracket^{I}$ has finite support.

Example 2.17. As an example, consider the Datalog program Q defined by the rules shown in Figure 2.9(c), applied on N-relation R shown in Figure 2.9(a). Since any N-relation is also a N∞-relation and N∞ is ω-continuous we can answer this query⁴ and we obtain the table shown in Figure 2.9(b). Observe that the annotation of (a,b) in the output is 8. Figure 2.10(a) shows how this value was computed as the sum of products of the annotations of the leaves of the two derivation trees for Q(a,b). On the other hand, observe that (b,d) has annotation ∞ in the output. Figure 2.10(b) illustrates the computation of this value, which involves infinitely many derivation trees.

A couple of sanity checks follow.

Proposition 2.18. Let Q be an RA+ query in which the selection predicates only test for attribute equality and let Q′ be the (non-recursive) Datalog query obtained by standard translation from Q. Then Q and Q′ produce the same answer when applied to the same K-instance.

Proposition 2.19. For any Datalog query Q and any B-instance I, $\mathrm{supp}(\llbracket Q \rrbracket^{I})$ is the same as the result of applying Q to the standard set instance supp(I).

The definition of Datalog semantics given above is not so useful computationally. However, we can think of it as a proof-theoretic definition, and as with standard Datalog, it turns out that there is an equivalent, fixpoint-theoretic definition that is much more workable.

Intuitively, this involves representing the possibly infinite sum of products above as a system of

fixpoint equations that reflect all the ways that a tuple can be produced as a result of applying the

⁴ This particular query computes transitive closure with bag semantics.
