Article
PDF Available

Abstract and Figures

Different notions of provenance for database queries have been proposed and studied in the past few years. In this article, we detail three main notions of database provenance and some of their applications, and compare and contrast them. Specifically, we review why-, how-, and where-provenance, describe the relationships among these notions, and survey some of their applications in confidence computation, view maintenance and update, debugging, and annotation propagation.
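As a toy illustration of the three notions (my own sketch, not taken from the article; the relation names, tuple identifiers r1, s1, ..., and the join query are all assumed), the following Python fragment computes why-provenance as witness sets, how-provenance as a provenance polynomial, and where-provenance as the source cell each output value is copied from:

```python
# Hypothetical example for: SELECT name, country FROM Person JOIN Loc USING (city)
Person = {("alice", "paris"): "r1", ("bob", "rome"): "r2"}     # (name, city) -> tuple id
Loc    = {("paris", "france"): "s1", ("rome", "italy"): "s2"}  # (city, country) -> tuple id

why, how, where = {}, {}, {}
for (name, city), rid in Person.items():
    for (city2, country), sid in Loc.items():
        if city == city2:
            out = (name, country)
            # why-provenance: sets of witness tuples that justify the answer
            why.setdefault(out, set()).add(frozenset({rid, sid}))
            # how-provenance: a polynomial recording how the witnesses combine
            how.setdefault(out, []).append(f"{rid}*{sid}")
            # where-provenance: the source cell each output value is copied from
            where[(out, "name")] = (rid, "name")
            where[(out, "country")] = (sid, "country")

print(why)                                         # {('alice','france'): {frozenset({'r1','s1'})}, ...}
print({t: " + ".join(p) for t, p in how.items()})  # {('alice','france'): 'r1*s1', ...}
print(where[(("alice", "france"), "country")])     # ('s1', 'country')
```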
... Another feature of interest is the availability of functional dependencies to describe state invariants that ensure safe partitioning (sharding) of code and state. A third tradition is the body of literature on data provenance [7], which allows data dependencies to be analyzed in subtle ways, with applications to distributed systems including use in efficient fault injection [1]. These topics are beyond the scope of this paper. ...
Preprint
Full-text available
In the Hydro project we are designing a compiler toolkit that can optimize for the concerns of distributed systems, including scale-up and scale-down, availability, and consistency of outcomes across replicas. This invited paper overviews the project, and provides an early walk-through of the kind of optimization that is possible. We illustrate how type transformations as well as local program transformations can combine, step by step, to convert a single-node program into a variety of distributed design points that offer the same semantics with different performance and deployment characteristics.
... The need for documenting datasets is not exclusive to QoE datasets. For example, data provenance has been studied extensively in other fields, such as in the databases community [31,32]. Similarly, more recently, many works have focussed on the process of documenting the creation and use of machine learning datasets. ...
Preprint
Full-text available
Over the years, many subjective and objective quality assessment datasets have been created and made available to the research community. However, there is no standard process for documenting the various aspects of the dataset, such as details about the source sequences, number of test subjects, test methodology, encoding settings, etc. Such information is often of great importance to the users of the dataset as it can help them get a quick understanding of the motivation and scope of the dataset. Without such a template, it is left to each reader to collate the information from the relevant publication or website, which is a tedious and time-consuming process. In some cases, the absence of a template to guide the documentation process can result in an unintentional omission of some important information. This paper addresses this simple but significant gap by proposing a datasheet template for documenting various aspects of subjective and objective quality assessment datasets for multimedia data. The contributions presented in this work aim to simplify the documentation process for existing and new datasets and improve their reproducibility. The proposed datasheet template is available on GitHub, along with sample datasheets for a few open-source audiovisual subjective and objective datasets.
... Outside of the realm of temporal logics one can find the "proof trees as explanations" paradigm in regular expression matching [31] and in the database community [10]. Metric Temporal Logic. ...
Chapter
Full-text available
Runtime monitors analyze system execution traces for policy compliance. Monitors for propositional specification languages, such as metric temporal logic (MTL), produce Boolean verdicts denoting whether the policy is satisfied or violated at a given point in the trace. Given a sufficiently complex policy, it can be difficult for the monitor’s user to understand how the monitor arrived at its verdict. We develop an MTL monitor that outputs verdicts capturing why the policy was satisfied or violated. Our verdicts are proof trees in a sound and complete proof system that we design. We demonstrate that such verdicts can serve as explanations for end users by augmenting our monitor with a graphical interface for the interactive exploration of proof trees. As a second application, our verdicts serve as certificates in a formally verified checker we develop using the Isabelle proof assistant.
Keywords: metric temporal logic, runtime monitoring, explanations, proof system, formal verification, certification
... Another related research area is explaining query answers [72,87]. Given a query and a database, they explain query answers using the tuples in the given database, e.g., using provenance [36]. We, in contrast, explain dataset changes using data transformations. ...
Preprint
Full-text available
In multi-user environments in which data science and analysis is collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework aiming to explain changes between two given dataset versions. Explain-Da-V generates explanations that use data transformations to explain changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that Explain-Da-V generates better explanations than existing data transformation synthesis methods.
Article
Provenance information corresponds to essential metadata that describes the entities, users, and processes involved in the history and evolution of a data object. While the benefits of tracking provenance information have been widely understood in a variety of domains, only recently have provenance solutions gained interest in the security community. Indeed, on the one hand, provenance allows for a reliable historical analysis enabling security-related applications such as forensic analysis and attribution of malicious activity. On the other hand, the unprecedented changes in the threat landscape create demands for securing provenance information to facilitate its trustworthiness. With the recent growth of provenance studies in security, in this work we examine the role of data provenance in security and privacy. To set this work in context, we outline fundamental principles and models of data provenance and explore how the existing studies achieve security principles. We further review the existing schemes for securing data provenance collection and manipulation, known as secure provenance, and the role of data provenance for security and privacy, which we refer to as threat provenance.
Article
Software organizations are increasingly incorporating machine learning (ML) into their product offerings, driving a need for new data management tools. Many of these tools facilitate the initial development of ML applications, but sustaining these applications post-deployment is difficult due to lack of real-time feedback (i.e., labels) for predictions and silent failures that could occur at any component of the ML pipeline (e.g., data distribution shift or anomalous features). We propose a new type of data management system that offers end-to-end observability, or visibility into complex system behavior, for deployed ML pipelines through assisted (1) detection, (2) diagnosis, and (3) reaction to ML-related bugs. We describe new research challenges and suggest preliminary solution ideas in all three aspects. Finally, we introduce an example architecture for a "bolt-on" ML observability system, or one that wraps around existing tools in the stack.
Article
Database systems use static analysis to determine upfront which data is needed for answering a query and use indexes and other physical design techniques to speed-up access to that data. However, for important classes of queries, e.g., HAVING and top-k queries, it is impossible to determine up-front what data is relevant. To overcome this limitation, we develop provenance-based data skipping (PBDS), a novel approach that generates provenance sketches to concisely encode what data is relevant for a query. Once a provenance sketch has been captured it is used to speed up subsequent queries. PBDS can exploit physical design artifacts such as indexes and zone maps.
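A minimal sketch of the idea behind such sketches (my own toy in Python; the partition layout, the top-k query, and all names are assumptions rather than the PBDS implementation): capture which horizontal partitions contributed to a query's answer, then re-run only over those partitions.

```python
# Toy table of (id, score) rows, split into fixed-size horizontal partitions.
rows = [(i, i % 97) for i in range(10_000)]
partitions = [rows[i:i + 1000] for i in range(0, len(rows), 1000)]

def topk(parts, k=5):
    best = sorted((r for p in parts for r in p), key=lambda r: -r[1])[:k]
    # "provenance sketch": indexes of partitions holding at least one result row
    sketch = {i for i, p in enumerate(parts) if any(r in best for r in p)}
    return best, sketch

result, sketch = topk(partitions)                          # capture sketch on first run
rerun, _ = topk([partitions[i] for i in sorted(sketch)])   # later runs skip other partitions
assert result == rerun
```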
Chapter
Why-provenance—explaining why a query result is obtained—is an essential asset for reaching the goal of Explainable AI. For instance, recursive (Datalog) queries may show unexpected derivations due to complex entanglement of database atoms inside recursive rule applications. Provenance, and why-provenance in particular, helps debugging rule sets to eventually obtain the desired set of rules. There are three kinds of approaches to computing why-provenance for Datalog in the literature: (1) the complete ones, (2) the approximate ones, and (3) the theoretical ones. What all these approaches have in common is that they aim at computing provenance for all IDB atoms, while only a few atoms might be requested to be explained. We contribute an on-demand approach: after deriving all entailed facts of a Datalog program, we allow for querying for the provenance of particular IDB atoms, and the structures involved in deriving provenance are computed only then. Our framework is based on terminating existential rules, recording the different rule applications. We present two implementations of the framework, one based on the semiring solver FPsolve, the other one based on Datalog(S), a recent extension of Datalog by set terms. We perform experiments on benchmark rule sets using both implementations and discuss the feasibility of provenance on demand.
Keywords: Datalog provenance, Why-provenance, Datalog(S)
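A rough, hypothetical illustration of on-demand why-provenance (a toy naive Datalog evaluator in Python, not the paper's FPsolve- or Datalog(S)-based implementations): rule applications are recorded during evaluation, and a proof tree is assembled only for the atom a user asks about.

```python
# EDB: edge(x, y); IDB rules: path(x,y) :- edge(x,y).  path(x,z) :- path(x,y), edge(y,z).
edges = {("a", "b"), ("b", "c"), ("c", "d")}
path, deriv = set(), {}                         # derived facts, and one recorded derivation each

changed = True
while changed:                                  # naive bottom-up evaluation to fixpoint
    changed = False
    for (x, y) in edges:                        # path(x,y) :- edge(x,y)
        if ("path", x, y) not in path:
            path.add(("path", x, y))
            deriv[("path", x, y)] = [("edge", x, y)]
            changed = True
    for (_, x, y) in list(path):                # path(x,z) :- path(x,y), edge(y,z)
        for (y2, z) in edges:
            if y == y2 and ("path", x, z) not in path:
                path.add(("path", x, z))
                deriv[("path", x, z)] = [("path", x, y), ("edge", y, z)]
                changed = True

def why(atom):                                  # provenance expanded only when requested
    body = deriv.get(atom, [])
    return {atom: [why(b) if b[0] == "path" else b for b in body]}

print(why(("path", "a", "d")))                  # proof tree for this one atom only
```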
Article
Full-text available
We consider systems for data sharing among heterogeneous peers related by a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to ask queries over related data from other peers as well. To achieve this, every peer’s updates propagate along the mappings to the other peers. However, this update exchange is filtered by trust conditions — expressing what data and sources a peer judges to be authoritative — which may cause a peer to reject another’s updates. In order to support such filtering, updates carry provenance information. These systems target scientific data sharing applications, and their general principles and architecture have been described in [21]. In this paper we present methods for realizing such systems. Specifically, we extend techniques from data integration, data exchange, and incremental view maintenance to propagate updates along mappings; we integrate a novel model for tracking data provenance, such that curators may filter updates based on trust conditions over this provenance; we discuss strategies for implementing our techniques in conjunction with an RDBMS; and we experimentally demonstrate the viability of our techniques in the Orchestra prototype system. This technical report supersedes the version which appeared in VLDB 2007 [17] and corrects certain technical claims regarding the semantics of our system (see errata in Sections 3.1 and 4.1.1).
Article
Full-text available
Information describing the origin of data, generally referred to as provenance, is important in scientific and curated databases where it is the basis for the trust one puts in their contents. Since such databases are constructed using operations of both query and update languages, it is of paramount importance to describe the effect of these languages on provenance. In this article we study provenance for query and update languages that are closely related to SQL, and compare two ways in which they can manipulate provenance so that elements of the input are rearranged to elements of the output: implicit provenance, where a query or update only provides the rearranged output, and provenance is provided implicitly by a default provenance semantics; and explicit provenance, where a query or update provides both the output and the description of the provenance of each component of the output. Although explicit provenance is in general more expressive, we show that the classes of implicit provenance operations expressible by query and update languages correspond to natural semantic subclasses of the explicit provenance queries. One of the consequences of this study is that provenance separates the expressive power of query and update languages. The model is also relevant to annotation propagation schemes in which annotations on the input to a query or update have to be transferred to the output or vice versa.
Article
Full-text available
Science is now being revolutionized by the capabilities of distributing computation and human effort over the World Wide Web. This revolution offers dramatic benefits but also poses serious risks due to the fluid nature of digital information. The Web today does not provide adequate repeatability, reliability, accountability and trust guarantees for scientific applications. One important part of this problem is tracking and managing provenance information, or metadata about sources, authorship, derivation, or other historical aspects of data. In this paper, we discuss motivating examples for provenance in science and computer security, introduce four models of provenance for XML queries, and discuss their properties and prospects for further research.
Conference Paper
We present an annotation management system for relational databases. In this system, every piece of data in a relation is assumed to have zero or more annotations associated with it and annotations are propagated along, from the source to the output, as data is being transformed through a query. Such an annotation management system is important for understanding the provenance and quality of data, especially in applications that deal with integration of scientific and biological data. We present an extension, pSQL, of a fragment of SQL that has three different types of annotation propagation schemes, each useful for different purposes. The default scheme propagates annotations according to where data is copied from. The default-all scheme propagates annotations according to where data is copied from among all equivalent formulations of a given query. The custom scheme allows a user to specify how annotations should propagate. We present a storage scheme for the annotations and describe algorithms for translating a pSQL query under each propagation scheme into one or more SQL queries that would correctly retrieve the relevant annotations according to the specified propagation scheme. For the default-all scheme, we also show how we generate finitely many queries that can simulate the annotation propagation behavior of the set of all equivalent queries, which is possibly infinite. The algorithms are implemented and the feasibility of the system is demonstrated by a set of experiments that we have conducted.
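A minimal sketch of the default scheme (my own toy in Python with assumed relation and annotation names, not the pSQL implementation): annotations travel with a cell exactly when that cell's value is copied from the source to the output.

```python
# Emp(name, dept) with per-cell annotation sets
Emp = [
    {"name": ("alice", {"a1"}), "dept": ("cs", {"a2"})},
    {"name": ("bob",   set()),  "dept": ("ee", {"a3"})},
]

# Default scheme applied to the query: SELECT name FROM Emp WHERE dept = 'cs'
def select_name_where_cs(emp):
    out = []
    for row in emp:
        if row["dept"][0] == "cs":
            # only the copied cell (name) carries its annotations to the output;
            # the dept annotations are not propagated because dept is not output
            out.append({"name": row["name"]})
    return out

print(select_name_where_cs(Emp))   # [{'name': ('alice', {'a1'})}]
```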
Article
The relational model has recently been extended to so-called K-relations in which tuples are assigned a unique value in a semiring K. A query language, denoted by RA+_K, similar to the classical positive relational algebra, allows for the querying of K-relations. We study the completeness of RA+_K in the sense of Bancilhon and Paredaens (BP) and show that RA+_K is, in general, not BP-complete. Faced with this incompleteness, we identify two necessary and sufficient operators that need to be added to RA+_K to make it BP-complete: difference (−) and duplicate elimination (δ). We investigate conditions on semirings under which these constructs can be added in a natural way, and investigate basic properties of our query languages.
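A toy rendering of the model (an assumption-laden Python sketch, not the paper's formalism), using the natural-number semiring so that annotations behave like bag multiplicities: union adds annotations, join multiplies them, projection sums them, and duplicate elimination (δ), one of the operators the paper adds, collapses any nonzero annotation to 1.

```python
from collections import defaultdict

def project(r, idxs):
    out = defaultdict(int)
    for t, k in r.items():
        out[tuple(t[i] for i in idxs)] += k    # projection sums annotations (semiring +)
    return dict(out)

def join(r, s):
    out = defaultdict(int)
    for t, k in r.items():
        for u, m in s.items():
            if t[-1] == u[0]:                  # join last attribute of r with first of s
                out[t + u[1:]] += k * m        # join multiplies annotations (semiring *)
    return dict(out)

def delta(r):                                  # duplicate elimination: not expressible in RA+_K
    return {t: 1 for t, k in r.items() if k != 0}

R = {("a", "x"): 2, ("b", "x"): 1}             # N-annotated relations (bag semantics)
S = {("x", "1"): 3}
print(join(R, S))                 # {('a','x','1'): 6, ('b','x','1'): 3}
print(project(join(R, S), (0,)))  # {('a',): 6, ('b',): 3}
print(delta(join(R, S)))          # {('a','x','1'): 1, ('b','x','1'): 1}
```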
Article
A database view is a portion of the data structured in a way suitable to a specific application. Updates on views must be translated into updates on the underlying database. This paper studies the translation process in the relational model. The procedure is as follows: first, a “complete” set of updates is defined such that together with every update the set contains a “return” update, that is, one that brings the view back to the original state; given two updates in the set, their composition is also in the set. To translate a complete set, we define a mapping called a “translator,” that associates with each view update a unique database update called a “translation.” The constraint on a translation is to take the database to a state mapping onto the updated view. The constraint on the translator is to be a morphism. We propose a method for defining translators. Together with the user-defined view, we define a “complementary” view such that the database could be computed from the view and its complement. We show that a view can have many different complements and that the choice of a complement determines an update policy. Thus, we fix a view complement and we define the translation of a given view update in such a way that the complement remains invariant (“translation under constant complement”). The main result of the paper states that, given a complete set U of view updates, U has a translator if and only if U is translatable under constant complement.
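A small worked example of translation under constant complement (my own toy in Python; the schema and key assumption are hypothetical): with R(emp, dept, sal) keyed by emp, take the view V = π_{emp,dept} and the complement C = π_{emp,sal}; a view update is translated by rebuilding R from the updated view while holding C fixed.

```python
R = {"alice": ("cs", 100), "bob": ("ee", 90)}            # emp -> (dept, sal)

def view(r):        return {e: d for e, (d, _) in r.items()}   # V = π_{emp,dept}
def complement(r):  return {e: s for e, (_, s) in r.items()}   # C = π_{emp,sal}

def translate(r, new_view):
    c = complement(r)                                    # complement held constant
    return {e: (new_view[e], c[e]) for e in new_view}    # R' rebuilt from V' and C

V = view(R)
V["alice"] = "math"                                      # update against the view
R2 = translate(R, V)
assert view(R2) == V and complement(R2) == complement(R) # complement is invariant
print(R2)   # {'alice': ('math', 100), 'bob': ('ee', 90)}
```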
Article
Summary form only given. The goal of the Whips project (WareHousing Information Project at Stanford) is to develop algorithms and tools for the creation and maintenance of a data warehouse (J. Wiener et al., 1996). In particular, we have developed an architecture and implemented a prototype for identifying data changes at distributed heterogeneous sources, transforming them and summarizing them in accordance with warehouse specifications, and incrementally integrating them into the warehouse. In effect, the warehouse stores materialized views of the source data. The Whips architecture is designed specifically to fulfil several important and interrelated goals: sources and warehouse views can be added and removed dynamically; it is scalable by adding more internal modules; changes at the sources are detected automatically; the warehouse may be updated continuously as the sources change, without requiring down time; and the warehouse is always kept consistent with the source data by the integration algorithms. The Whips system is composed of many distinct modules that potentially reside on different machines. Each module is implemented as a CORBA object. They communicate with each other using ILU, a CORBA-compliant object library developed by Xerox PARC.
Article
The lineage of a datum records its processing history. Because such information can be used to trace the source of anomalies and errors in processed data sets, it is valuable to users for a variety of applications, including the investigation of anomalies and debugging. Traditional data lineage approaches rely on metadata. However, metadata does not scale well to fine-grained lineage, especially in large data sets. For example, it is not feasible to store all of the information that is necessary to trace from a specific floating-point value in a processed data set to a particular satellite image pixel in a source data set. In this paper, we propose a novel method to support fine-grained data lineage. Rather than relying on metadata, our approach lazily computes the lineage using a limited amount of information about the processing operators and the base data. We introduce the notions of weak inversion and verification. While our system does not perfectly invert the data, it uses weak inversion and verification to provide a number of guarantees about the lineage it generates. We propose a design for the implementation of weak inversion and verification in an object-relational database management system.
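A minimal sketch of the idea (hypothetical names and operator, not the paper's system): for an aggregation step, a weak inverse maps an output tuple back to candidate source rows by key, and verification re-runs the operator over the candidates to check that they indeed produce the output.

```python
readings = [("s1", 10.0), ("s1", 14.0), ("s2", 7.0)]     # (sensor, value)

def process(rows):                                       # operator: mean value per sensor
    sums = {}
    for k, v in rows:
        total, n = sums.get(k, (0.0, 0))
        sums[k] = (total + v, n + 1)
    return {k: total / n for k, (total, n) in sums.items()}

def weak_inverse(out_key, rows):                         # candidate lineage, selected by key
    return [r for r in rows if r[0] == out_key]

def verify(out_key, out_val, candidates):                # re-apply the operator to confirm
    return process(candidates).get(out_key) == out_val

out = process(readings)
cand = weak_inverse("s1", readings)
print(cand, verify("s1", out["s1"], cand))   # [('s1', 10.0), ('s1', 14.0)] True
```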
Article
To support first-class views in a database system, we must not only allow users to query and browse the database through views, but also allow database updates through views: the well-known view update problem. Although the view update problem has been studied extensively, we take a fresh approach based on data lineage to provide improved results for translating deletions against virtual or materialized views into deletions against the underlying database. Our fully automatic algorithm finds a translation that is guaranteed to be exact (side-effect free) whenever an exact translation exists, using only the view definition at compile-time and the base data at view-update time.
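A toy version of the idea (my own Python sketch, not the paper's algorithm): the view is a join of two base relations, the lineage of the view tuple to be deleted is computed from the base data, and a base deletion is accepted only if it removes exactly that view tuple.

```python
R = {("alice", "cs"), ("bob", "cs")}          # Emp(name, dept)
S = {("cs", "bldg7")}                         # Dept(dept, bldg)

def view(r, s):                               # V(name, bldg) = Emp join Dept on dept
    return {(n, b) for (n, d) in r for (d2, b) in s if d == d2}

def delete_from_view(t, r, s):
    # lineage: pairs of base tuples that together derive the view tuple t
    lineage = [(rt, st) for rt in r for st in s
               if rt[1] == st[0] and (rt[0], st[1]) == t]
    for rt, st in lineage:                    # try deleting each contributing base tuple
        for cand in ((r - {rt}, s), (r, s - {st})):
            if view(*cand) == view(r, s) - {t}:   # exact translation: no side effects
                return cand
    return None                               # no side-effect-free translation exists

print(delete_from_view(("alice", "bldg7"), R, S))
# ({('bob', 'cs')}, {('cs', 'bldg7')})  -- delete alice's Emp tuple, keep the Dept tuple
```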