Article

Collaborative Data Sharing with Mappings and Provenance

Publicly accessible Penn Dissertations DOI:edissertations/58
Source: OAI

ABSTRACT: A key challenge in science today involves integrating data from databases managed by different collaborating scientists. In this dissertation, we develop the foundations and applications of collaborative data sharing systems (CDSSs), which address this challenge. A CDSS allows collaborators to define loose confederations of heterogeneous databases, relating them through schema mappings that establish how data should flow from one site to the next. In addition to simply propagating data along the mappings, it is critical to record data provenance (annotations describing where and how data originated) and to support policies allowing scientists to specify whose data they trust, and when. Since a large data sharing confederation is certain to evolve over time, the CDSS must also efficiently handle incremental changes to data, schemas, and mappings. We focus in this dissertation on the formal foundations of CDSSs, as well as practical issues of their implementation in a prototype CDSS called Orchestra. We propose a novel model of data provenance appropriate for CDSSs, based on a framework of semiring-annotated relations. This framework elegantly generalizes a number of other important database semantics involving annotated relations, including ranked results, prior provenance models, and probabilistic databases. We describe the design and implementation of the Orchestra prototype, which supports update propagation across schema mappings while maintaining data provenance and filtering data according to trust policies. We investigate fundamental questions of query containment and equivalence in the context of provenance information. We use the results of these investigations to develop novel approaches to efficiently propagating changes to data and mappings in a CDSS. Our approaches highlight unexpected connections between the two problems and with the problem of optimizing queries using materialized views. Finally, we show that semiring annotations also make sense for XML and nested relational data, paving the way towards a future extension of CDSS to these richer data models.
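To make the idea of semiring-annotated relations more concrete, the following is a minimal sketch and not code from Orchestra. It assumes the standard convention that alternative derivations of a tuple (union, projection) combine annotations with the semiring's addition, while joint use of tuples (join) combines them with its multiplication. The names `Semiring`, `join`, `project`, `BAG`, `BOOL`, and `MIN_COST` are hypothetical and chosen only for illustration; the min-plus instance stands in for a simple cost-based ranking.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (K, add, mul, zero, one). Hypothetical helper."""
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]   # combines alternative derivations
    mul: Callable[[Any, Any], Any]   # combines jointly used tuples

# A K-relation maps each tuple to its annotation; absent tuples carry `zero`.
Relation = Dict[Tuple, Any]

def join(s: Semiring, r1: Relation, r2: Relation) -> Relation:
    """Join on (last column of r1) = (first column of r2); annotations multiply."""
    out: Relation = {}
    for t1, k1 in r1.items():
        for t2, k2 in r2.items():
            if t1[-1] == t2[0]:
                t = t1 + t2[1:]
                out[t] = s.add(out.get(t, s.zero), s.mul(k1, k2))
    return out

def project(s: Semiring, r: Relation, cols: Tuple[int, ...]) -> Relation:
    """Keep the given columns; annotations of merged tuples are added."""
    out: Relation = {}
    for t, k in r.items():
        p = tuple(t[c] for c in cols)
        out[p] = s.add(out.get(p, s.zero), k)
    return out

# Three illustrative instances (assumptions for this sketch):
BAG = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)            # multiplicities
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)  # set semantics
MIN_COST = Semiring(float("inf"), 0.0, min, lambda a, b: a + b)          # cheapest derivation

def run_query(s: Semiring, r: Relation, q: Relation) -> Relation:
    """The same query, evaluated under whichever semiring is supplied."""
    return project(s, join(s, r, q), (1, 2))

bag_result = run_query(BAG, {("a", "b"): 2, ("c", "b"): 3}, {("b", "d"): 5})
bool_result = run_query(BOOL, {("a", "b"): True, ("c", "b"): True}, {("b", "d"): True})
cost_result = run_query(MIN_COST, {("a", "b"): 2.0, ("c", "b"): 3.0}, {("b", "d"): 5.0})
```

The point of the sketch is that `run_query` is written once and only the semiring changes: under `BAG` the output tuple `("b", "d")` carries multiplicity 2·5 + 3·5 = 25, under `BOOL` it is simply present, and under `MIN_COST` it carries the cost of its cheapest derivation, min(2 + 5, 3 + 5) = 7.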

Keywords

collaborative data sharing systems
data provenance
database semantics
formal foundations
framework elegantly generalizes
fundamental questions
future extension
integrating data
large data sharing confederation
nested relational data
novel approaches
Orchestra prototype
practical issues
prior provenance models
propagating data
prototype CDSS
record data provenance
richer data models
schema mappings
trust policies