Collaborative Data Sharing with Mappings and Provenance

Publicly accessible Penn Dissertations
Source: OAI

ABSTRACT A key challenge in science today involves integrating data from databases managed by different collaborating scientists. In this dissertation, we develop the foundations and applications of collaborative data sharing systems (CDSSs), which address this challenge. A CDSS allows collaborators to define loose confederations of heterogeneous databases, relating them through schema mappings that establish how data should flow from one site to the next. In addition to simply propagating data along the mappings, it is critical to record data provenance (annotations describing where and how data originated) and to support policies allowing scientists to specify whose data they trust, and when. Since a large data sharing confederation is certain to evolve over time, the CDSS must also efficiently handle incremental changes to data, schemas, and mappings. We focus in this dissertation on the formal foundations of CDSSs, as well as practical issues of their implementation in a prototype CDSS called Orchestra. We propose a novel model of data provenance appropriate for CDSSs, based on a framework of semiring-annotated relations. This framework elegantly generalizes a number of other important database semantics involving annotated relations, including ranked results, prior provenance models, and probabilistic databases. We describe the design and implementation of the Orchestra prototype, which supports update propagation across schema mappings while maintaining data provenance and filtering data according to trust policies. We investigate fundamental questions of query containment and equivalence in the context of provenance information. We use the results of these investigations to develop novel approaches to efficiently propagating changes to data and mappings in a CDSS. Our approaches highlight unexpected connections between the two problems and with the problem of optimizing queries using materialized views. Finally, we show that semiring annotations also make sense for XML and nested relational data, paving the way towards a future extension of CDSS to these richer data models.
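
As a minimal illustration of the semiring-annotated relations mentioned in the abstract, the Python sketch below tags each tuple with an element of a semiring K. It is not code from Orchestra, and every name in it (Semiring, compose, COUNTING, BOOLEAN, TROPICAL) is hypothetical. Using two tuples jointly multiplies their annotations, and alternative derivations of the same output tuple add them. Swapping the semiring changes the meaning of the result: counting yields bag multiplicities, Boolean yields set semantics, and the tropical (min, +) semiring yields a lowest-cost ranking.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

# Hypothetical helper types for illustration only; not part of Orchestra.
@dataclass(frozen=True)
class Semiring:
    zero: Any                       # identity for add (annotation of an absent tuple)
    one: Any                        # identity for mul
    add: Callable[[Any, Any], Any]  # combines alternative derivations
    mul: Callable[[Any, Any], Any]  # combines tuples used jointly

# A K-relation maps each tuple to its annotation in K.
Relation = Dict[Tuple, Any]

def compose(sr: Semiring, r: Relation, s: Relation) -> Relation:
    """Evaluate Q(a, c) :- R(a, b), S(b, c) over annotated binary relations."""
    out: Relation = {}
    for (a, b), k1 in r.items():
        for (b2, c), k2 in s.items():
            if b == b2:
                joint = sr.mul(k1, k2)  # both tuples are needed together
                out[(a, c)] = sr.add(out.get((a, c), sr.zero), joint)
    return out

COUNTING = Semiring(0, 1, lambda x, y: x + y, lambda x, y: x * y)
BOOLEAN = Semiring(False, True, lambda x, y: x or y, lambda x, y: x and y)
TROPICAL = Semiring(float("inf"), 0, min, lambda x, y: x + y)

# The same data read under different semirings.
R = {(1, 2): 2, (1, 3): 1}
S = {(2, 4): 3, (3, 4): 5}

bag_result = compose(COUNTING, R, S)   # multiplicity: 2*3 + 1*5
cost_result = compose(TROPICAL, R, S)  # cheapest derivation: min(2+3, 1+5)
set_result = compose(
    BOOLEAN,
    {t: True for t in R},
    {t: True for t in S},
)
```

Because the query evaluator only ever calls add and mul, the same code serves all three semantics; this is the sense in which one annotated-relation framework can cover several previously separate models.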

  • Source
    ABSTRACT: Data integration is a pervasive challenge faced in applications that need to query across multiple autonomous and heterogeneous data sources. Data integration is crucial in large enterprises that own a multitude of data sources, for progress in large-scale scientific projects, where data sets are being produced independently by multiple researchers, for better cooperation among government agencies, each with their own data sources, and in offering good search quality across the millions of structured data sources on the World-Wide Web. Ten years ago we published "Querying Heterogeneous Information Sources using Source Descriptions" [73], a paper describing some aspects of the Information Manifold data integration project. The Information Manifold and many other projects conducted at the time [5, 6, 20, 25, 38, 43, 51, 66, 100] have led to tremendous progress on data integration and to quite a few commercial data integration products. This paper offers a perspective on the contributions of the Information Manifold and its peers, describes some of the important bodies of work in the data integration field in the last ten years, and outlines some challenges to data integration research today. We note in advance that this is not intended to be a comprehensive survey of data integration, and even though the reference list is long, it is by no means complete.
    Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006; 01/2006
  • Source
    ABSTRACT: The Semantic Web envisions a World Wide Web in which data is described with rich semantics and applications can pose complex queries. To this point, researchers have defined new languages for specifying meanings for concepts and developed techniques for reasoning about them, using RDF as the data model. To flourish, the Semantic Web needs to be able to accommodate the huge amounts of existing data and the applications operating on them. To achieve this, we are faced with two problems. First, most of the world's data is available not in RDF but in XML; XML and the applications consuming it rely not only on the domain structure of the data, but also on its document structure. Hence, to provide interoperability between such sources, we must map between both their domain structures and their document structures. Second, data management practitioners often prefer to exchange data through local point-to-point data translations, rather than mapping to common mediated schemas or ontologies.
  • Source
    ABSTRACT: Most of our research and scholarship now depends on curated databases. A curated database is any kind of structured repository such as a traditional database, an ontology or an XML file, that is created and updated with a great deal of human effort. For example, most reference works (dictionaries, encyclopaedias, gazetteers, etc.) that we used to find on the reference shelves of libraries are now curated databases; and because it is now so easy to publish databases on the web, there has been an explosion in the number of new curated databases used in scientific research. Curated databases are of particular importance to digital librarians because the central component of a digital library – its catalogue or metadata – is very likely to be a curated database. The value of curated databases lies in the organisation, the annotation and the quality of the data they contain. Like the paper reference works they have replaced, they usually represent the efforts of a dedicated group of people to produce a definitive description of enterprise or some subject area.
    09/2009: pages 2-2;

