A Survey of Data Provenance in e-Science
Yogesh L. Simmhan Beth Plale Dennis Gannon
Computer Science Department
Indiana University, Bloomington, IN 47405
{ysimmhan, plale, gannon}@cs.indiana.edu
ABSTRACT
Data management is growing in complexity as large-
scale applications take advantage of the loosely coupled
resources brought together by grid middleware and by
abundant storage capacity. Metadata describing the data
products used in and generated by these applications is
essential to disambiguate the data and enable reuse. Data
provenance, one kind of metadata, pertains to the
derivation history of a data product starting from its
original sources.
In this paper we create a taxonomy of data provenance
characteristics and apply it to current research efforts in
e-science, focusing primarily on scientific workflow
approaches. The main aspect of our taxonomy
categorizes provenance systems based on why they
record provenance, what they describe, how they
represent and store provenance, and ways to disseminate
it. The survey culminates with an identification of open
research problems in the field.
1. Introduction
The growing number and size of computational and data
resources coupled with uniform access mechanisms
provided by a common Grid middleware stack is
allowing scientists to perform advanced scientific tasks
in collaboratory environments. Scientific workflows are
the means by which these tasks can be composed. The
workflows can generate terabytes of data, mandating
rich and descriptive metadata about the data in order to
make sense of it and reuse it. One kind of metadata is
provenance (also referred to as lineage and pedigree),
which tracks the steps by which the data was derived
and can add significant value in such data-intensive
e-science projects.
Scientific domains use different forms of provenance
and for various purposes. Publications are a common
form of representing the provenance of experimental
data and results. Increasingly, Digital Object Identifiers
(DOIs) [1] are used to cite data used in experiments so
that the papers can relate the experimental process and
analysis which form the data’s lineage to the actual
data used and produced. Some scientific fields go
beyond this and store lineage information in a machine
accessible and understandable form. Geographic
information system (GIS) standards suggest that
metadata about the quality of datasets should include a
description of the lineage of the data product to help
data users decide whether the dataset meets the
requirements of their application [2]. Materials engineers
choose materials for the design of critical components,
such as those for an airplane, based on statistical
analysis, and it is essential to establish the pedigree of
this data to prevent system failures and to support
audits [3]. When biological and biomedical data is shared
in life sciences research, the presence of its
transformation record provides a context in which the data
can be used and also credits the author(s) of the
data [4]. Knowledge of provenance is also relevant from
the perspective of regulatory mechanisms to protect
intellectual property [5]. With a large number of datasets
appearing in the public domain, it is increasingly
important to determine their veracity and quality. A
detailed history of the data will allow the users to
discern for themselves if the data is acceptable.
Provenance can be described in various terms depending
on the domain where it is applied. Buneman et al [6],
who refer to data provenance in the context of database
systems, define it as the description of the origins of
data and the process by which it arrived at the database.
Lanter [7], who discusses derived data products in GIS,
characterizes lineage as information describing materials
and transformations applied to derive the data.
Provenance can be associated not just with data
products, but with the process(es) that enabled their
creation as well. Greenwood et al [8] expand Lanter’s
definition and view it as metadata recording the process
of experiment workflows, annotations, and notes about
experiments. For the purposes of this paper, we define
data provenance as information that helps determine the
derivation history of a data product, starting from its
original sources. We use the term data product or dataset
to refer to data in any form, such as files, tables, and
virtual collections. The two important features of the
provenance of a data product are the ancestral data
product(s) from which this data product evolved, and the
process of transformation of these ancestral data
product(s), potentially through workflows, that helped
derive this data product.
In this survey, we compare current data provenance
research in the scientific domain. Based on an extensive
survey of the literature on provenance [9], we have
developed a taxonomy of provenance techniques that we
use to analyze five selected systems. Four of the projects
use workflows to perform scientific experiments and
simulations. The fifth research work investigates
provenance techniques for data transformed through
queries in database systems. The relationship between
SIGMOD Record, Vol. 34, No. 3, Sept. 2005
workflows and database queries with respect to lineage
is evident¹. Research on tracking the lineage of database
queries and on managing provenance in workflow
systems share a symbiotic relationship, and the
possibility of developing cross-cutting techniques is
something we expose in this study. We conclude this
survey with an identification of open research problems.
The complete version of this survey [9] reviews an
additional four systems and also investigates the use of
provenance in the business domain.
While data provenance has gained increasing interest
recently due to unique desiderata introduced by
distributed data in Grids, few sources are available in
the literature that compare across approaches. Bose et al
[10] survey lineage retrieval systems, workflow
systems, and collaborative environments, with the goal
of proposing a meta-model for a systems architecture for
lineage retrieval. Our taxonomy based on usage, subject,
representation, storage, and dissemination more fully
captures the unique characteristics of these provenance
systems. Miles et al [11] study use cases for recording
provenance in e-science experiments for the purposes of
defining the technical requirements for a provenance
architecture. We prescribe no particular model but
instead discuss extant models for lineage management
that can guide future provenance management systems.
2. Taxonomy of Provenance Techniques
Different approaches have been taken to support data
provenance requirements for individual domains. In this
section, we present a taxonomy of these techniques from
a conceptual level with brief discussions on their pros
and cons. A summary of the taxonomy is given in Figure
1. Each of the five main headings is discussed in turn.
2.1 Application of Provenance
Provenance systems can support a number of uses [12,
13]. Goble [14] summarizes several applications of
provenance information as follows:
• Data Quality: Lineage can be used to estimate data
quality and data reliability based on the source data
and transformations [4]. It can also provide proof
statements on data derivation [15].
• Audit Trail: Provenance can be used to trace the audit
trail of data [11], determine resource usage [8], and
detect errors in data generation [16].
• Replication Recipes: Detailed provenance information
can allow repetition of data derivation, help maintain
its currency [11], and be a recipe for replication [17].
• Attribution: Pedigree can establish the copyright and
ownership of data, enable its citation [4], and
determine liability in case of erroneous data.
• Informational: A generic use of lineage is to query
based on lineage metadata for data discovery. It can
also be browsed to provide a context to interpret data.
¹ Workflows form a graph of processes that transform data
products. Database queries can form a graph of operations
that operate on tables.
2.2 Subject of Provenance
Provenance information can be collected about different
resources in the data processing system and at multiple
levels of detail. The provenance techniques we surveyed
focus on data, but this data lineage can either be
available explicitly or deduced indirectly. In an explicit
model, which we term a data-oriented model, lineage
metadata is specifically gathered about the data product.
One can delineate the provenance metadata about the
data product from metadata concerning other resources.
This contrasts with a process-oriented, or indirect, model
where the deriving processes are the primary entities for
which provenance is collected, and the data provenance
is determined by inspecting the input and output data
products of these processes [18].
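The process-oriented model described above can be illustrated with a minimal Python sketch. The record schema (ProcessRecord), the function names, and the file names are hypothetical and are not taken from any surveyed system; the sketch only shows how data lineage can be deduced by inspecting the inputs and outputs recorded for each process.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessRecord:
    """Provenance captured about one process run (hypothetical schema)."""
    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def producers_of(dataset: str, log: list[ProcessRecord]) -> list[ProcessRecord]:
    """Return the process runs that list `dataset` among their outputs."""
    return [rec for rec in log if dataset in rec.outputs]


def direct_ancestors(dataset: str, log: list[ProcessRecord]) -> set[str]:
    """Deduce parent datasets indirectly, from process inputs and outputs."""
    parents: set[str] = set()
    for rec in producers_of(dataset, log):
        parents.update(rec.inputs)
    return parents


# Illustrative log: provenance is recorded about processes, not about data.
log = [
    ProcessRecord("calibrate", inputs=("raw.dat",), outputs=("calibrated.dat",)),
    ProcessRecord("regrid", inputs=("calibrated.dat", "grid.cfg"),
                  outputs=("gridded.dat",)),
]

print(sorted(direct_ancestors("gridded.dat", log)))
```

In a data-oriented model, by contrast, the parent list would be stored directly as metadata of gridded.dat rather than deduced from the process log.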
The usefulness of provenance in a certain domain is
linked to the granularity at which it is collected. The
requirements range from provenance on attributes and
tuples in a database [19] to provenance for collections of
files, say, generated by an ensemble experiment run
[20]. Increasing use of abstract datasets [17, 18] that
refer to data at any granularity or format allows a flexible
approach. The cost of collecting and storing provenance
can grow as the granularity at which it is recorded
becomes finer.
2.3 Representation of Provenance
Different techniques can be used to represent
provenance information, some of which depend on the
underlying data processing system. The manner in
which provenance is represented has implications on the
cost of recording it and the richness of its usage. The
two major approaches to representing provenance
information use either annotations or inversion. In the
former, metadata comprising the derivation history of
a data product is collected as annotations and
descriptions about source data and processes. This is an
eager form [21] of representation in that provenance is
pre-computed and readily usable as metadata.
Alternatively, the inversion method uses the property by
which some derivations can be inverted to find the input
data supplied to them to derive the output data.
Examples include queries and user-defined functions in
databases that can be inverted automatically or by
explicit functions [19, 22, 23]. In this case, information
about the queries and the output data may suffice to
identify the source data.
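The contrast between the two representations can be sketched in a few lines of Python. The table contents, the threshold, and all function names below are illustrative assumptions; the derivation is a simple select-and-project over an in-memory table, chosen because it is easy to invert, and it does not reproduce the inversion algorithms of the cited work.

```python
# Source table: row id -> (station, temperature). Illustrative data only.
source = {
    1: ("A", 31.0),
    2: ("B", 18.5),
    3: ("A", 27.2),
    4: ("C", 31.0),
}

THRESHOLD = 25.0


def hot_stations(rows):
    """The plain derivation: distinct stations with a reading above THRESHOLD."""
    return sorted({station for station, temp in rows.values() if temp > THRESHOLD})


def hot_stations_annotated(rows):
    """Eager (annotation) style: record contributing row ids while deriving."""
    out = {}
    for row_id, (station, temp) in rows.items():
        if temp > THRESHOLD:
            out.setdefault(station, []).append(row_id)
    return out  # station -> ids of the source rows it was derived from


def invert_hot_station(station, rows):
    """Lazy (inversion) style: recompute contributing rows from the query itself."""
    return sorted(
        row_id
        for row_id, (s, temp) in rows.items()
        if s == station and temp > THRESHOLD
    )


print(hot_stations(source))
print(hot_stations_annotated(source)["A"])
print(invert_hot_station("A", source))
```

The annotated variant pays a storage cost for every output, while the inverted variant stores nothing extra but needs the query definition and the source table to be available when lineage is requested.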
While the inversion method has the advantage of being
more compact than the annotation approach, the
information it provides is sparse and limited to the
derivation history of the data. Annotations, on the other
hand, can be richer and, in addition to the derivation
history, often include the parameters passed to the
derivation processes, the versions of the workflows that
will enable reproduction of the data, or even related
publication references [24].
Figure 1 Taxonomy of Provenance
[Figure content: the Provenance Taxonomy has five branches.
Use of Provenance: Data Quality, Audit Trail, Replication,
Attribution, Informational. Subject of Provenance: Data
Oriented, Process Oriented, Granularity (Fine Grained,
Coarse Grained). Provenance Representation: Representation
Scheme (Annotation, Inversion), Contents (Syntactic
Information, Semantic Information). Storing Provenance:
Scalability, Overhead. Provenance Dissemination: Visual
Graph, Queries, Service API.]
There is no metadata standard for lineage representation
across disciplines, and due to their diverse needs, it is a
challenge for a suitable one to evolve [25]. Many current
provenance systems that use annotations have adopted
XML for representing the lineage information [11, 18,
25, 26]. Some also capture semantic information within
provenance using domain ontologies in languages like
RDF and OWL [18, 25]. Ontologies precisely express
the concepts and relationships used in the provenance
and provide good contextual information.
2.4 Provenance Storage
Provenance information can grow to be larger than the
data it describes if the data is fine-grained and the
provenance information is rich, so the manner in which
the provenance metadata is stored is important to its
scalability. The inversion method discussed in section
2.3 is arguably more scalable than using annotations
[19]. However, one can reduce storage needs in the
annotation method by recording just the immediately
preceding transformation step that creates the data and
then recursively inspecting the provenance information of
its ancestor data products to recover the complete
derivation history.
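The storage-saving idea of keeping only the immediately preceding step can be sketched as follows. The record layout and the dataset names are hypothetical; the sketch assumes that a dataset with no stored record is an original source.

```python
# Each dataset stores only its immediate derivation step (hypothetical records).
one_step = {
    "forecast.nc": {"process": "run_model", "inputs": ["gridded.dat", "params.cfg"]},
    "gridded.dat": {"process": "regrid", "inputs": ["calibrated.dat"]},
    "calibrated.dat": {"process": "calibrate", "inputs": ["raw.dat"]},
}


def full_history(dataset, store, seen=None):
    """Expand one-step records recursively into the complete derivation history.

    Returns (process, inputs, output) steps, ancestors first.
    """
    seen = set() if seen is None else seen
    if dataset in seen or dataset not in store:
        return []  # already visited, or an original source with no record
    seen.add(dataset)
    step = store[dataset]
    history = []
    for parent in step["inputs"]:
        history.extend(full_history(parent, store, seen))
    history.append((step["process"], tuple(step["inputs"]), dataset))
    return history


for process, inputs, output in full_history("forecast.nc", one_step):
    print(f"{process}: {', '.join(inputs)} -> {output}")
```

The trade-off is that retrieval cost grows with the depth of the derivation chain, since every ancestor's record must be fetched to answer a full-history query.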
Provenance can be tightly coupled to the data it
describes and located in the same data storage system or
even be embedded within the data file, as done in the
headers of NASA Flexible Image Transport System
files. Such approaches can ease maintaining the integrity
of provenance, but make it harder to publish and search
just the provenance. Provenance can also be stored with
other metadata or simply by itself [26]. In maintaining
provenance, we should consider if it is immutable, or if
it can be updated to reflect the current state of its
predecessors, or whether it should be versioned [14].
The provenance collection mechanism and its storage
repository also determine the trust one places in the
provenance and if any mediation service is needed [11].
Management of provenance incurs costs for its
collection and for its storage. Less frequently used
provenance information can be archived to reduce
storage overhead, or a demand-supply model based on
usefulness can retain only the provenance that is
frequently used. If provenance depends on users manually
adding annotations rather than on automatic collection,
the burden on the user may prevent complete provenance
from being recorded and made available in a machine
accessible form that has semantic value [18].
2.5 Provenance Dissemination
In order to use provenance, a system should allow rich
and diverse means to access it. A common way of
disseminating provenance data is through a derivation
graph that users can browse and inspect [16, 18, 25, 26].
Users can also search for datasets based on their
provenance metadata, such as to locate all datasets
generated by executing a certain workflow. If semantic
provenance information is available, these query results
can automatically feed input datasets for a workflow at
runtime [25]. The derivation history of datasets can be
used to replicate data at another site, or update it if a
dataset is stale due to changes made to its ancestors [27].
Provenance retrieval APIs can additionally allow users
to implement their own mechanism of usage.
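Two of the access patterns mentioned above, searching for datasets by the workflow that generated them and detecting stale datasets, can be sketched against a small in-memory catalog. The catalog layout, the workflow names, and the version-number convention are assumptions made for illustration only.

```python
# Hypothetical provenance catalog. Each entry records the generating workflow
# and, for every input, the version of that input used at derivation time.
catalog = {
    "raw.dat": {"workflow": None, "inputs": [], "version": 3},
    "calibrated.dat": {"workflow": "wf-clean", "inputs": [("raw.dat", 2)], "version": 1},
    "summary.csv": {"workflow": "wf-stats", "inputs": [("calibrated.dat", 1)], "version": 1},
}


def datasets_from_workflow(workflow, cat):
    """Query: all datasets generated by executing the given workflow."""
    return sorted(name for name, meta in cat.items() if meta["workflow"] == workflow)


def is_stale(name, cat):
    """A dataset is stale if any ancestor has changed since it was derived."""
    for parent, used_version in cat[name]["inputs"]:
        if cat[parent]["version"] != used_version or is_stale(parent, cat):
            return True
    return False


print(datasets_from_workflow("wf-clean", catalog))
print([name for name in catalog if is_stale(name, catalog)])
```

Here raw.dat has moved to version 3 after calibrated.dat was derived from version 2, so both calibrated.dat and its descendant summary.csv are reported as stale.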
3. Survey of Data Provenance Techniques
In our full survey of data provenance [9], we discuss
nine major works that, taken together, provide a
comprehensive overview of research in this field. In this
paper, five works have been selected for discussion. A
summary of their characteristics, as defined by the
taxonomy, can be found in Table 1.
3.1 Chimera
Chimera [27] manages the derivation and analysis of
data objects in collaboratory environments and collects
provenance in the form of data derivation steps for
datasets [17]. Provenance is used for on-demand
regeneration of derived data (“virtual data”), comparison
of data, and auditing data derivations.
Chimera uses a process oriented model to record
provenance. Users construct workflows (called
derivation graphs or DAGs) using a Virtual Data
Language (VDL) [17, 27]. The VDL conforms to a
schema that represents data products as abstract typed
datasets and their materialized replicas. Datasets can be
files, tables, and objects of varying granularity, though
the prototype supports only files. Computational process
templates, called transformations, are scripts in the file
system and, in the future, web services [17]. Parameterized
instances of the transformations, called
derivations, can be connected to form workflows that
consume and produce replicas. Upon execution,
workflows automatically create invocation objects for
each derivation in the workflow, annotated with runtime
information of the process. Invocation objects are the
glue that links input and output data products, and they
constitute an annotation scheme for representing the
provenance. Semantic information on the dataset
derivation is not collected.
The lineage in Chimera is represented in VDL that maps
to SQL queries in a relational database, accessed
through a virtual data catalog (VDC) service [27].
Metadata can be stored in a single VDC, or distributed
over multiple VDC repositories with inter-catalog
references to data and processes, to enable scaling.
Lineage information can be retrieved from the VDC
using queries written in VDL that can, for example,
recursively search for derivations that generated a
particular dataset. A virtual data browser that uses the
VDL queries to interactively access the catalog is
proposed [27]. A novel use of provenance in Chimera is
to plan and estimate the cost of regenerating datasets.
When a dataset has been previously created and it needs
to be regenerated (e.g. to create a new replica), its
provenance guides the workflow planner in selecting an
optimal plan for resource allocation [17, 27].
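The idea of using recorded provenance to estimate regeneration cost can be sketched in Python. This is not VDL and not Chimera's planner: the Invocation record, the runtime figures, and the cost rule (sum the recorded runtimes of every derivation whose output is not already materialized) are simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """Runtime record of one executed derivation (hypothetical schema)."""
    derivation: str
    inputs: tuple
    output: str
    runtime_s: float


# Invocation records indexed by the dataset they produced.
invocations = {
    inv.output: inv
    for inv in [
        Invocation("calibrate(raw)", ("raw.dat",), "calibrated.dat", 40.0),
        Invocation("regrid(calibrated)", ("calibrated.dat",), "gridded.dat", 90.0),
    ]
}


def regeneration_cost(dataset, materialized, records):
    """Estimate the seconds needed to rebuild `dataset`.

    Datasets that already have a replica, or that have no invocation record
    (assumed to be original sources), cost nothing. Shared ancestors in a
    diamond-shaped graph would be counted once per path in this sketch.
    """
    if dataset in materialized or dataset not in records:
        return 0.0
    inv = records[dataset]
    return inv.runtime_s + sum(
        regeneration_cost(parent, materialized, records) for parent in inv.inputs
    )


print(regeneration_cost("gridded.dat", {"raw.dat"}, invocations))
print(regeneration_cost("gridded.dat", {"raw.dat", "calibrated.dat"}, invocations))
```

A planner could compare such estimates against the cost of transferring an existing replica before deciding whether to recompute or copy.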
3.2 myGrid
myGrid provides middleware in support of in silico
(computational laboratory) experiments in biology,
modeled as workflows in a Grid environment [18].
myGrid services include resource discovery, workflow
enactment, and metadata and provenance management,
which enable integration and present a semantically
enhanced information model for bio-informatics.
myGrid is service-oriented and executes workflows
written in the XScufl language using the Taverna engine
[18]. A provenance log of the workflow enactment
contains the services invoked, their parameters, the start
and end times, the data products used and derived, and
ontology descriptions, and it is automatically recorded
when the workflow executes. This process-oriented
workflow derivation log is inverted to infer the
provenance for the intermediate and final data products.
Users need to annotate workflows and services with
semantic descriptions to enable this inference and have
the semantic metadata carried over to the data products.
In addition to contextual and organizational metadata
such as owner, project, and experiment hypothesis,
ontological terms can also be provided to describe the
data and the experiment [8]. XML, HTML, and RDF are
used to represent syntactic and semantic provenance
metadata using the annotation scheme [14]. The
granularity at which provenance can be stored is flexible
and is any resource identifiable by an LSID [18].
The myGrid Information Repository (mIR) data service
is a central repository built over a relational database to
store metadata about experimental components [18]. A
number of ways are available for knowledge discovery
using provenance. For instance, the semantic
provenance information available as RDF can be viewed
as a labeled graph using the Haystack semantic web
browser [18]. COHSE (Conceptual Open Hypermedia
Services Environment), a semantic hyperlink utility, is
another tool used to build a semantic web of
provenance. Here, semantically annotated provenance
logs are interlinked using an ontology reasoning service
and displayed as a hyperlinked web page. Provenance
information generated during the execution of a
workflow can also trigger the rerun of another workflow
whose input data parameters it may have updated.
3.3 CMCS
The CMCS project is an informatics toolkit for
collaboration and metadata-based data management for
multi-scale science [24, 25]. CMCS manages
heterogeneous data flows and metadata across multi-
disciplinary sciences such as combustion research,
supplemented by provenance metadata for establishing
the pedigree of data. CMCS uses the Scientific
Annotation Middleware (SAM) repository for storing
URL referenceable files and collections [25].
CMCS uses an annotation scheme to associate XML
metadata properties with the files in SAM and manages
them through a Distributed Authoring and Versioning
(WebDAV) interface. Files form the level of granularity
and all resources such as data objects, processes, web
services, and bibliographic records are modeled as files.
Dublin Core (DC) verbs like Has Reference, Issued, and
Is Version Of are used as XML properties for data files
and semantically relate them to their deriving processes
through XLink references in SAM [24]. DC elements
like Title and Creator, and user-defined metadata can
provide additional context information. Heterogeneous
metadata schemas are supported by mapping them to
standard DC metadata terms using XSLT translators.
Direct association of provenance metadata with the data
object makes this a data-oriented model.
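A data-oriented annotation of this kind can be sketched with Python's standard xml.etree.ElementTree module. The Dublin Core and XLink namespace URIs are the standard ones, but the wrapper element name, the choice of properties, and the example values and URL are hypothetical and do not reproduce the actual SAM schema.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
XLINK = "http://www.w3.org/1999/xlink"

for prefix, uri in (("dc", DC), ("dcterms", DCTERMS), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def describe_file(title, creator, issued, version_of_url):
    """Build Dublin Core style properties for one data file."""
    props = ET.Element("properties")  # hypothetical wrapper element
    ET.SubElement(props, f"{{{DC}}}title").text = title
    ET.SubElement(props, f"{{{DC}}}creator").text = creator
    ET.SubElement(props, f"{{{DCTERMS}}}issued").text = issued
    # The provenance relation points at the related resource via an XLink href.
    link = ET.SubElement(props, f"{{{DCTERMS}}}isVersionOf")
    link.set(f"{{{XLINK}}}href", version_of_url)
    return props


props = describe_file(
    "Flame speed run 12",
    "A. Researcher",
    "2005-03-01",
    "http://example.org/sam/flame-speed-run-11.dat",
)
print(ET.tostring(props, encoding="unicode"))
```

Because the properties travel with the file they describe, a client can read a file's pedigree without consulting a separate process log.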
There is no facility for automated collection of lineage
from a workflow’s execution. Data files and their
metadata are populated by DAV-aware applications in
workflows or manually entered by scientists through a
portal interface [25]. Provenance metadata properties
can be queried from SAM using generic WebDAV
clients. Special portlets allow users to traverse the
provenance metadata for a resource as a web page with
hyperlinks to related data, or as a labeled graph
represented in the Graphics eXchange Language (GXL).
The provenance information can also be exported to
RDF that semantic agents can use to infer relationships
between resources. Provenance metadata that indicate
data modification can generate notifications that trigger
workflow execution to update dependent data products.
Table 1 Summary of characteristics of surveyed data provenance techniques
Characteristic | Chimera | myGrid | CMCS | ESSW | Trio
Applied Domain | Physics, Astronomy | Biology | Chemical Sciences | Earth Sciences | None
Workflow Type | Script Based | Service Oriented | Service Oriented | Script Based | Database Query
Use of Provenance | Informational; Audit; Data Replication | Context Information; Re-enactment | Informational; update data | Informational | Informational; update propagation
Subject | Process | Process | Data | Both | Data
Granularity | Abstract datasets (Presently files) | Abstract resources having LSID | Files | Files | Tuples in Database
Representation Scheme | Virtual Data Language Annotations | XML/RDF Annotations | DublinCore XML Annotations | XML/RDF Annotations | Query Inversion
Semantic Info. | No | Yes | Yes | Proposed | No
Storage Repository/Backend | Virtual Data Catalog/Relational DB | mIR repository/Relational DB | SAM over DAV/Relational DB | Lineage Server/Relational DB | Relational DB
User Overhead | User defines derivations; Automated WF trace | User defines Service semantics; Automated WF trace | Manual: Apps use DAV APIs; Users use portal | Use Libraries to generate provenance | Inverse queries automatically generated
Scalability Addressed | Yes | No | No | Proposed | No
Dissemination | Queries | Semantic browser; Lineage graph | Browser; Queries; GXL/RDF | Browser | SQL/TriQL Queries
3.4 ESSW
The Earth System Science Workbench (ESSW) [28] is a
metadata management and data storage system for earth
science researchers. Lineage is a key facet of the
metadata created in the workbench, and is used for
detecting errors in derived data products and in
determining the quality of datasets.
ESSW uses a scripting model for data processing, i.e., all
data manipulation is done through scripts that wrap
existing scientific applications [26]. The sequence of
invocation of these scripts by a master workflow script
forms a DAG. Data products at the granularity of files
are consumed and produced by the scripts, with each
data product and script having a uniquely labeled
metadata object. As the workflow script invokes
individual scripts, these scripts, as part of their
execution, compose XML metadata for themselves and
the data products they generate. The workflow script
links the data flow between successive scripts using
their metadata ids to form the lineage trace for all data
products, represented as annotations. By chaining the
scripts and the data using parent-child links, ESSW is
balanced between data and process oriented lineage.
ESSW puts the onus on the script writer to record the
metadata and lineage using templates and libraries that
are provided. The libraries store metadata objects as
files in a web accessible location and the lineage
separately in a relational database [26]. Scalability is not
currently addressed though it is proposed to federate
lineage across organizations. The metadata and lineage
information can be navigated as a workflow DAG
through a web browser that uses PHP scripts to access
the lineage database [28]. Future work includes
encoding lineage information semantically as RDF
triples to help answer richer queries [26].
3.5 Trio
Cui and Widom [22, 29] trace lineage information for
view data in data warehouses. The Trio project [23]
leverages some of this work in a proposed database
system which has data accuracy and data lineage as
inherent components. While data warehouse mining and
updating motivate lineage tracking in this project, any
e-science system that uses database queries and
functions to model workflows and data transformations
can apply such techniques.
A database view can be modeled as a query tree that is
evaluated bottom-up, starting with leaf operators having
tables as inputs and successive parent operators taking
as input the result of a child operator [22]. For ASPJ
(Aggregate-Select-Project-Join operator) views, it is
possible to create an inverse query of the view query
that operates on the materialized view, and recursively
moves down the query tree to identify the source tables
in the leaves that form the view data’s lineage [22].
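The inverse-query idea can be illustrated with Python's built-in sqlite3 module. The table, the view, and the hand-written inverse query below are assumptions chosen for a single aggregate-select view; the sketch does not reproduce the general algorithm of [22], nor the automatic generation of inverse queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO sales (id, region, amount) VALUES
        (1, 'east', 10.0), (2, 'west', 5.0), (3, 'east', 7.5), (4, 'west', 1.0);
    -- Materialized aggregate-select view: total of sales above 2.0 per region.
    CREATE TABLE region_totals AS
        SELECT region, SUM(amount) AS total
        FROM sales WHERE amount > 2.0 GROUP BY region;
""")


def lineage_of(region):
    """Inverse query: ids of the source tuples behind one view tuple."""
    rows = conn.execute(
        "SELECT s.id FROM sales AS s "
        "JOIN region_totals AS v ON s.region = v.region "
        "WHERE v.region = ? AND s.amount > 2.0 ORDER BY s.id",
        (region,),
    )
    return [row[0] for row in rows]


print(lineage_of("east"))
print(lineage_of("west"))
```

The inverse query re-applies the view's selection predicate and matches on the grouping attribute, so the sale with id 4 is excluded from the lineage of the 'west' tuple because it never contributed to the total.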
Trio [23] uses this inversion model to automatically
determine the source data for tuples created by view
queries. The inverse queries are recorded at the
granularity of a view tuple and stored in a special
Lineage table. This direct association of lineage with
tuples makes this a data-oriented provenance scheme.
Mechanisms to handle (non-view) tuples created by
insert and update queries, and through user-defined
functions are yet to be determined. Lineage in Trio is
simply the source tuples and the view query that created
the view tuple, with no semantic metadata recorded.
Scalability is not specifically addressed either. Other
than querying the Lineage table, some special purpose
constructs will be provided for retrieving lineage
information through a Trio Query Language (TriQL).
4. Conclusion
In this paper, we presented a taxonomy to understand
and compare provenance techniques used in e-science
projects. The exercise shows that provenance is still an
exploratory field and several open research questions are
exposed. Ways to federate provenance information and
assert its truthfulness need study for it to be usable
across organizations [12]. Evolution of metadata and
service interface standards to manage provenance in
diverse domains will also contribute to a wider adoption
of provenance and promote its sharing [11]. The ability
to seamlessly represent provenance of data derived from
both workflows and databases can help in its portability.
Ways to store provenance about missing or deleted data
(phantom lineage [23]) require further consideration.
Finally, a deeper understanding of provenance is needed
to identify novel ways to leverage it to its full potential.
5. References
[1] J. Brase, "Using Digital Library Techniques - Registration
of Scientific Primary Data," in ECDL, 2004.
[2] D. G. Clarke and D. M. Clark, "Lineage," in Elements of
Spatial Data Quality, 1995.
[3] J. L. Romeu, "Data Quality and Pedigree," in Material
Ease, 1999.
[4] H. V. Jagadish and F. Olken, "Database Management for
Life Sciences Research," in SIGMOD Record, vol. 33, 2004.
[5] "Access to genetic resources and Benefit-Sharing (ABS)
Program," United Nations University, 2003.
[6] P. Buneman, S. Khanna, and W. C. Tan, "Why and Where:
A Characterization of Data Provenance," in ICDT, 2001.
[7] D. P. Lanter, "Design of a Lineage-Based Meta-Data Base
for GIS," in Cartography and Geographic Information
Systems, vol. 18, 1991.
[8] M. Greenwood, C. Goble, R. Stevens, J. Zhao, M. Addis,
D. Marvin, L. Moreau, and T. Oinn, "Provenance of e-Science
Experiments - experience from Bioinformatics," in
Proceedings of the UK OST e-Science 2nd AHM, 2003.
[9] Y. L. Simmhan, B. Plale, and D. Gannon, "A Survey of
Data Provenance Techniques," in Technical Report TR-618:
Computer Science Department, Indiana University, 2005.
[10] R. Bose and J. Frew, "Lineage retrieval for scientific data
processing: a survey," in ACM Comput. Surv., vol. 37, 2005.
[11] S. Miles, P. Groth, M. Branco, and L. Moreau, "The
requirements of recording and using provenance in e-Science
experiments," in Technical Report, Electronics and Computer
Science, University of Southampton, 2005.
[12] D. Pearson, "Presentation on Grid Data Requirements
Scoping Metadata & Provenance," in Workshop on Data
Derivation and Provenance, Chicago, 2002.
[13] G. Cameron, "Provenance and Pragmatics," in Workshop
on Data Provenance and Annotation, Edinburgh, 2003.
[14] C. Goble, "Position Statement: Musings on Provenance,
Workflow and (Semantic Web) Annotations for
Bioinformatics," in Workshop on Data Derivation and
Provenance, Chicago, 2002.
[15] P. P. da Silva, D. L. McGuinness, and R. McCool,
"Knowledge Provenance Infrastructure," in IEEE Data
Engineering Bulletin, vol. 26, 2003.
[16] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-
A. Saita, "Improving Data Cleaning Quality Using a Data
Lineage Facility," in DMDW, 2001.
[17] I. T. Foster, J. S. Vöckler, M. Wilde, and Y. Zhao, "The
Virtual Data Grid: A New Model and Architecture for Data-
Intensive Collaboration," in CIDR, 2003.
[18] J. Zhao, C. A. Goble, R. Stevens, and S. Bechhofer,
"Semantically Linking and Browsing Provenance Logs for E-
science," in ICSNW, 2004.
[19] A. Woodruff and M. Stonebraker, "Supporting Fine-
grained Data Lineage in a Database Visualization
Environment," in ICDE, 1997.
[20] B. Plale, D. Gannon, D. Reed, S. Graves, K.
Droegemeier, B. Wilhelmson, and M. Ramamurthy, "Towards
Dynamically Adaptive Weather Analysis and Forecasting in
LEAD," in ICCS workshop on Dynamic Data Driven
Applications, 2005.
[21] D. Bhagwat, L. Chiticariu, W. C. Tan, and G.
Vijayvargiya, "An Annotation Management System for
Relational Databases," in VLDB, 2004.
[22] Y. Cui and J. Widom, "Practical Lineage Tracing in Data
Warehouses," in ICDE, 2000.
[23] J. Widom, "Trio: A System for Integrated Management of
Data, Accuracy, and Lineage," in CIDR, 2005.
[24] C. Pancerella, J. Hewson, W. Koegler, D. Leahy, M. Lee,
L. Rahn, C. Yang, J. D. Myers, B. Didier, R. McCoy, K.
Schuchardt, E. Stephan, T. Windus, K. Amin, S. Bittner, C.
Lansing, M. Minkoff, S. Nijsure, G. von Laszewski, R. Pinzon,
B. Ruscic, A. Wagner, B. Wang, W. Pitz, Y. L. Ho, D.
Montoya, L. Xu, T. C. Allison, W. H. Green Jr., and M.
Frenklach, "Metadata in the collaboratory for multi-scale
chemical science," in Dublin Core Conference, 2003.
[25] J. Myers, C. Pancerella, C. Lansing, K. Schuchardt, and
B. Didier, "Multi-Scale Science, Supporting Emerging
Practice with Semantically Derived Provenance," in ISWC
workshop on Semantic Web Technologies for Searching and
Retrieving Scientific Data, 2003.
[26] R. Bose and J. Frew, "Composing Lineage Metadata with
XML for Custom Satellite-Derived Data Products," in
SSDBM, 2004.
[27] I. T. Foster, J.-S. Vöckler, M. Wilde, and Y. Zhao,
"Chimera: A Virtual Data System for Representing, Querying,
and Automating Data Derivation," in SSDBM, 2002.
[28] J. Frew and R. Bose, "Earth System Science Workbench:
A Data Management Infrastructure for Earth Science
Products," in SSDBM, 2001.
[29] Y. Cui and J. Widom, "Lineage tracing for general data
warehouse transformations," in VLDB Journal, vol. 12, 2003.
SIGMOD Record, Vol. 34, No. 3, Sept. 2005