A Survey of Data Provenance in e-Science
Yogesh L. Simmhan Beth Plale Dennis Gannon
Computer Science Department
Indiana University, Bloomington, IN 47405
{ysimmhan, plale, gannon}@cs.indiana.edu
ABSTRACT
Data management is growing in complexity as large-
scale applications take advantage of the loosely coupled
resources brought together by grid middleware and by
abundant storage capacity. Metadata describing the data
products used in and generated by these applications is
essential to disambiguate the data and enable reuse. Data
provenance, one kind of metadata, pertains to the
derivation history of a data product starting from its
original sources.
In this paper we create a taxonomy of data provenance
characteristics and apply it to current research efforts in
e-science, focusing primarily on scientific workflow
approaches. The main aspect of our taxonomy
categorizes provenance systems based on why they
record provenance, what they describe, how they
represent and store provenance, and ways to disseminate
it. The survey culminates with an identification of open
research problems in the field.
1. Introduction
The growing number and size of computational and data
resources, coupled with uniform access mechanisms
provided by a common Grid middleware stack, are
allowing scientists to perform advanced scientific tasks
in collaboratory environments. Scientific workflows are
the means by which these tasks can be composed. The
workflows can generate terabytes of data, mandating
rich and descriptive metadata about the data in order to
make sense of it and reuse it. One kind of metadata is
provenance (also referred to as lineage and pedigree),
which tracks the steps by which the data was derived
and can add significant value in such data-intensive
e-science projects.
Scientific domains use different forms of provenance
and for various purposes. Publications are a common
form of representing the provenance of experimental
data and results. Increasingly, Digital Object Identifiers
(DOIs) [1] are used to cite data used in experiments so
that papers can relate the experimental process and
analysis, which form the data's lineage, to the actual
data used and produced. Some scientific fields go
beyond this and store lineage information in a machine
accessible and understandable form. Geographic
information system (GIS) standards suggest that
metadata about the quality of datasets should include a
description of the data product's lineage, to help data
users decide whether the dataset meets the requirements
of their application [2]. Materials engineers select
materials for critical components, such as airplane parts,
based on statistical analyses of materials data, so it is
essential to establish the pedigree of this data to prevent
system failures and to support audits [3]. When sharing
biological and biomedical data in life sciences research,
a record of the data's transformations gives a context in
which it can be used and also credits the author(s) of the
data [4]. Knowledge of provenance is also relevant from
the perspective of regulatory mechanisms to protect
intellectual property [5]. With a large number of datasets
appearing in the public domain, it is increasingly
important to determine their veracity and quality. A
detailed history of the data allows users to discern for
themselves whether the data is acceptable.
Provenance can be described in various terms depending
on the domain where it is applied. Buneman et al. [6],
who refer to data provenance in the context of database
systems, define it as the description of the origins of
data and the process by which it arrived at the database.
Lanter [7], who discusses derived data products in GIS,
characterizes lineage as information describing materials
and transformations applied to derive the data.
Provenance can be associated not just with data
products, but with the process(es) that enabled their
creation as well. Greenwood et al. [8] expand Lanter's
definition and view it as metadata recording the process
of experiment workflows, annotations, and notes about
experiments. For the purposes of this paper, we define
data provenance as information that helps determine the
derivation history of a data product, starting from its
original sources. We use the term data product or dataset
to refer to data in any form, such as files, tables, and
virtual collections. The two important features of the
provenance of a data product are the ancestral data
product(s) from which this data product evolved, and the
process of transformation of these ancestral data
product(s), potentially through workflows, that helped
derive this data product.
In this survey, we compare current data provenance
research in the scientific domain. Based on an extensive
survey of the literature on provenance [9], we have
developed a taxonomy of provenance techniques that we
use to analyze five selected systems. Four of the projects
use workflows to perform scientific experiments and
simulations. The fifth research work investigates
provenance techniques for data transformed through
queries in database systems. The relationship between
workflows and database queries with respect to lineage
is evident: workflows form a graph of processes that
transform data products, while database queries form a
graph of operations that operate on tables. Research on
tracking the lineage of database
queries and on managing provenance in workflow
systems share a symbiotic relationship, and the
possibility of developing cross-cutting techniques is
something we expose in this study. We conclude this
survey with an identification of open research problems.
The complete version of this survey [9] reviews an
additional four systems and also investigates the use of
provenance in the business domain.
While data provenance has gained increasing interest
recently due to unique desiderata introduced by
distributed data in Grids, few sources are available in
the literature that compare across approaches. Bose et
al. [10] survey lineage retrieval systems, workflow
systems, and collaborative environments, with the goal
of proposing a meta-model for a systems architecture for
lineage retrieval. Our taxonomy based on usage, subject,
representation, storage, and dissemination more fully
captures the unique characteristics of these provenance
systems. Miles et al. [11] study use cases for recording
provenance in e-science experiments for the purposes of
defining the technical requirements for a provenance
architecture. We prescribe no particular model but
instead discuss extant models for lineage management
that can guide future provenance management systems.
2. Taxonomy of Provenance Techniques
Different approaches have been taken to support data
provenance requirements for individual domains. In this
section, we present a taxonomy of these techniques from
a conceptual level with brief discussions on their pros
and cons. A summary of the taxonomy is given in Figure
1. Each of the five main headings is discussed in turn.
2.1 Application of Provenance
Provenance systems can support a number of uses [12,
13]. Goble [14] summarizes several applications of
provenance information as follows:
- Data Quality: Lineage can be used to estimate data
quality and data reliability based on the source data
and transformations [4]. It can also provide proof
statements on data derivation [15].
- Audit Trail: Provenance can be used to trace the audit
trail of data [11], determine resource usage [8], and
detect errors in data generation [16].
- Replication Recipes: Detailed provenance information
can allow repetition of data derivation, help maintain
its currency [11], and be a recipe for replication [17].
- Attribution: Pedigree can establish the copyright and
ownership of data, enable its citation [4], and
determine liability in case of erroneous data.
- Informational: A generic use of lineage is to query
based on lineage metadata for data discovery. It can
also be browsed to provide a context to interpret data.
2.2 Subject of Provenance
Provenance information can be collected about different
resources in the data processing system and at multiple
levels of detail. The provenance techniques we surveyed
focus on data, but this data lineage can either be
available explicitly or deduced indirectly. In an explicit
model, which we term a data-oriented model, lineage
metadata is specifically gathered about the data product.
One can delineate the provenance metadata about the
data product from metadata concerning other resources.
This contrasts to a process-oriented, or indirect, model
where the deriving processes are the primary entities for
which provenance is collected, and the data provenance
is determined by inspecting the input and output data
products of these processes [18].
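To make the contrast concrete, the following minimal Python sketch shows both subject models; the record layouts and names are invented for illustration and are not drawn from any surveyed system.

```python
# Toy sketch of data-oriented vs. process-oriented provenance models.

# Data-oriented: lineage metadata is attached to the data product itself.
data_oriented = {
    "output.nc": {"derived_from": ["input.nc"], "by": "filter_step"},
}

# Process-oriented: only process invocations are recorded; a product's
# lineage must be deduced from the inputs and outputs of each invocation.
process_log = [
    {"process": "filter_step",
     "inputs": ["input.nc"], "outputs": ["output.nc"]},
]

def deduce_lineage(product, log):
    """Find the invocations (and hence the inputs) that produced a product."""
    return [run for run in log if product in run["outputs"]]

print(data_oriented["output.nc"])                 # explicit lookup
print(deduce_lineage("output.nc", process_log))   # indirect deduction
```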
The usefulness of provenance in a certain domain is
linked to the granularity at which it is collected. The
requirements range from provenance on attributes and
tuples in a database [19] to provenance for collections of
files, say, generated by an ensemble experiment run
[20]. The increasing use of abstract datasets [17, 18] that
refer to data of any granularity or format allows a flexible
approach. The cost of collecting and storing provenance
typically grows as its granularity becomes finer.
2.3 Representation of Provenance
Different techniques can be used to represent
provenance information, some of which depend on the
underlying data processing system. The manner in
which provenance is represented has implications for the
cost of recording it and the richness of its usage. The
two major approaches to representing provenance
information use either annotations or inversion. In the
former, metadata comprising the derivation history of
a data product is collected as annotations and
descriptions about source data and processes. This is an
eager form [21] of representation, in that provenance is
pre-computed and readily usable as metadata.
Alternatively, the inversion method uses the property by
which some derivations can be inverted to find the input
data supplied to them to derive the output data.
Examples include queries and user-defined functions in
databases that can be inverted automatically or by
explicit functions [19, 22, 23]. In this case, information
about the queries and the output data may suffice to
identify the source data.
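The difference between the two schemes can be sketched in a few lines of Python; the structures below are illustrative only. The annotation scheme materializes provenance eagerly as metadata, while the inversion scheme keeps only the deriving query and reconstructs the sources on demand.

```python
# Annotation (eager): the derivation history is pre-computed and stored
# alongside the data product. All field names here are invented.
annotated = {
    "id": "ds42",
    "provenance": {
        "sources": ["ds17", "ds23"],
        "process": "regrid",
        "parameters": {"resolution": "0.5deg"},
    },
}

# Inversion (lazy): only the deriving query is kept; source data is
# identified by inverting it when provenance is actually requested.
view = {"id": "dept_totals",
        "query": "SELECT dept, SUM(sal) FROM emp GROUP BY dept"}

def inverse_query(view_tuple):
    """Inverse of the aggregate view above for one output tuple."""
    return f"SELECT * FROM emp WHERE dept = '{view_tuple['dept']}'"

print(annotated["provenance"]["sources"])   # readily usable metadata
print(inverse_query({"dept": "eng"}))       # computed on demand
```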
While the inversion method has the advantage of being
more compact than the annotation approach, the
information it provides is sparse and limited to the
derivation history of the data. Annotations, on the other
hand, can be richer and, in addition to the derivation
history, often include the parameters passed to the
derivation processes, the versions of the workflows that
will enable reproduction of the data, or even related
publication references [24].

Figure 1. Taxonomy of provenance techniques. The taxonomy has five main headings: use of provenance (data quality, audit trail, replication, attribution, informational); subject of provenance (data-oriented or process-oriented, at fine-grained or coarse-grained granularity); provenance representation (annotation or inversion scheme, with syntactic or semantic contents); storing provenance (scalability, overhead); and provenance dissemination (visual graph, queries, service API).
There is no metadata standard for lineage representation
across disciplines, and due to their diverse needs, it is a
challenge for a suitable one to evolve [25]. Many current
provenance systems that use annotations have adopted
XML for representing the lineage information [11, 18,
25, 26]. Some also capture semantic information within
provenance using domain ontologies in languages like
RDF and OWL [18, 25]. Ontologies precisely express
the concepts and relationships used in the provenance
and provide good contextual information.
2.4 Provenance Storage
Provenance information can grow to be larger than the
data it describes if the data is fine-grained and the
provenance information rich. So the manner in which
the provenance metadata is stored is important to its
scalability. The inversion method discussed in section
2.3 is arguably more scalable than using annotations
[19]. However, one can reduce storage needs in the
annotation method by recording just the immediately
preceding transformation step that creates the data and
recursively inspecting the provenance information of
those ancestors for the complete derivation history.
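This recursive scheme can be sketched in a few lines of Python, with invented file names: each product stores only its immediate parents, and the complete derivation history is reconstructed by walking the ancestor links.

```python
# Each product records only the step that immediately produced it.
immediate_parents = {
    "final.nc": ["stage2.nc"],
    "stage2.nc": ["stage1.nc", "params.txt"],
    "stage1.nc": ["raw.nc"],
}

def full_history(product):
    """Recursively expand one-step records into the full lineage tree."""
    return {p: full_history(p) for p in immediate_parents.get(product, [])}

print(full_history("final.nc"))
# {'stage2.nc': {'stage1.nc': {'raw.nc': {}}, 'params.txt': {}}}
```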
Provenance can be tightly coupled to the data it
describes and located in the same data storage system or
even be embedded within the data file, as done in the
headers of NASA Flexible Image Transport System
files. Such approaches can ease maintaining the integrity
of provenance, but make it harder to publish and search
just the provenance. Provenance can also be stored with
other metadata or simply by itself [26]. In maintaining
provenance, we should consider if it is immutable, or if
it can be updated to reflect the current state of its
predecessors, or whether it should be versioned [14].
The provenance collection mechanism and its storage
repository also determine the trust one places in the
provenance and if any mediation service is needed [11].
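As a concrete illustration of embedding provenance in the data file itself, the sketch below writes derivation notes into a FITS header's HISTORY cards using astropy; the history strings are invented, and a real system would record much richer metadata.

```python
from astropy.io import fits

# Embed lineage notes directly in the file's own header.
hdu = fits.PrimaryHDU()
hdu.header.add_history("Derived from raw_scan_0042.fits")  # invented note
hdu.header.add_history("Calibrated with flatfield v2.1")   # invented note
hdu.writeto("product.fits", overwrite=True)

# The lineage travels with the data, but searching provenance now
# requires opening every file instead of querying a metadata store.
for card in fits.getheader("product.fits")["HISTORY"]:
    print(card)
```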
Management of provenance incurs costs for its
collection and for its storage. Less frequently used
provenance information can be archived to reduce
storage overhead, or a demand-supply model can retain
provenance only for frequently used data. If provenance
depends on users manually adding annotations rather
than being collected automatically, the burden on the
user may prevent complete provenance from being
recorded in a machine-accessible form that has semantic
value [18].
2.5 Provenance Dissemination
In order to use provenance, a system should allow rich
and diverse means to access it. A common way of
disseminating provenance data is through a derivation
graph that users can browse and inspect [16, 18, 25, 26].
Users can also search for datasets based on their
provenance metadata, for example to locate all datasets
generated by executing a certain workflow. If semantic
provenance information is available, these query results
can automatically feed input datasets for a workflow at
runtime [25]. The derivation history of datasets can be
used to replicate data at another site, or update it if a
dataset is stale due to changes made to its ancestors [27].
Provenance retrieval APIs can additionally allow users
to implement their own mechanism of usage.
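A toy Python sketch of such a dissemination interface, over an invented in-memory store: one function answers lineage queries (all datasets generated by a given workflow) and another exposes the derivation graph that a browser could render.

```python
# Invented provenance records; a real store would live in a database.
records = [
    {"dataset": "out1.nc", "workflow": "forecast", "inputs": ["a.nc"]},
    {"dataset": "out2.nc", "workflow": "forecast", "inputs": ["b.nc"]},
    {"dataset": "plot.png", "workflow": "viz", "inputs": ["out1.nc"]},
]

def datasets_from_workflow(workflow):
    """Query use: locate all datasets generated by a given workflow."""
    return [r["dataset"] for r in records if r["workflow"] == workflow]

def derivation_edges():
    """Graph use: (input, output) edges for a derivation-graph browser."""
    return [(i, r["dataset"]) for r in records for i in r["inputs"]]

print(datasets_from_workflow("forecast"))   # ['out1.nc', 'out2.nc']
print(derivation_edges())
```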
3. Survey of Data Provenance Techniques
In our full survey of data provenance [9], we discuss
nine major works that, taken together, provide a
comprehensive overview of research in this field. In this
paper, five works have been selected for discussion. A
summary of their characteristics, as defined by the
taxonomy, can be found in Table 1.
3.1 Chimera
Chimera [27] manages the derivation and analysis of
data objects in collaboratory environments and collects
provenance in the form of data derivation steps for
datasets [17]. Provenance is used for on-demand
regeneration of derived data (“virtual data”), comparison
of data, and auditing data derivations.
Chimera uses a process-oriented model to record
provenance. Users construct workflows (called
derivation graphs or DAGs) using a Virtual Data
Language (VDL) [17, 27]. The VDL conforms to a
schema that represents data products as abstract typed
datasets and their materialized replicas. Datasets can be
files, tables, and objects of varying granularity, though
the prototype supports only files. Computational process
templates, called transformations, are scripts in the file
system and, in the future, web services [17]. The
parameterized instance of the transformations, called
derivations, can be connected to form workflows that
consume and produce replicas. Upon execution,
workflows automatically create invocation objects for
each derivation in the workflow, annotated with runtime
information of the process. Invocation objects are the
glue that link input and output data products, and they
constitute an annotation scheme for representing the
provenance. Semantic information on the dataset
derivation is not collected.
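The relationship between these entities can be approximated in Python as follows; the class and field names are ours, not Chimera's actual VDL schema.

```python
from dataclasses import dataclass, field

@dataclass
class Transformation:      # computational process template
    name: str

@dataclass
class Derivation:          # parameterized instance of a transformation
    transformation: Transformation
    params: dict
    inputs: list
    outputs: list

@dataclass
class Invocation:          # created automatically when a workflow runs
    derivation: Derivation
    runtime_info: dict = field(default_factory=dict)

# Invented example wiring the three entities together:
regrid = Transformation("regrid")
d = Derivation(regrid, {"res": "1deg"}, ["raw.dat"], ["grid.dat"])
inv = Invocation(d, {"host": "node12", "exit_code": 0})
print(inv.derivation.inputs, "->", inv.derivation.outputs)
```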
The lineage in Chimera is represented in VDL, which
maps to SQL queries over a relational database, accessed
through a virtual data catalog (VDC) service [27].
Metadata can be stored in a single VDC, or distributed
over multiple VDC repositories with inter-catalog
references to data and processes, to enable scaling.
Lineage information can be retrieved from the VDC
using queries written in VDL that can, for example,
recursively search for derivations that generated a
particular dataset. A virtual data browser that uses the
VDL queries to interactively access the catalog is
proposed [27]. A novel use of provenance in Chimera is
to plan and estimate the cost of regenerating datasets.
When a dataset has been previously created and it needs
to be regenerated (e.g. to create a new replica), its
provenance guides the workflow planner in selecting an
optimal plan for resource allocation [17, 27].
3.2 myGrid
myGrid provides middleware in support of in silico
(computational laboratory) experiments in biology,
modeled as workflows in a Grid environment [18].
myGrid services include resource discovery, workflow
enactment, and metadata and provenance management,
which enable integration and present a semantically
enhanced information model for bioinformatics.
myGrid is service-oriented and executes workflows
written in the XScufl language using the Taverna engine
[18]. A provenance log of the workflow enactment
contains the services invoked, their parameters, the start
and end times, the data products used and derived, and
ontology descriptions, and it is automatically recorded
when the workflow executes. This process-oriented
workflow derivation log is inverted to infer the
provenance for the intermediate and final data products.
Users need to annotate workflows and services with
semantic descriptions to enable this inference and have
the semantic metadata carried over to the data products.
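The inference step can be sketched in Python as follows; the log fields are invented for illustration and do not reflect Taverna's actual log format.

```python
# A process-oriented enactment log, one entry per service invocation.
enactment_log = [
    {"service": "blast", "start": "t0", "end": "t1",
     "inputs": ["seq.fasta"], "outputs": ["hits.xml"]},
    {"service": "annotate", "start": "t1", "end": "t2",
     "inputs": ["hits.xml"], "outputs": ["report.html"]},
]

def provenance_of(product, log):
    """Invert the log: walk backwards from a product to its sources."""
    for step in log:
        if product in step["outputs"]:
            return {"via": step["service"],
                    "from": [provenance_of(i, log) or i
                             for i in step["inputs"]]}
    return None  # no step produced it, so it is a source

print(provenance_of("report.html", enactment_log))
# {'via': 'annotate', 'from': [{'via': 'blast', 'from': ['seq.fasta']}]}
```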
In addition to contextual and organizational metadata
such as owner, project, and experiment hypothesis,
ontological terms can also be provided to describe the
data and the experiment [8]. XML, HTML, and RDF are
used to represent syntactic and semantic provenance
metadata using the annotation scheme [14]. The
granularity at which provenance can be stored is flexible:
it can attach to any resource identifiable by an LSID [18].
The myGrid Information Repository (mIR) data service
is a central repository built over a relational database to
store metadata about experimental components [18]. A
number of ways are available for knowledge discovery
using provenance. For instance, the semantic
provenance information available as RDF can be viewed
as a labeled graph using the Haystack semantic web
browser [18]. COHSE (Conceptual Open Hypermedia
Services Environment), a semantic hyperlink utility, is
another tool used to build a semantic web of
provenance. Here, semantically annotated provenance
logs are interlinked using an ontology reasoning service
and displayed as a hyperlinked web page. Provenance
information generated during the execution of a
workflow can also trigger the rerun of another workflow
whose input data parameters it may have updated.
3.3 CMCS
The CMCS project is an informatics toolkit for
collaboration and metadata-based data management for
multi-scale science [24, 25]. CMCS manages
heterogeneous data flows and metadata across
multidisciplinary sciences such as combustion research,
supplemented by provenance metadata for establishing
the pedigree of data. CMCS uses the Scientific
Annotation Middleware (SAM) repository for storing
URL referenceable files and collections [25].
CMCS uses an annotation scheme to associate XML
metadata properties with the files in SAM and manages
them through a Distributed Authoring and Versioning
(WebDAV) interface. Files form the level of granularity
and all resources such as data objects, processes, web
services, and bibliographic records are modeled as files.
Dublin Core (DC) verbs like Has Reference, Issued, and
Is Version Of are used as XML properties for data files
and semantically relate them to their deriving processes
through XLink references in SAM [24]. DC elements
like Title and Creator, and user-defined metadata can
provide additional context information. Heterogeneous
metadata schemas are supported by mapping them to
standard DC metadata terms using XSLT translators.
Direct association of provenance metadata with the data
object makes this a data-oriented model.
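The flavor of this annotation scheme can be sketched with Python's standard XML library; the property values are invented, and real CMCS properties are managed through SAM's WebDAV interface rather than assembled by hand.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/terms/"
ET.register_namespace("dcterms", DC)

# Invented Dublin Core properties for one data file's metadata record.
props = ET.Element("properties")
for term, value in [
    ("title", "Flame speed dataset"),       # descriptive context
    ("creator", "combustion-group"),        # attribution
    ("isVersionOf", "flame_speed_v1.xml"),  # link to an ancestor file
    ("references", "solver_run_0042"),      # link to the deriving process
]:
    ET.SubElement(props, f"{{{DC}}}{term}").text = value

print(ET.tostring(props, encoding="unicode"))
```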
There is no facility for automated collection of lineage
from a workflow’s execution. Data files and their
metadata are populated by DAV-aware applications in
workflows or manually entered by scientists through a
portal interface [25].

Table 1 Summary of characteristics of surveyed data provenance techniques

| | Chimera | myGrid | CMCS | ESSW | Trio |
| --- | --- | --- | --- | --- | --- |
| Applied Domain | Physics, Astronomy | Biology | Chemical Sciences | Earth Sciences | None |
| Workflow Type | Script Based | Service Oriented | Service Oriented | Script Based | Database Query |
| Use of Provenance | Informational; Audit; Data Replication | Context Information; Re-enactment | Informational; Update Data | Informational | Informational; Update Propagation |
| Subject | Process | Process | Data | Both | Data |
| Granularity | Abstract datasets (presently files) | Abstract resources having LSID | Files | Files | Tuples in Database |
| Representation Scheme | Virtual Data Language Annotations | XML/RDF Annotations | Dublin Core XML Annotations | XML/RDF Annotations | Query Inversion |
| Semantic Info. | No | Yes | Yes | Proposed | No |
| Storage Repository/Backend | Virtual Data Catalog / Relational DB | mIR repository / Relational DB | SAM over DAV / Relational DB | Lineage Server / Relational DB | Relational DB |
| User Overhead | User defines derivations; automated WF trace | User defines service semantics; automated WF trace | Manual: apps use DAV APIs; users use portal | Use libraries to generate provenance | Inverse queries automatically generated |
| Scalability Addressed | Yes | No | No | Proposed | No |
| Dissemination | Queries | Semantic browser; Lineage graph | Browser; Queries; GXL/RDF | Browser | SQL/TriQL Queries |

Provenance metadata properties
can be queried from SAM using generic WebDAV
clients. Special portlets allow users to traverse the
provenance metadata for a resource as a web page with
hyperlinks to related data, or as a labeled graph
represented in the Graphics eXchange Language (GXL).
The provenance information can also be exported to
RDF that semantic agents can use to infer relationships
between resources. Provenance metadata that indicate
data modification can generate notifications that trigger
workflow execution to update dependent data products.
3.4 ESSW
The Earth System Science Workbench (ESSW) [28] is a
metadata management and data storage system for earth
science researchers. Lineage is a key facet of the
metadata created in the workbench, and is used for
detecting errors in derived data products and in
determining the quality of datasets.
ESSW uses a scripting model for data processing, i.e., all
data manipulation is done through scripts that wrap
existing scientific applications [26]. The sequence of
invocation of these scripts by a master workflow script
forms a DAG. Data products at the granularity of files
are consumed and produced by the scripts, with each
data product and script having a uniquely labeled
metadata object. As the workflow script invokes
individual scripts, these scripts, as part of their
execution, compose XML metadata for themselves and
the data products they generate. The workflow script
links the data flow between successive scripts using
their metadata ids to form the lineage trace for all data
products, represented as annotations. By chaining the
scripts and the data using parent-child links, ESSW is
balanced between data- and process-oriented lineage.
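In outline, and with an invented helper standing in for ESSW's actual templates and libraries, the chaining works as follows:

```python
import uuid

metadata_store = {}  # ESSW keeps metadata as files plus a lineage database

def register(kind, name, parents=()):
    """Create a uniquely labeled metadata object with parent-child links."""
    mid = str(uuid.uuid4())
    metadata_store[mid] = {"kind": kind, "name": name,
                           "parents": list(parents)}
    return mid

# A master workflow script chaining two wrapped scripts into a DAG
# (all script and file names are invented):
raw = register("data", "sst_raw.hdf")
calib = register("script", "calibrate.pl", parents=[raw])
cal = register("data", "sst_cal.hdf", parents=[calib])
comp = register("script", "composite.pl", parents=[cal])
product = register("data", "sst_weekly.hdf", parents=[comp])
print(metadata_store[product])
```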
ESSW puts the onus on the script writer to record the
metadata and lineage using templates and libraries that
are provided. The libraries store metadata objects as
files in a web-accessible location and the lineage
separately in a relational database [26]. Scalability is not
currently addressed, though it is proposed to federate
lineage across organizations. The metadata and lineage
information can be navigated as a workflow DAG
through a web browser that uses PHP scripts to access
the lineage database [28]. Future work includes
encoding lineage information semantically as RDF
triples to help answer richer queries [26].
3.5 Trio
Cui and Widom [22, 29] trace lineage information for
view data in data warehouses. The Trio project [23]
leverages some of this work in a proposed database
system which has data accuracy and data lineage as
inherent components. While data warehouse mining and
update propagation motivate lineage tracking in this
project, any e-science system that uses database queries and
functions to model workflows and data transformations
can apply such techniques.
A database view can be modeled as a query tree that is
evaluated bottom-up, starting with leaf operators having
tables as inputs and successive parent operators taking
as input the result of a child operator [22]. For ASPJ
(Aggregate-Select-Project-Join operator) views, it is
possible to create an inverse query of the view query
that operates on the materialized view, and recursively
moves down the query tree to identify the source tables
in the leaves that form the view data's lineage [22].
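The inversion idea for a simple aggregate view can be demonstrated with Python and SQLite; the schema below is invented, and Trio itself records such inverse queries per view tuple in a dedicated Lineage table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp(name TEXT, dept TEXT, sal INT);
    INSERT INTO emp VALUES ('ann','eng',100),('bob','eng',120),('cat','ops',90);
    CREATE VIEW dept_totals AS
        SELECT dept, SUM(sal) AS total FROM emp GROUP BY dept;
""")

def lineage(dept):
    """Inverse query for one view tuple: the source rows that derived it."""
    return conn.execute(
        "SELECT name, dept, sal FROM emp WHERE dept = ?", (dept,)).fetchall()

for dept, total in conn.execute("SELECT dept, total FROM dept_totals"):
    print((dept, total), "<-", lineage(dept))
# ('eng', 220) <- [('ann', 'eng', 100), ('bob', 'eng', 120)]
# ('ops', 90) <- [('cat', 'ops', 90)]
```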
Trio [23] uses this inversion model to automatically
determine the source data for tuples created by view
queries. The inverse queries are recorded at the
granularity of a view tuple and stored in a special
Lineage table. This direct association of lineage with
tuples makes this a data-oriented provenance scheme.
Mechanisms to handle (non-view) tuples created by
insert and update queries, or through user-defined
functions, are yet to be determined. Lineage in Trio is
simply the source tuples and the view query that created
the view tuple, with no semantic metadata recorded.
Scalability is not specifically addressed either. Other
than querying the Lineage table, some special purpose
constructs will be provided for retrieving lineage
information through a Trio Query Language (TriQL).
4. Conclusion
In this paper, we presented a taxonomy to understand
and compare provenance techniques used in e-science
projects. The exercise shows that provenance is still an
exploratory field and exposes several open research
questions. Ways to federate provenance information and
assert its truthfulness need further study before
provenance can be usable across organizations [12].
service interface standards to manage provenance in
diverse domains will also contribute to a wider adoption
of provenance and promote its sharing [11]. The ability
to seamlessly represent provenance of data derived from
both workflows and databases can help in its portability.
Ways to store provenance about missing or deleted data
(phantom lineage [23]) require further consideration.
Finally, a deeper understanding of provenance is needed
to identify novel ways to leverage it to its full potential.
5. References
[1] J. Brase, "Using Digital Library Techniques - Registration
of Scientific Primary Data," in ECDL, 2004.
[2] D. G. Clarke and D. M. Clark, "Lineage," in Elements of
Spatial Data Quality, 1995.
[3] J. L. Romeu, "Data Quality and Pedigree," in Material
Ease, 1999.
[4] H. V. Jagadish and F. Olken, "Database Management for
Life Sciences Research," in SIGMOD Record, vol. 33, 2004.
[5] "Access to genetic resources and Benefit-Sharing (ABS)
Program," United Nations University, 2003.
[6] P. Buneman, S. Khanna, and W. C. Tan, "Why and Where:
A Characterization of Data Provenance," in ICDT, 2001.
[7] D. P. Lanter, "Design of a Lineage-Based Meta-Data Base
for GIS," in Cartography and Geographic Information
Systems, vol. 18, 1991.
[8] M. Greenwood, C. Goble, R. Stevens, J. Zhao, M. Addis,
D. Marvin, L. Moreau, and T. Oinn, "Provenance of e-Science
Experiments - experience from Bioinformatics," in
Proceedings of the UK OST e-Science 2nd AHM, 2003.
[9] Y. L. Simmhan, B. Plale, and D. Gannon, "A Survey of
Data Provenance Techniques," in Technical Report TR-618:
Computer Science Department, Indiana University, 2005.
[10] R. Bose and J. Frew, "Lineage retrieval for scientific data
processing: a survey," in ACM Comput. Surv., vol. 37, 2005.
[11] S. Miles, P. Groth, M. Branco, and L. Moreau, "The
requirements of recording and using provenance in e-Science
experiments," in Technical Report, Electronics and Computer
Science, University of Southampton, 2005.
[12] D. Pearson, "Presentation on Grid Data Requirements
Scoping Metadata & Provenance," in Workshop on Data
Derivation and Provenance, Chicago, 2002.
[13] G. Cameron, "Provenance and Pragmatics," in Workshop
on Data Provenance and Annotation, Edinburgh, 2003.
[14] C. Goble, "Position Statement: Musings on Provenance,
Workflow and (Semantic Web) Annotations for
Bioinformatics," in Workshop on Data Derivation and
Provenance, Chicago, 2002.
[15] P. P. da Silva, D. L. McGuinness, and R. McCool,
"Knowledge Provenance Infrastructure," in IEEE Data
Engineering Bulletin, vol. 26, 2003.
[16] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-
A. Saita, "Improving Data Cleaning Quality Using a Data
Lineage Facility," in DMDW, 2001.
[17] I. T. Foster, J. S. Vöckler, M. Wilde, and Y. Zhao, "The
Virtual Data Grid: A New Model and Architecture for Data-
Intensive Collaboration," in CIDR, 2003.
[18] J. Zhao, C. A. Goble, R. Stevens, and S. Bechhofer,
"Semantically Linking and Browsing Provenance Logs for E-
science," in ICSNW, 2004.
[19] A. Woodruff and M. Stonebraker, "Supporting Fine-
grained Data Lineage in a Database Visualization
Environment," in ICDE, 1997.
[20] B. Plale, D. Gannon, D. Reed, S. Graves, K.
Droegemeier, B. Wilhelmson, and M. Ramamurthy, "Towards
Dynamically Adaptive Weather Analysis and Forecasting in
LEAD," in ICCS workshop on Dynamic Data Driven
Applications, 2005.
[21] D. Bhagwat, L. Chiticariu, W. C. Tan, and G.
Vijayvargiya, "An Annotation Management System for
Relational Databases," in VLDB, 2004.
[22] Y. Cui and J. Widom, "Practical Lineage Tracing in Data
Warehouses," in ICDE, 2000.
[23] J. Widom, "Trio: A System for Integrated Management of
Data, Accuracy, and Lineage," in CIDR, 2005.
[24] C. Pancerella, J. Hewson, W. Koegler, D. Leahy, M. Lee,
L. Rahn, C. Yang, J. D. Myers, B. Didier, R. McCoy, K.
Schuchardt, E. Stephan, T. Windus, K. Amin, S. Bittner, C.
Lansing, M. Minkoff, S. Nijsure, G. v. Laszewski, R. Pinzon,
B. Ruscic, Al Wagner, B. Wang, W. Pitz, Y. L. Ho, D.
Montoya, L. Xu, T. C. Allison, W. H. Green, Jr, and M.
Frenklach, "Metadata in the collaboratory for multi-scale
chemical science," in Dublin Core Conference, 2003.
[25] J. Myers, C. Pancerella, C. Lansing, K. Schuchardt, and
B. Didier, "Multi-Scale Science, Supporting Emerging
Practice with Semantically Derived Provenance," in ISWC
workshop on Semantic Web Technologies for Searching and
Retrieving Scientific Data, 2003.
[26] R. Bose and J. Frew, "Composing Lineage Metadata with
XML for Custom Satellite-Derived Data Products," in
SSDBM, 2004.
[27] I. T. Foster, J.-S. Vöckler, M. Wilde, and Y. Zhao,
"Chimera: A Virtual Data System for Representing, Querying,
and Automating Data Derivation," in SSDBM, 2002.
[28] J. Frew and R. Bose, "Earth System Science Workbench:
A Data Management Infrastructure for Earth Science
Products," in SSDBM, 2001.
[29] Y. Cui and J. Widom, "Lineage tracing for general data
warehouse transformations," in VLDB Journal, vol. 12, 2003.
... It explains how data evolves from process to another. Many researchers used provenance in the field of information technology, which is defined as the process of recording the history of origin, evolution, process activities and manipulation of data over time [30,128,156]. Researchers introduce data provenance as a solution to provide a number of required features to ensure trustworthiness, integrity, data quality, confidentiality, availability and other security requirements. ...
... Provenance, also referred to as pedigree, or genealogy, is a form of metadata that documents the origin and utilization of a given entity [31,56,128]. In the field of information technology, provenance considers data as the counterpart or reflection of an art object. ...
Preprint
Full-text available
The Internet of Things (IoT) relies on resource-constrained devices deployed in unprotected environments. Different types of risks may be faced during data transmission in single-hop and multi-hop scenarios. Addressing these vulnerabilities is crucial. A systematic literature review of data provenance in IoT is presented, exploring existing techniques, practical implementations, security requirements, and performance metrics. Respective contributions and shortcomings are compared. A taxonomy related to the development of data provenance in IoT is proposed. Open issues are identified, and future research directions are presented, providing useful insights for the evolution of data provenance research in the context of the IoT.
... Some of the literature where provenance is used as a search mechanism is provided below. Provenance data has a wide range of uses in computer science, such as reproducibility, debuggine, optimiztion, and compliance Ko et al. (2011) The research work in Provenance (n.d.), Chen et al. (2017), Imran et al. (2012b), Simmhan et al. (2005), and Tsai et al. (2007) presented the idea of using provenance as a search criterion in computational systems. They argued the use of provenance as first-class cloud data and used it for enhanced search capabilities. ...
... Summary of studies on provenance-based search mechanisms.ReferenceFocus Methodology Key findings Provenance (n.d.),Simmhan et al. (2005), and Tsai et al. ...
Article
Full-text available
Cloud computing revolutionizes data management by offering centralized repositories or services accessible over the Internet. These services, hosted by a single provider or distributed across multiple entities, facilitate seamless access for users and applications. Additionally, cloud technology enables federated search capabilities, allowing organizations to amalgamate data from diverse sources and perform comprehensive searches. However, such integration often leads to challenges in data quality and duplication due to structural disparities among datasets, including variations in metadata. This research presents a novel provenance‐based search model designed to enhance data quality within cloud environments. The model expands the traditional concept of a single canonical URL by incorporating provenance data, thus providing users with diverse search options. Leveraging this model, the study conducts inferential analyses to improve data accuracy and identify duplicate entries effectively. To verify the proposed model, two research paper datasets from Kaggle and DBLP repositories are utilized, and the model effectively identifies duplicates, even with partial queries. Tests demonstrate the system's ability to remove duplicates based on title or author, in both single and distributed dataset scenarios. Traditional search engines struggle with duplicate content, resulting in biased results or inefficient crawling. In contrast, this research uses provenance data to improve search capabilities, overcoming these limitations.
... Data provenance is metadata describing the origin, history and changes of data [10], which can provide insight into a processing workflow and pinpoint to errors and problems. While data provenance is commonly used in many disciplines, it is less frequently implemented in biomedical research [11,12]. ...
... Verifiable computation has been extensively studied within the context of many applications [5], [30]- [48], [48]- [51]. In particular, the use of verifiable computation in cloud computing [47], [49], [50], [52] and Internet of Things [48] are active fields of research. ...
Article
Full-text available
The rapid increase in data generation has led to outsourcing computation to cloud service providers, allowing clients to handle large tasks without investing resources. However, this brings up security concerns, and while there are solutions like fully homomorphic encryption and specific task-oriented methods, challenges in optimizing performance and enhancing security models remain for widespread industry adoption. Outsourcing computations to an untrusted remote computer can be risky, but attestation techniques and verifiable computation schemes aim to ensure the correct execution of outsourced computations. Nevertheless, the latter approach incurs significant overhead in generating a proof for the client. To minimize this overhead, the concept of a Correct Execution Environment (CEE) has been proposed (CEEv1), which omits proof generation for trusted parts of the prover. This paper proposes a new hardware-based CEE (CEEv2) that supports virtual memory and uses an inverted page table mechanism to detect, or prevent, illegal modifications to page mappings. The proposed mechanism supports virtual memory and thwarts virtual-to-physical mapping attacks, while minimizing software modifications. The paper also compares the proposed mechanism to other similar mechanisms used in AMD’s SEV-SNP [1] and Intel’s SGX [2].
... A proveniência tem ganhado alta relevância no âmbito científico [9,51,24,28,29,61]. Essa seção tenta responder alguma das questões a respeito da proveniência: O que realmente é a proveniência? ...
Thesis
Full-text available
Recently, many programming languages have become popular in the scientific environment, especially scripting languages, due to their high abstraction level. In this context, many scientists began to model their scientific experiments in scripting languages to ensure more control and efficient data management when seeking for results. If, on one side, scripts allow scientists to optimize and automate their computational experiments, on the other side, the need to modify the script and execute it several times to confirm or refute the experimental hypothesis generates large quantities of data. Provenance plays a key role in helping scientists keep track of all these changes and execution data in this scenario. However, current provenance management tools for scripts fail in two main aspects: (i) they fail in providing an intuitive visualization of prospective provenance, and (ii) they fail in merging prospective and retrospective provenance in a single diagram. To fill in these gaps, this dissertation presents ProspectiveProv, an approach that aims at helping scientists to understand the structure of script experiments. To do so, ProspectiveProv uses prospective and retrospective provenance to generate diagrams that represent the structure of the experiment code as a means to support the process of understanding and analyzing the results of scientific experiments modeled as Python scripts. To evaluate our proposed approach, we compare the diagrams generated by ProspectiveProv with those generated by similar approaches, and also with the isolated use of scripts (no diagrams). The experimental results show that ProspectiveProv is as efficient and effective as the compared approaches when considering complex scripts. Also, users from other knowledge areas (not computer science) felt more comfortable when using diagrams generated by ProspectiveProv.
... In the era of big data, the ability to accurately trace the provenance of data is critical for ensuring the integrity, reliability, and accountability of information (Simmhan et al., 2005;Buneman et al., 2001). Data provenance, the documentation of the origins and the life-cycle of data, is essential for various applications, including the detection of AI-generated content and deepfakes, legal compliance, data ownership protection, and copyright protection. ...
Preprint
Full-text available
Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detecting AI-generated content. A standard approach involves embedding-based retrieval techniques that match query data with entries in a reference dataset. However, this method is not robust against benign and malicious edits. To address this, we propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches. The integration of error control codes (ECC) ensures reliable cluster assignments, enabling the method to perform retrieval on the entire dataset in case the ECC algorithm cannot detect the correct cluster with high confidence. This makes DREW maintain baseline performance, while also providing opportunities for performance improvements due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. Depending on the watermark technique used, DREW can provide substantial improvements in retrieval accuracy (up to 40\% for some datasets and modification types) across multiple datasets and state-of-the-art embedding models (e.g., DinoV2, CLIP), making our method a promising solution for secure and reliable source identification. The code is available at https://github.com/mehrdadsaberi/DREW
... As HIMS evolve, so do the ethical dilemmas they present, necessitating continual assessment and revision of ethical guidelines (Winter, 2013). By involving a broad range of participants, especially patient representatives, these committees ensure that the patient's voice and rights are central to the governance of HIMS, thereby fostering trust and promoting more patient-centered healthcare practices (Simmhan et al., 2005). These comprehensive governance structures are essential for maintaining the integrity of HIMS and ensuring they serve the best interests of patients and healthcare providers. ...
Article
This paper explores the crucial role of privacy and ethics in Health Information Management Systems (HIMS), addressing the inherent challenges and presenting innovative solutions through detailed case studies. We examine the significant privacy concerns, such as data breaches and unauthorized data sharing, alongside ethical issues like informed consent and equitable access to technology. Effective solutions including the Privacy-Preserving Distributed Analytics (PPDA) model and proactive bioethics committees, as demonstrated by institutions like Vanderbilt University Medical Center, illustrate successful strategies for managing these concerns. The paper emphasizes the importance of prioritizing privacy and ethics not merely as compliance requirements but as foundational elements essential to the trustworthiness and effectiveness of HIMS. It advocates for a continuous, proactive approach to address these issues as technology evolves and regulations change. Furthermore, we call for a collaborative effort among policymakers, healthcare providers, technologists, and patients to develop and refine HIMS that uphold the highest standards of privacy, ethics, and accessibility, thus enhancing the quality of care and health outcomes for all stakeholders.
Article
Full-text available
Este artigo destaca a importância da segurança e eficiência de sistemas computacionais nas instituições de ensino e pesquisa devido ao aumento no volume de dados gerados. As interfaces de Programação de Aplicativos (APIs, do inglês Application Programming Interfaces) baseadas em REST (Representation State Transfer) são utilizadas para automatizar a integração de softwares, principalmente na pesquisa científica. A segurança destas APIs é crucial, pois configurações inadequadas podem torná-las vulneráveis a ataques cibernéticos, resultando em vazamento de dados. Neste sentido, o desenvolvimento orientado a testes de software e o uso de UUIDs (Identificadores Únicos Universais) devem ser práticas essenciais a serem consideradas no desenvolvimento de sistemas computacionais nesta área. O propósito do artigo é destacar vantagens do uso de UUIDs (Universally Unique Identifier) nas URLs (Uniform Resource Locator) das APIs como recurso adicional visando reduzir o risco de vazamento acidental ou intencional de informações, de modo a tornar as identificações exclusivas e, desta forma, mais seguras.
Article
Full-text available
Article
Full-text available
In e-Science experiments, it is vital to record the experimental process for later use such as in interpreting results, verifying that the correct process took place or tracing where data came from. The documentation of a process that led to some data is called the provenance of that data, and a provenance architecture is the software architecture for a system that will provide the necessary functionality to record, store and use provenance data. However, there has been little principled analysis of what is actually required of a provenance architecture, so it is impossible to determine the functionality they would ideally support. In this paper, we present use cases for a provenance architecture from current experiments in biology , chemistry, physics and computer science, and analyse the use cases to determine the technical requirements of a generic, application-independent architecture. We propose an architecture that meets these requirements and evaluate a preliminary implementation by attempting to realise one of the use cases.
Article
Full-text available
Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources. The provenance of data products generated by complex transformations such as workflows is of considerable value to scientists. From it, one can ascertain the quality of the data based on its ancestral data and derivations, track back sources of errors, allow automated re-enactment of derivations to update a data, and provide attribution of data sources. Provenance is also essential to the business domain where it can be used to drill down to the source of data in a data warehouse, track the creation of intellectual property, and provide an audit trail for regulatory purposes. In this paper we create a taxonomy of data provenance techniques, and apply the classification to current research efforts in the field. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. Our synthesis can help those building scientific and business metadata-management systems to understand existing provenance system designs. The survey culminates with an identification of open research problems in the field. 1 Introduction The growing number and size of computational and data resources coupled with uniform access mechanisms provided by a common Grid middleware stack is allowing scientists to perform advanced scientific tasks in collaboratory environments. Large collaboratory scientific projects such as the Large Hadron Collider [1] and Sloan Digital Sky Survey (SDSS) [2] generate terabytes of data whose complexity is managed by data grids. This data deluge mandates the need for rich and descriptive metadata to accompany the data in order to understand it and reuse it across partner organizations. Business users too are having to work with data from third-parties and from across the enterprise that are aggregated within a data warehouse. Dash-boarding tools that help analysts with forecasting and trend prediction operate on these data silos and it is essential for these data mining tasks to have metadata describing the data properties [3]. Provenance is one kind of metadata which tracks the steps by which the data was derived and can provide significant value addition in such data intensive scenarios.
Article
Full-text available
The goal of the Collaboratory for the Multi-scale Chemical Sciences (CMCS) [1] is to develop an informatics-based approach to synthesizing multi-scale chemistry information to create knowledge in the chemical sciences. CMCS is using a portal and metadata-aware content store as a base for building a system to support inter-domain knowledge exchange in chemical science. Key aspects of the system include configurable metadata extraction and translation, a core schema for scientific pedigree, and a suite of tools for managing data and metadata and visualizing pedigree relationships between data entries. CMCS metadata is represented using Dublin Core with metadata extensions that are useful to both the chemical science community and the science community in general. CMCS is working with several chemistry groups who are using the system to collaboratively assemble and analyze existing data to derive new chemical knowledge. In this paper we discuss the project's metadata-related requirements, the relevant software infrastructure, core metadata schema, and tools that use the metadata to enhance science.
Article
Full-text available
Scientific progress is becoming increasingly dependent on our ability to study phenomena at multiple scales and from multiple perspectives. The ability to recontextualize third-party data within the semantic and syntactic framework of a given research project is increasingly seen as a primary barrier in multi-scale science. Within the Collaboratory for Multi-Scale Chemical Science (CMCS) project, we are developing a general-purpose, informatics-based approach that emphasizes "on-demand" metadata creation, configurable data translations, and semantic mapping to support the rapidly increasing and continually evolving requirements for managing data, metadata, and data relationships in such projects. A concrete example of this approach is the design of the CMCS provenance subsystem. The concept of provenance varies across communities, and multiple independent applications contribute to and use provenance. In the CMCS project, we have developed generic tools for viewing provenance relationships and for using them to, for example, scope notifications and searches. These tools rely on a configurable concept of provenance defined in terms of other relationships. The result is a very flexible mechanism capable of tracking data provenance across many disciplines and supporting multiple uses of provenance information..
Article
A conceptual design is presented for a lineage meta-data base system that documents data sources and geographic information system (GIS) transformations applied to derive cartographic products. Artificial intelligence techniques of semantic networks are used to organize input-output relationships between map layers, and frames are used for storing lineage attributes characterizing source, intermediate, and product layers. An example indicates that a lineage meta-data base enables GIS users to determine the fitness for use of spatial data sets.
Article
Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information. During the integration process, source data typically undergoes a series of transformations, which may vary from simple algebraic operations or aggregations to complex “data cleansing” procedures. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. We formally define the lineage tracing problem in the presence of general data warehouse transformations, and we present algorithms for lineage tracing in this environment. Our tracing procedures take advantage of known structure or properties of transformations when present, but also work in the absence of such information. Our results can be used as the basis for a lineage tracing tool in a general warehousing setting, and also can guide the design of data warehouses that enable efficient lineage tracing.