The Provenance of Electronic Data

Luc Moreau, Paul Groth, Simon Miles, Javier Vázquez-Salceda, John Ibbotson, Sheng Jiang, Steve Munroe, Omer Rana, Andreas Schreiber, Victor Tan, Laszlo Varga

Journal Article: Moreau, L., Groth, P., Miles, S., Vazquez, J., Ibbotson, J., Jiang, S., Munroe, S., Rana, O., Schreiber, A., Tan, V. and Varga, L. (2008) The Provenance of Electronic Data. Communications of the ACM, 51 (4). pp. 52-58.

Abstract

In the study of fine art, provenance refers to the documented history of some art object. Given that documented history, the object attains an authority that allows scholars to appreciate its importance with respect to other works, whereas, in the absence of such history, the object may be treated with some skepticism. Our IT landscape is evolving as illustrated by applications that are open, composed dynamically, and that discover results and services on the fly. Against this challenging background, it is crucial for users to be able to have confidence in the results produced by such applications. If the provenance of data produced by computer systems could be determined as it can for some works of art, then users, in their daily applications, would be able to interpret and judge the quality of data better. We introduce a provenance lifecycle and advocate an open approach based on two key principles to support a notion of provenance in computer systems: documentation of execution and user-tailored provenance queries.

Source: OAI

December 7, 2007
Introduction
Provenance is already well understood in the study of fine art where it refers to the documented
history of some art object. Given that documented history, the object attains an authority that
allows scholars to understand and appreciate its importance and context relative to other works.
Art objects that do not have a proven history may be treated with some skepticism by those who
study them.
Such an idea, transposed into computer systems, has immediate, practical use: if the provenance
of data produced by computer systems could be determined as it can for some works of
art, then users would be able to understand how documents were assembled, how simulation
results were determined, or how financial analyses were carried out. Thus, to accomplish such
a vision, computer applications need to be transformed into what we term provenance-aware
applications, for which the provenance of data may be retrieved, analyzed and reasoned over.
The Oxford English Dictionary defines provenance as: “(i) the fact of coming from some
particular source or quarter; origin, derivation; (ii) the history or pedigree of a work of art,
manuscript, rare book, etc.; concretely, a record of the ultimate derivation and passage of an
item through its various owners.” Hence, we can regard provenance as the derivation from a
particular source to a specific state of an item. The description of such a derivation may take
different forms, or may emphasize different properties according to interest. For instance, for a
work of art, provenance usually identifies its chain of ownership; alternatively, the actual state
of a painting may be understood better by studying the different restorations it underwent.
The above dictionary definition also identifies two distinct understandings of provenance:
first, as a concept, it denotes the source or derivation of an object; second, more concretely,
it is used to refer to a record of such a derivation. Against this background, a computer-based
representation of provenance is crucial for users to perform analysis and reasoning, and decide
whether they have confidence in electronic data.
In this article, we introduce the provenance lifecycle, summarizing key principles underpinning
existing provenance systems. We then examine an open data model for describing how
applications are executed; in this context, provenance is seen as a user query over such
descriptions. The vision of provenance-aware applications is illustrated over a concrete example in
healthcare management, before we contrast it with existing systems.
Lifecycle of Provenance in Computer Systems
Both the scientific and business communities [FKNT03, Bur00] have adopted the service-oriented
architectural (SOA) style, which allows services to be discovered and composed dynamically.
SOA-based applications become more dynamic and open, but equally have to satisfy
new requirements, both in e-Science and business.
In an ideal world, e-Science end-users would be able to: reproduce their results by replaying
previous computations, understand why two seemingly identical runs with the same inputs
produce different results, and find out which data sets, algorithms, or services were involved in
the derivation of their results.
In business and e-Science, some users, reviewers, auditors, or even regulators have to verify
that the process that led to some result is compliant with specific regulations or methodologies;
they have to prove that results are derived independently of services or databases with given
license restrictions; and, they need to establish that data was captured at source by instruments
that possess some precise technical characteristics.
While some users need to perform such tasks today, they cannot do so, or they can do it
only imperfectly, because the underpinning principles have not been investigated, and systems
have not been designed to support such requirements. A key observation is that electronic
data does not typically contain the necessary historical information that would help end-users,
reviewers, or regulators make the necessary verifications. Hence, there is a need to capture extra
information—which we name process documentation— that describes what actually occurred
at execution time. Process documentation is to electronic data what a record of ownership is
to a work of art. Provenance-aware applications create process documentation and store it in a
provenance store, the role of which is to offer a long-term persistent, secure storage of process
documentation (cf. Figure 1). This logical role accommodates various physical deployments: for
instance, a provenance store can be a single, autonomous service or, to be more scalable, it can
be a federation of distributed stores.
Once process documentation has been recorded, the provenance of data results can be
retrieved by querying the provenance store, and analyzed to suit the user’s needs. Finally, over
time, the provenance store and its contents may need to be managed, maintained, or curated.
Figure 1: Provenance Lifecycle. Labels: Provenance-Aware Application; Electronic data; Record
documentation of execution; Provenance Store; Query and reason over provenance of data;
Administer store and its contents.
In summary, the provenance lifecycle consists of four different phases: (i) creating, (ii)
recording, (iii) querying, and (iv) managing, all of which provenance systems should cater
for.
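As a rough illustration of the four phases, the following Python sketch uses a toy in-memory store. The class name ProvenanceStore and its methods record, query, and purge are hypothetical names chosen for this sketch, not an interface defined in this article; an actual provenance store would offer persistent, secure storage.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceStore:
    """Toy in-memory stand-in for a provenance store (hypothetical API)."""
    _records: list = field(default_factory=list)

    def record(self, asserter, content):
        # Phase (ii): recording process documentation.
        self._records.append({"asserter": asserter, "content": content})

    def query(self, predicate):
        # Phase (iii): querying the recorded documentation.
        return [r for r in self._records if predicate(r)]

    def purge(self, predicate):
        # Phase (iv): managing the store; returns how many records were removed.
        before = len(self._records)
        self._records = [r for r in self._records if not predicate(r)]
        return before - len(self._records)


store = ProvenanceStore()

# Phase (i): the application creates documentation while it executes.
documentation = "received message M1"
store.record("service-A", documentation)
store.record("service-B", "sent message M2")

from_a = store.query(lambda r: r["asserter"] == "service-A")
removed = store.purge(lambda r: r["asserter"] == "service-B")
```

The service names and record contents are placeholders; the point is only that creating, recording, querying, and managing are separate operations.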
An Open Model for Process Documentation
For many applications, process documentation cannot be produced in a single, atomic burst,
but instead its generation must be interleaved continuously with execution. Given this, it is
necessary to distinguish a specific item documenting part of a process from the whole process
documentation. We see the former — referred to as a p-assertion — as an assertion made by
an individual application service involved in the process. Thus, the documentation of a process
consists of a set of p-assertions made by the services involved in the process.
In order to minimize its impact on application performance, documentation needs to be
structured in such a way that it can be constructed and recorded autonomously by services, on
a piecemeal basis. Otherwise, should synchronizations be required between these services to
agree on how and where to document execution, application performance may suffer dramatically.
To satisfy this design requirement, various kinds of p-assertions have been identified,
which we expect applications to adopt in order to document their execution. Figure 2 illustrates
a computational service sending and receiving messages, and creating p-assertions describing
its involvement in such activity.
In SOAs, interactions consist of messages exchanged between services. By capturing all
interactions, one can analyze an execution, verify its validity, or compare it with other
executions. Therefore, process documentation includes interaction p-assertions, where an
interaction p-assertion is a description of the contents of a message by a service that has sent or
received that message.
Figure 2: Kinds of p-assertions made by a computational service (M: message; f: function).
Interaction p-assertions: “I received M1, M4”, “I sent M2, M3”. Relationship p-assertions:
“M3 = f1(M1)”, “M2 = f2(M1, M4)”, “M2 is in reply to M1”. Service state p-assertions:
“I received M1 at time t”, “I used algorithm x.y.z”.
Generally, whether a service returns a result directly or calls other services, the relationship
between its outputs and inputs is not explicitly represented in messages themselves, but can be
understood only by an analysis of the service’s business logic. To promote openness and generality,
we do not make any assumption about the technology used by services to implement their
business logic (such as source code, workflow language, etc.). Instead, we place a requirement
on services to provide some information, in the form of relationship p-assertions: a relationship
p-assertion is a description, asserted by a service, of how it obtained output data sent in an
interaction by applying some function, or algorithm, to input data from other interactions. (In
Fig. 2, output message M3 was obtained by applying function f1 to input M1.)
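To make the two kinds of p-assertions concrete, here is a minimal Python sketch of the service in Fig. 2 documenting its own view of an execution. The class names, field names, and message contents are assumptions made for illustration; the article does not prescribe a concrete schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    asserter: str     # service making the assertion
    role: str         # "sender" or "receiver" (illustrative field)
    message_id: str
    content: str      # placeholder for the documented message contents


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str
    output_message: str
    function: str
    input_messages: Tuple[str, ...]


# Process documentation is a set of p-assertions; each one can be
# created independently of the others.
documentation = {
    InteractionPAssertion("service", "receiver", "M1", "request payload"),
    InteractionPAssertion("service", "receiver", "M4", "helper reply"),
    InteractionPAssertion("service", "sender", "M3", "helper request"),
    InteractionPAssertion("service", "sender", "M2", "final reply"),
    RelationshipPAssertion("service", "M3", "f1", ("M1",)),
    RelationshipPAssertion("service", "M2", "f2", ("M1", "M4")),
}


def inputs_of(message_id):
    """Return the input messages an output message was derived from, if documented."""
    for p in documentation:
        if isinstance(p, RelationshipPAssertion) and p.output_message == message_id:
            return p.input_messages
    return ()
```

With this documentation, inputs_of("M2") yields the two inputs M1 and M4, while inputs_of("M1") is empty because the service did not produce M1.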
With these two kinds of p-assertions, process documentation as a whole is greater than the
sum of its individual parts. Indeed, while p-assertions are simple pieces of documentation that
can be produced by services autonomously, interaction and relationship p-assertions taken together
capture an explicit description of the flow of data in a process: interaction p-assertions
denote data flows between services, whereas relationship p-assertions denote data flows within
services. Such data flows capture the causal and functional data dependencies that occur in
execution and, in the most general case, constitute a directed acyclic graph (DAG). For a specific
data item, the data flow DAG indicates how it is produced and used. The data flow DAG is thus
a core element of provenance representation, but it is not the only one, as we now explain.
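The sketch below shows one possible way to assemble such a data flow DAG from the two kinds of p-assertions, assuming Python 3.9 or later for the standard graphlib module. The tuple layouts, the client and helper services, and the function build_data_flow are hypothetical; a node here is a message as seen at a particular service.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Interaction p-assertions, simplified to (asserter, direction, message).
interactions = [
    ("client", "sent", "M1"), ("service", "received", "M1"),
    ("service", "sent", "M3"), ("helper", "received", "M3"),
    ("helper", "sent", "M4"), ("service", "received", "M4"),
    ("service", "sent", "M2"), ("client", "received", "M2"),
]

# Relationship p-assertions, simplified to (asserter, output, inputs).
relationships = [
    ("service", "M3", ["M1"]),
    ("helper", "M4", ["M3"]),
    ("service", "M2", ["M1", "M4"]),
]


def build_data_flow(interactions, relationships):
    """Return {node: set of predecessor nodes}; a node is (service, message)."""
    preds = {}
    senders = {msg: svc for svc, direction, msg in interactions if direction == "sent"}
    for svc, direction, msg in interactions:
        preds.setdefault((svc, msg), set())
        # Data flow between services: sender's copy precedes receiver's copy.
        if direction == "received" and msg in senders:
            preds[(svc, msg)].add((senders[msg], msg))
    for svc, output, inputs in relationships:
        # Data flow within a service: each input precedes the output.
        for inp in inputs:
            preds.setdefault((svc, output), set()).add((svc, inp))
    return preds


dag = build_data_flow(interactions, relationships)
# static_order() raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(dag).static_order())
```

A successful topological ordering confirms that the documented data flows are acyclic, and the predecessor sets give, for each data item, what it was directly derived from.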
Beyond the flow of data in a process, internal service states may also be necessary to understand
non-functional characteristics of execution, such as performance or accuracy of services,
and therefore the nature of the result they compute. Hence, we define a service state p-assertion
as documentation provided by a service about its internal state in the context of a specific
interaction. Service state p-assertions can be extremely varied: they may include the amount of
disk and CPU time a service used in a computation, its local time when an action occurred, the
floating point precision of the results it produced, or application-specific state descriptions.
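As a small illustration, a service could capture some of this state around a computation as follows. This is a sketch under assumptions: the function run_with_state and the dictionary keys are invented for the example, and only CPU time, local time, and the name of the algorithm are recorded.

```python
import time
from datetime import datetime, timezone


def run_with_state(interaction_id, func, *args):
    """Run func and return (result, service state p-assertion as a dict)."""
    cpu_start = time.process_time()
    result = func(*args)
    state = {
        "interaction": interaction_id,  # the interaction this state refers to
        "local_time": datetime.now(timezone.utc).isoformat(),
        "cpu_seconds": time.process_time() - cpu_start,
        "algorithm": getattr(func, "__name__", "unknown"),
    }
    return result, state


# Document the state of a service while it handles message M1.
result, state = run_with_state("M1", sum, [1, 2, 3])
```

The state is tied to a specific interaction identifier, mirroring the definition above; application-specific fields could be added to the dictionary in the same way.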
In order for provenance-aware applications to be interoperable, it is crucial that the process
documentation they respectively produce be structured according to a shared data model.
Therefore, the novelty of our approach is the openness of the proposed model of documentation
[GJM+06], which is conceived to be independent of application technologies [MGBM07].
Taken together, these characteristics allow process documentation to be produced autonomously
by application services, and be expressed in an open format, over which provenance queries
can be expressed.
Querying the Provenance of Electronic Data
Provenance queries are user-tailored queries over process documentation aimed at obtaining
the provenance of electronic data. In this context, a first challenge is to characterize the data
item that is of interest to the user. Indeed, since data can be mutable, its provenance, i.e. history,
can vary according to the point in execution from which a user wishes to find its provenance. A
provenance query, therefore, needs to identify a data item with respect to a given documented
event (i.e., sending or receiving a message).
The full details of everything that ultimately caused a data item to be as it is could potentially
be very large. For example, the full provenance of an experiment’s results would include
a description of the process that produced the materials used in the experiment, the provenance
of any materials used in producing those materials, the devices and software used in the experiment
and their settings, etc. Ultimately, should documentation be available, we would include
details of processes leading back to the beginning of time, or at least the epoch of provenance
awareness.
Thus, users need to be able to express the scope of interest in a process, by means of a
provenance query. Such a query then essentially performs a reverse graph traversal over the
data flow DAG and terminates according to the query-specified scope; the query output is a
DAG subset. Scoping can be based on types of relationships, intermediary results, services, or
subprocesses [GJM+06].
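A scoped reverse traversal of this kind can be sketched in Python as follows. The graph contents, the relationship type names, and the parameters relationship_types and stop_at are assumptions made for illustration; they stand in for scoping by relationship type and by intermediary result, and are not the query interface of [GJM+06].

```python
from collections import deque

# Data flow DAG stored in reverse: effect -> list of (cause, relationship type).
data_flow = {
    "report": [("analysis", "derivedFrom")],
    "analysis": [("sample", "derivedFrom"), ("settings", "configuredBy")],
    "sample": [("raw_material", "derivedFrom")],
    "settings": [],
    "raw_material": [],
}


def provenance_query(graph, item, relationship_types=None, stop_at=()):
    """Traverse backwards from item; return the in-scope (cause, relation, effect) edges."""
    edges = set()
    seen = {item}
    queue = deque([item])
    while queue:
        node = queue.popleft()
        if node in stop_at:
            continue  # scope: do not look behind this intermediary result
        for cause, relation in graph.get(node, []):
            if relationship_types is not None and relation not in relationship_types:
                continue  # scope: ignore relationship types of no interest
            edges.add((cause, relation, node))
            if cause not in seen:
                seen.add(cause)
                queue.append(cause)
    return edges


full = provenance_query(data_flow, "report")
scoped = provenance_query(
    data_flow, "report", relationship_types={"derivedFrom"}, stop_at={"sample"}
)
```

Here the unscoped query returns all four edges leading to the report, whereas the scoped query keeps only the derivation edges and stops at the sample, so the result is a subset of the DAG.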
Example: Provenance in Healthcare Management
In order to illustrate the proposed approach, we consider a healthcare management application.
The Organ Transplant Management (OTM) system manages all the activities pertaining to organ
transplants across multiple Catalan hospitals and their regulatory authority [AVSK+06].
OTM consists of a complex process, involving the surgery itself, but also a wide range of other
activities, such as data collection and patient organ analysis, which all have to comply with
a set of regulatory rules. Currently, OTM is supported by an IT infrastructure that maintains
records allowing medical personnel to view (and edit) a given patient’s local file within a given
institution or laboratory. However, the system does not connect records, nor capture dependencies
between them. It does not allow external auditors or patients’ families to analyze or
understand how decisions are reached.