Provenance-based Trust for Grid Computing
— Position Paper —
Luc Moreau1, Syd Chapman2, Andreas Schreiber3, Rolf Hempel3, Omer
Rana4, Laszlo Varga5, Ulises Cortes6, Steven Willmott6
1 School of Electronics and Computer Science, University of Southampton,
Southampton, SO17 1BJ, United Kingdom
2 IBM United Kingdom Ltd, Hursley Park, Winchester, SO21 2JN, United Kingdom
3 German Aerospace Center (DLR), Linder Hoehe, 51147 Koln, Germany
4 School of Computer Science, University of Wales, 5 the Parade, CF24 3XF, Cardiff,
United Kingdom
5 MTA SZTAKI, Kende u. 13–17, 1111 Budapest, Hungary
6 Software Department (LSI-AI), Universitat Politecnica de Catalunya, 20 Jordi
Girona, 08034 Barcelona, Spain
l.moreau@ecs.soton.ac.uk, syd chapman@uk.ibm.com,
Andreas.Schreiber@dlr.de, Rolf.Hempel@dlr.de, O.F.Rana@cs.cf.ac.uk,
laszlo.varga@sztaki.hu, ia@lsi.upc.es, steve@lsi.upc.es
Abstract. Current evolutions of Internet technology such as Web Ser-
vices, ebXML, peer-to-peer and Grid computing all point to the develop-
ment of large-scale open networks of diverse computing systems interact-
ing with one another to perform tasks. Grid systems (and Web Services)
are exemplary in this respect and are perhaps some of the first large-
scale open computing systems to see widespread use, making them an
important testing ground for problems in trust management which are
likely to arise. From this perspective, today’s grid architectures suffer
from limitations, such as lack of a mechanism to trace results and lack of
infrastructure to build up trust networks. These are important concerns
in open grids, in which “community resources” are owned and managed
by multiple stakeholders, and are dynamically organised in virtual or-
ganisations. Provenance enables users to trace how a particular result
has been arrived at by identifying the individual services and the aggre-
gation of services that produced such a particular output. Against this
background, we present a research agenda to design, conceive and imple-
ment an industrial-strength open provenance architecture for grid sys-
tems. We motivate its use with three complex grid applications, namely
aerospace engineering, organ transplant management and bioinformat-
ics. Industrial-strength provenance support includes a scalable and secure
architecture, an open proposal for standardising the protocols and data
structures, a set of tools for configuring and using the provenance archi-
tecture, an open source reference implementation, and a deployment and
validation in an industrial context. The provision of such facilities will
enrich grid capabilities with new functionality required for solving complex
problems, such as provenance data that provides complete audit trails of
process execution and supports third-party analysis and auditing. As a
result, we anticipate that a larger uptake of grid technology is likely to
occur, since unprecedented possibilities will be offered to users and will
give them a competitive edge.
1 Introduction
The Grid [10] is a very large scale computer system which is capable of coordinat-
ing resources that are not subject to centralised control, whilst using standard,
open, general-purpose protocols and interfaces, and delivering non-trivial quali-
ties of service [9]. Grids are therefore likely to become one of the largest classes
of open computing systems and provide us with a well advanced architectural
basis for assessing the type of trust management problems which might arise.
As part of the endeavour to define the Grid, a service-oriented approach has
been adopted, by which computational resources, storage resources, networks,
programs, libraries and databases are all represented by services [11]. In this
context, a service is a network-enabled entity capable of encapsulating diverse
implementations behind a common interface. A service-oriented view is powerful
since it allows the composition of services to form more sophisticated services.
The service-oriented Open Grid Service Architecture (OGSA) defines a Grid Ser-
vice as a Web Service [3] that follows specific conventions and provides a set of
well-defined interfaces [11].
In the “Anatomy of the Grid”, Foster, Kesselman and Tuecke describe the
problem underlying the Grid concept as coordinated resource sharing and prob-
lem solving in dynamic, multi-institutional virtual organisations [12]. While the
underpinning mechanisms for creating and managing such virtual organisations
still remain to be understood, effort is required to allow users to place their trust
in the data produced by such compositions. Understanding how a given service
is likely to modify data flowing into it, and how this data has been generated, is
crucial as illustrated by the following generic question:
In an open grid environment, let us consider a set of services that decide
to form a virtual organisation with the aim to produce a given result; how
can we determine the process that generated the result, especially after
the virtual organisation has been disbanded?
Against this background, provenance is an annotation able to explain how a
particular result has been derived; such provenance information can be used to
better identify the process that was used to reach a particular conclusion.
Provenance is therefore important to enable a user to trace how a particular
result has been arrived at, and the sequence of steps that are involved. Specif-
ically, we consider the notion of execution provenance, which identifies
what data is passed between services, what services are available, and how re-
sults are eventually generated for particular sets of input values. Using execution
provenance, a user can trace the “process” that led to the aggregation of services
producing a particular output.
It is our belief that provenance support should be part of a grid infrastruc-
ture, so that users can put their trust into such a new paradigm. The purpose
of this position paper is to present a research agenda for trust-based provenance.
This paper is organised as follows. Section 2 reviews background work on
provenance. Section 3 presents the desired characteristics of a provenance
architecture that would allow users to trust a grid computing environment.
Section 4 discusses three grid applications that would benefit from provenance
and trust. Section 5 explains why provenance offers a good approach for
establishing trust in open environments, and Section 6 concludes the paper.
2 Background on Provenance
The vision of the Grid as an open environment in which collaborations are dy-
namically negotiated and organised with the goal to produce some specific results
has inevitably resulted in the concern that users need to be able to trust results
produced by such computations. This motivated two workshops on provenance
[19, 20]. The idea of providing provenance is relatively new and unexplored. So
far, work on provenance has mainly identified uses, properties and requirements
of provenance in multiple application domains.
In modern information systems, data can be collated from a variety of dis-
tributed and diverse resources and processed to form new data. We can view
the sequence of taking a dataset, processing it and producing a new dataset as
a dataset transformation. In order to provide provenance, all datasets and their
transformations must be recorded. Saltz [23] suggests that we can achieve a sound
lineage record by recording enough information to ensure that any dataset trans-
formation is reproducible. Goble also presents some notable uses of provenance:
reliability and quality; justification and audit; re-usability and reproducibility;
change and evolution [15]. The storage and maintenance of provenance records
is an important consideration. Frew and Bose [13] propose the following re-
quirements for provenance collection. (i) A standard lineage representation is
required so data lineage can be communicated reliably between systems (cur-
rently there is no standard lineage format). (ii) Automated lineage recording
is essential since humans are unlikely to record all the necessary information
manually. (iii) Unobtrusive information collecting is desirable so that current
working practices are not disrupted.
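As an illustration only, the notion of a dataset transformation and the three requirements above can be sketched in a few lines of Python. Everything named here (TransformationRecord, run_and_record, reproduces, the JSON layout) is a hypothetical stand-in and not a format proposed by the cited authors; in particular, the JSON form merely takes the place of the standard lineage representation that does not yet exist.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def digest(data: bytes) -> str:
    """Content hash identifying a dataset independently of where it is stored."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class TransformationRecord:
    """One dataset transformation: inputs, the process applied, and the output."""
    input_digests: tuple
    process: str       # hypothetical identifier of the program or service
    parameters: tuple  # sorted (name, value) pairs needed to rerun the process
    output_digest: str

    def to_json(self) -> str:
        # Plain JSON stands in for a (not yet standardised) lineage format.
        return json.dumps(asdict(self), sort_keys=True)


def run_and_record(process_name, func, inputs, **parameters):
    """Apply func and record the transformation without user intervention."""
    output = func(*inputs, **parameters)
    record = TransformationRecord(
        input_digests=tuple(digest(i) for i in inputs),
        process=process_name,
        parameters=tuple(sorted(parameters.items())),
        output_digest=digest(output),
    )
    return output, record


def reproduces(record, func, inputs) -> bool:
    """Rerun the transformation and check that it yields the recorded output."""
    if tuple(digest(i) for i in inputs) != record.input_digests:
        return False
    output = func(*inputs, **dict(record.parameters))
    return digest(output) == record.output_digest


def concat_upper(a: bytes, b: bytes, repeat: int = 1) -> bytes:
    """Toy processing step used only to exercise the recorder."""
    return (a + b).upper() * repeat


raw = (b"acgt", b"ttga")
result, rec = run_and_record("concat_upper", concat_upper, raw, repeat=2)
```

Because the record is produced by the wrapper and not by the user, recording is automated and unobtrusive; reproducibility is then checked by calling reproduces(rec, concat_upper, raw).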
In the context of databases, the provenance of a specific piece of data identi-
fies parts of the database that contributed to it. Buneman et al. [4, 5] distinguish
between “why” provenance (the set of tuples that contribute to the result) and
“where” provenance (the location(s) in the source database from which the re-
sult was extracted). They formulate a precise definition of provenance using a
general data model that applies to both relational databases and hierarchical
data structures such as XML.
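The distinction can be made concrete with a toy query. The sketch below follows the informal descriptions given above and not the precise definitions of [4, 5]; the relation, its row identifiers and the function name are invented for the example.

```python
from collections import defaultdict

# Hypothetical source relation: row identifier -> named fields.
EMPLOYEES = {
    "r1": {"name": "Ada", "dept": "eng", "salary": 70},
    "r2": {"name": "Bob", "dept": "eng", "salary": 40},
    "r3": {"name": "Cy", "dept": "ops", "salary": 60},
    "r4": {"name": "Dee", "dept": "eng", "salary": 90},
}


def depts_with_high_earners(table, threshold):
    """Evaluate SELECT DISTINCT dept WHERE salary > threshold with provenance.

    Returns two mappings keyed by result value:
    - why:   the source rows that contributed to the value appearing
    - where: the (row, field) locations the value was copied from
    """
    why = defaultdict(set)
    where = defaultdict(set)
    for row_id, row in table.items():
        if row["salary"] > threshold:
            value = row["dept"]
            why[value].add(row_id)              # the whole tuple contributed
            where[value].add((row_id, "dept"))  # only this field was copied
    return dict(why), dict(where)


why, where = depts_with_high_earners(EMPLOYEES, 50)
```

Here the result value "eng" has rows r1 and r4 as its "why" provenance, since their salary fields caused them to be selected, whereas its "where" provenance points only at the dept fields of those rows.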
Some authors are starting to investigate architectural support for provenance.
The Chimera Virtual Data System [8] comprises a virtual data catalogue, for
representing data derivation procedures and derived data, with a virtual data
language interpreter that translates user requests into data definition and query
operations on the database. The explicit representation of data derivation pro-
vides a documentation of data provenance, which can be used to audit and trace
the lineage of derived data produced by computation.
Szomszor and Moreau [24] propose a provenance recording capability for
service-oriented architectures such as the Grid and Web Services. They offer a
Web Service to record provenance information and to view and retrieve prove-
nance. In particular, they provide a provenance-based result-validation mech-
anism by which provenance is used to determine whether previously computed
results are still up to date.
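One plausible reading of such a validation check is sketched below; it is not the mechanism of [24], whose details are not given here. We assume, hypothetically, that each recorded invocation stores the service version and a digest of every input, and that a result is considered up to date when neither has changed.

```python
import hashlib
from dataclasses import dataclass


def digest(data: bytes) -> str:
    """Content hash of an input as it was (or is now) presented to a service."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class InvocationRecord:
    """Provenance of one previously computed result (hypothetical layout)."""
    service: str
    service_version: str
    input_digests: dict  # input name -> digest at the time of invocation


def is_up_to_date(record, current_inputs, current_versions) -> bool:
    """A result is stale if the service or any of its inputs has changed."""
    if current_versions.get(record.service) != record.service_version:
        return False
    if set(current_inputs) != set(record.input_digests):
        return False
    return all(
        digest(current_inputs[name]) == recorded
        for name, recorded in record.input_digests.items()
    )


record = InvocationRecord(
    service="blast",
    service_version="2.0",
    input_digests={"query": digest(b"ACGT"), "database": digest(b"db-v1")},
)
unchanged = is_up_to_date(
    record, {"query": b"ACGT", "database": b"db-v1"}, {"blast": "2.0"}
)
db_updated = is_up_to_date(
    record, {"query": b"ACGT", "database": b"db-v2"}, {"blast": "2.0"}
)
```

With these values, unchanged is true and db_updated is false, so only the second case would trigger a recomputation.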
None of these approaches regards the problem of provenance generation as a
collaborative activity between multiple parties. The implication is that, in such
systems, provenance data cannot be trusted or audited by third parties since
provenance is only provided by one component (typically a workflow enactor),
without any verification that execution took place in the way reported by that
very component. It is this specific aspect that will be studied in the recently
UK-funded e-Science project PASOA (www.pasoa.org), led by Southampton and
Cardiff, both of which are represented among the authors of this paper. In
PASOA, the focus is on the theoretical underpinning and algorithmic
foundations of provenance generation and reasoning.
In order to advance the state of the art in grid computing, it is essential to
provide industrial strength provenance support when running complex applica-
tions. Industrial strength provenance support includes the following key aspects:
1. A scalable architecture capable of sustaining high volumes of data, complex
workflows with large numbers of services, and a high number of requests for
navigating/reasoning over provenance;
2. A secure architecture relying on industrial standards for security to pro-
mote inter-operability, and ensuring that provenance information is securely
managed;
3. Standardisation of protocols and data structures to promote inter-operability
of components provided by multiple manufacturers or institutions;
4. A set of tools for configuring and using the provenance architecture;
5. Deployment and validation in an industrial context.
By achieving these goals, a provenance infrastructure will become an essential
building block of a Next Generation Grid [1], which will help users to trust the
results delivered by the grid paradigm.
3 A Provenance Architecture
Service composition and orchestration have been identified as key objectives
for the Grid (and Web Services) communities [14]. In particular, workflow en-
gines allow users to identify, choose and compose services based on their own
particular interests. Workflow based computations can be seen as a simplified
(and tractable) form of virtual organisation, scripted explicitly using workflow
languages such as BPEL4WS [7], WSFL [17], or XLANG [25].
A preliminary architecture capable of generating provenance and reasoning
over it is sketched in Figure 1. First, provenance gathering is a collaborative pro-
cess that involves multiple entities, including the workflow enactment engine, the
enactment engine’s client, the service directory, and the invoked services. Prove-
nance data will be submitted to one or more “provenance repositories” acting as
storage for provenance data. Upon user request, some analysis, navigation and
reasoning over provenance data can be undertaken. We foresee here that storage
could be achieved by a provenance service, and that a library, optionally hosted
in the provenance service, would perform the analysis, navigation or reasoning.
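To make the division of roles concrete, the following minimal sketch models a provenance service as an in-memory store to which several parties submit assertions, together with a small navigation routine of the kind the library might offer. The class and field names (Assertion, ProvenanceStore, lineage, asserters) are assumptions made for the example and do not describe an existing interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """A statement by one party that a service turned inputs into an output."""
    asserter: str  # e.g. "client", "enactor", "directory" or a service name
    session: str   # identifier of the workflow enactment
    service: str
    inputs: tuple  # identifiers of the data items consumed
    output: str    # identifier of the data item produced


class ProvenanceStore:
    """In-memory stand-in for a provenance repository."""

    def __init__(self):
        self._assertions = []

    def submit(self, assertion: Assertion) -> None:
        self._assertions.append(assertion)

    def lineage(self, session: str, data_id: str) -> list:
        """Walk backwards from data_id, listing contributing services."""
        services, seen, pending = [], set(), [data_id]
        while pending:
            current = pending.pop()
            if current in seen:
                continue
            seen.add(current)
            for a in self._assertions:
                if a.session == session and a.output == current:
                    if a.service not in services:
                        services.append(a.service)
                    pending.extend(a.inputs)
        return services

    def asserters(self, session: str, output: str) -> set:
        """Which parties have independently reported producing this output?"""
        return {
            a.asserter
            for a in self._assertions
            if a.session == session and a.output == output
        }


store = ProvenanceStore()
store.submit(Assertion("enactor", "s1", "align", ("seqA", "seqB"), "alignment"))
store.submit(Assertion("align", "s1", "align", ("seqA", "seqB"), "alignment"))
store.submit(Assertion("enactor", "s1", "render", ("alignment",), "report"))
trace = store.lineage("s1", "report")
witnesses = store.asserters("s1", "alignment")
```

Here trace lists "render" and then "align", while witnesses shows that both the enactor and the invoked service reported the alignment step, which is the kind of corroboration that a single-source record cannot offer.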
[Figure: a Client, a Workflow enactment engine, a Service Directory and four Invoked services, together with two Provenance Services and a Tool.]
Fig. 1. Provenance Architecture
Coordination is needed between the different entities involved in workflow
enactment. All entities have to agree on the repositories in which provenance
data is to be stored. The current enactment should also be identified in a unique
manner, and this identification should be shared by all entities involved in it, so
that provenance pertaining to a given execution is not stored in the repository
agreed for another execution. As provenance data may quickly become very large,
the level of detail to be recorded may be agreed before a session is started.
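The three points of agreement in this paragraph (repositories, a unique enactment identifier, and the level of detail) can be captured in a small session descriptor, sketched below under our own assumptions. The names ProvenanceSession, Repository and LEVELS, and the three detail levels, are hypothetical; a random UUID is only one way to obtain an identifier that is unique in practice.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical detail levels, ordered from least to most verbose.
LEVELS = ("summary", "messages", "full")


@dataclass(frozen=True)
class ProvenanceSession:
    """What all parties agree on before an enactment starts."""
    repositories: tuple  # names of the agreed provenance repositories
    detail_level: str    # most detailed level that may be recorded
    session_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Repository:
    """Stores records only for sessions that named this repository."""

    def __init__(self, name: str):
        self.name = name
        self._sessions = {}
        self.records = []

    def register(self, session: ProvenanceSession) -> None:
        if self.name not in session.repositories:
            raise ValueError("repository was not agreed for this session")
        self._sessions[session.session_id] = session

    def store(self, session_id: str, level: str, payload: str) -> bool:
        """Keep the record unless it is more detailed than agreed."""
        session = self._sessions.get(session_id)
        if session is None:
            raise KeyError("unknown session for this repository")
        if LEVELS.index(level) > LEVELS.index(session.detail_level):
            return False
        self.records.append((session_id, level, payload))
        return True


repo = Repository("repo-a")
first = ProvenanceSession(repositories=("repo-a",), detail_level="messages")
second = ProvenanceSession(repositories=("repo-b",), detail_level="full")
repo.register(first)
kept = repo.store(first.session_id, "summary", "align invoked")
dropped = repo.store(first.session_id, "full", "raw message bytes")
```

A record submitted under the identifier of second would be rejected by repo with a KeyError, which reflects the requirement that provenance for one execution is not stored in the repository agreed for another.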
In order for provenance data to provide information that is trustworthy, we
expect such a protocol to support some “classical” properties of distributed
algorithms. For instance, using mutual authentication, an invoked service can
ensure that it submits data to a specific provenance server, and vice-versa, a
provenance server can ensure that it receives data from a given service. With
non-repudiation, we can retain evidence of the fact that a service has committed
to executing a particular invocation and has produced a given result. We antici-