Grid Enabling Your Data Resources with OGSA-DAI.
Mario Antonioletti1, Malcolm Atkinson2, Neil P. Chue Hong1, Bartosz
Dobrzelecki1, Alastair C. Hume1, Mike Jackson1, Kostas Karasavvas2, Amy
Krause1, Jennifer M. Schopf2,3, Tom Sugden1, and Elias Theocharopoulos2
1EPCC, University of Edinburgh, JCMB, The King’s Buildings, Mayfield Road,
Edinburgh EH9 3JZ, UK.
2National e-Science Centre, University of Edinburgh & Glasgow, Edinburgh EH8
3Distributed System Laboratory, Argonne National Laboratory, Argonne, IL, 60439
Abstract. OGSA-DAI (Open Grid Services Architecture - Data Ac-
cess and Integration) provides an extensible software framework allow-
ing data resources, such as files, relational and XML databases, to be
exposed through Web services acting within collaborative Grid environ-
ments or, more modestly, in stand-alone mode. OGSA-DAI may be de-
ployed to WSRF-based platforms, such as the Globus Toolkit 4, as well
as non-WSRF based ones, such as the UK OMII Server or standard
versions of Tomcat and Axis. Regardless of the platform, the core
functionality provided remains the same. OGSA-DAI allows data resources
to be accessed and integrated into the main infrastructures presently
being used to construct Grids. OGSA-DAI provides a number of optimi-
sations that reduce unnecessary data movement by shifting work to the
Web service and encapsulating multiple client-Web service interactions
into a single one, and allows for functionality to be added or customised
based on the application. OGSA-DAI is widely used and is available from
www.ogsadai.org.uk. It is also bundled with the OMII-UK and Globus
Toolkit distributions. This paper gives an overview of what OGSA-DAI
is and how it works, presents some usage scenarios, and outlines future
enhancements.
Key words: Data, Databases, Grid, OGSA-DAI
1 Introduction
With current advances in technology and the decreasing cost of storage,
increasingly large amounts of data are being produced, maintained, kept on-line,
and shared within communities. For instance, astronomers are collecting data,
such as surveys of the sky made at different wavelengths and resolutions, and
making it collectively available through Virtual Observatories [1]; biologists
are gathering DNA and genomic data from different species and making these
data available through data stores, providing a rich source of data with which
to pursue insights into biological systems [2]; and in the health sector, digital
medical data are being collected and maintained by hospitals, allowing experts
to collaborate in patient diagnosis and providing case histories that can be used
to inform a prognosis for patients suffering from similar maladies [3, 4].
Access to disparate data sources, often spanning multiple institutions, can
lead to new insights and discoveries. By combining different wavelength data
for the same patch of sky, astronomers have been able to make new discoveries
that would not otherwise have been possible from a single survey [5, 6].
Biologists now have the capability of performing cross-species comparisons to
determine new genes and their function [7]. Doctors can improve the diagnosis
of breast cancer by comparing current mammograms with old mammograms in
combination with the associated patient histories [8]. The advantage of being
able to share data and resources in a controlled manner within a collaborative
environment is clear. The provision of generic middleware to facilitate this
process is the ethos that is currently driving the evolution of the Grid and, in
the data area, OGSA-DAI (Open Grid Services Architecture - Data Access and
Integration) provides software which makes it easy to publish and share data
across organisational boundaries, and to develop applications which use both
public and personal data resources, through a secure, extensible framework
based on web services.
OGSA-DAI is not the only solution currently available for data in the Grid
space. Storage Resource Broker (SRB) [9], developed by the San Diego
Supercomputer Center, provides access to collections of data primarily using
attributes or logical names rather than the data’s physical names or locations.
SRB is primarily file oriented, although it can also work with various other
data object types. OGSA-DAI, on the other hand, takes a database-oriented
approach to its access mechanisms. WebSphere Information Integrator (WSII), a
commercial product from IBM, provides data searching capabilities spanning
organisational boundaries and a means for federating and replicating data, and
allows data transformations and data event publishing to take place [12]. A
more detailed comparison between OGSA-DAI and WSII can be found in [10].
Mobius [11], developed at Ohio State University, provides a set of tools and
services to facilitate the management and sharing of data and metadata in a
Grid environment. To expose XML data in Mobius, the data must be described
using an XML Schema, which is then shared via their Global Model Exchange.
Data can then be accessed by querying the Schema using, for example, XPath.
OGSA-DAI, in contrast, does not require an XML Schema to be created for
each piece of data; rather, it directly exposes that information (data and meta-
data/schema) and relies on the resource’s intrinsic querying mechanisms to query
its data. These three products all provide mechanisms to share data across
organisational boundaries; however, they complement the functionality provided
by OGSA-DAI.
In the remainder of this paper OGSA-DAI will be examined in more detail.
Section 2 gives an overview of the current release of OGSA-DAI explaining the
underlying components and how they operate. Section 3 describes some common
patterns of use for OGSA-DAI, and Section 4 describes some of the future work
planned for the next release. Finally, conclusions are provided in Section 5.
2 An Overview of OGSA-DAI
The first thing to note about OGSA-DAI is that it is not targeted directly at
the end user, but rather it gives service providers the base functionality which
they can use to create their own services and clients to expose data tailored
to their own communities. OGSA-DAI has been made extensible by design so
that any missing functionality can be developed and grafted to work within
the same framework. In addition, different security models may be employed,
static metadata can be exposed via configuration files, and dynamic metadata
can be created and exposed at the service via the use of call back functions.
OGSA-DAI may also be extended to support new types of data resources that
are not already supported by the OGSA-DAI distribution. For example, the
WebDB project has extended OGSA-DAI to cater for RDF-based data [13]. Use
of OGSA-DAI allows service providers to develop and deploy their own Grid
solutions much more quickly and effectively than might otherwise be the case.
OGSA-DAI is tested and operates well with two current Grid fabric providers,
the Globus Toolkit and OMII-UK, and there are plans to port OGSA-DAI
to work with UNICORE and gLite under the OMII-Europe project [14]. This
ensures that, whichever of the above toolkits is used, OGSA-DAI
services will meet user and developer needs in a wide variety of environments.
In OGSA-DAI, data resource and service capabilities are exposed through
activities, the basic unit of work within OGSA-DAI. At the server, an activity
is described by a piece of XML Schema specifying the syntax of an XML fragment
that is used to activate an associated Java implementation class, which performs
the desired task at the server. Different XML activity fragments may be composed
in a perform document, which contains one or more activities linked through a
named set of inputs and outputs describing the data flow between them. For
example, an XPath query activity can wrap an XPath expression which then acts
on an XML database; the results can then be transformed using an XSLT activity;
and finally the transformed results may be delivered to a specified third party
using a delivery activity. Two properties are seen as advantages of using
OGSA-DAI: perform documents encapsulate in a single Web service interaction
what would otherwise require multiple distinct client-service interactions, and
activities provide a framework for moving computation close to the data. More
complex behaviour may be obtained by composing OGSA-DAI services and using
these to provide more sophisticated capabilities, such as Distributed Query
Processing as provided by the OGSA-DQP project [16].
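The way a perform document links activities through named inputs and outputs can be illustrated with a minimal Python sketch. It mirrors the XPath, XSLT and delivery example above, but every element and attribute name in it (perform, xPathStatement, xslTransform, deliverToURL, input, output, from) is a hypothetical placeholder chosen for illustration and is not taken from the OGSA-DAI schemas; the URLs are likewise invented.

```python
import xml.etree.ElementTree as ET


def build_perform_document(xpath_expr, stylesheet_url, delivery_url):
    """Build an illustrative perform document with three chained activities.

    All element and attribute names are hypothetical, not OGSA-DAI's own.
    """
    perform = ET.Element("perform")

    # Activity 1: an XPath query whose result is published as "queryOut".
    query = ET.SubElement(perform, "xPathStatement", name="query")
    ET.SubElement(query, "expression").text = xpath_expr
    ET.SubElement(query, "output", name="queryOut")

    # Activity 2: an XSLT transformation that consumes "queryOut".
    transform = ET.SubElement(perform, "xslTransform", name="transform")
    ET.SubElement(transform, "input", {"from": "queryOut"})
    ET.SubElement(transform, "stylesheet", url=stylesheet_url)
    ET.SubElement(transform, "output", name="transformOut")

    # Activity 3: delivery of the transformed data to a third party.
    deliver = ET.SubElement(perform, "deliverToURL", name="deliver")
    ET.SubElement(deliver, "input", {"from": "transformOut"})
    ET.SubElement(deliver, "destination", url=delivery_url)
    return perform


def data_flow(perform):
    """Return (producer, consumer) activity pairs implied by the stream names."""
    producers = {}
    for activity in perform:
        for out in activity.findall("output"):
            producers[out.get("name")] = activity.get("name")
    edges = []
    for activity in perform:
        for inp in activity.findall("input"):
            edges.append((producers[inp.get("from")], activity.get("name")))
    return edges


doc = build_perform_document(
    "/entries/entry[@id='10']",
    "http://example.org/style.xsl",
    "ftp://example.org/results.xml",
)
print(ET.tostring(doc, encoding="unicode"))
print(data_flow(doc))
```

The point of the sketch is only that the whole three-step pipeline travels to the service as one document, and that the data flow is recoverable from the matching output and input names.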
The OGSA-DAI Client Toolkit (CTk) provides a programmatic interface that
facilitates programming interactions with OGSA-DAI services. The CTk has an
activity representation for each of the server side activities – these representa-
tions are essentially used to produce the XML fragment, within the context of
a perform document, to trigger the corresponding server side activity. The CTk
also provides a programmatic means of composing the client-side activities
together to construct the desired perform document; in this way the user does
not need to deal with any of the underlying XML. In addition, the CTk
handles the interactions with the service and provides methods to add data to
request messages and to extract data from result messages. Moreover, the
CTk is agnostic as to whether a WSRF or non-WSRF service is being accessed,
providing an additional abstraction layer that hides the particular flavour of
OGSA-DAI service that is being contacted. The overall aim of the CTk is to
facilitate the provision of clients to interact with OGSA-DAI services.
Fig. 1. A schematic representation of an OGSA-DAI service.
Putting the above into context, a schematic representation of an OGSA-DAI
service is shown in Figure 1. A client, built using the CTk, sends a perform
document to an OGSA-DAI data service, which in the instance shown has three
types of data resource associated with it. In the WSRF version of OGSA-DAI,
WS-Addressing end point references are used to specify the data resource being
targeted by the client [17]. For the non-WSRF version, the data resource name,
specified at deployment time, is appended to the service URL.
Once a message is accepted by the service interface, the functionality for
both flavours of OGSA-DAI is the same. A perform document and any Grid
credentials are passed through the service layer to the Engine of the targeted
data resource. The Engine coordinates the running of the activities in the per-
form document. A Data Resource Accessor (DRA) wraps the underlying data
resource: this abstraction facilitates the addition of new types of data resources
to OGSA-DAI. The Engine passes any Grid credentials to the DRA and, if these
are valid, the DRA returns an open connection to the data resource that can
then be used by any activity that interacts with the data resource. The DRA
consults a Role Mapper that maps Grid credentials, essentially the distinguished
name, to a database role that can be used to access the database. OGSA-DAI
comes with a basic role mapper that attempts to find, within an XML file, the
database role (represented as a simple username and password pair) matching a
given set of Grid credentials, thus allowing database systems which do not use
Grid credentials to be accessed, albeit not in a scalable fashion. This is
another extensibility point, where service providers may wish to develop their
own role mapper and substitute it for the existing one: two groups have
developed different solutions to this.
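The idea behind the basic role mapper can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the OGSA-DAI implementation: the XML layout (roleMap, entry, role), the distinguished names, the credentials and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical role map file: one entry per Grid distinguished name (DN).
ROLE_MAP_XML = """
<roleMap>
  <entry dn="/C=UK/O=eScience/OU=Edinburgh/CN=alice example">
    <role username="reader" password="reader-secret"/>
  </entry>
  <entry dn="/C=UK/O=eScience/OU=Edinburgh/CN=bob example">
    <role username="curator" password="curator-secret"/>
  </entry>
</roleMap>
"""


class RoleMappingError(Exception):
    """Raised when no database role is configured for a DN."""


def load_role_map(xml_text):
    """Parse the role map into a dict of DN -> (username, password)."""
    root = ET.fromstring(xml_text.strip())
    mapping = {}
    for entry in root.findall("entry"):
        role = entry.find("role")
        mapping[entry.get("dn")] = (role.get("username"), role.get("password"))
    return mapping


def map_role(mapping, distinguished_name):
    """Return the database role for a DN, or refuse access if none exists."""
    try:
        return mapping[distinguished_name]
    except KeyError:
        raise RoleMappingError(
            "no database role for " + distinguished_name
        ) from None


mapping = load_role_map(ROLE_MAP_XML)
print(map_role(mapping, "/C=UK/O=eScience/OU=Edinburgh/CN=alice example"))
```

Because every user needs an entry in the file, this style of mapping does not scale, which is why the role mapper is offered as a replaceable component.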
The OGSA-DAI Engine ensures that all activities run correctly and coordi-
nates the passing of data from one to the other. Failure in one activity signifies
failure in the execution of the whole perform document. As yet there is no trans-
actional behaviour, including rollback mechanisms, although this is planned for
a future release. Data may be piped in from a third party using a delivery from
activity or sent to a third party using a delivery to activity. Both of these can
use transport protocols other than SOAP, for example GridFTP, FTP or HTTP, to
fetch data or send it to a third party. If the processing completes successfully,
the data or the status of the processing is sent back to the client in a response
document.
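The Engine behaviour described above (activities run in order, data passed from one to the next, a single failure failing the whole request, and no rollback) can be sketched as follows. This is a simplified model for illustration only; the function name, the status strings and the dictionary layout of the response are assumptions, not part of OGSA-DAI.

```python
def run_perform_document(activities, initial=None):
    """Run (name, callable) activities in order, piping output to input.

    Returns a response-like dict. Activities that already completed are not
    undone on failure, reflecting the absence of transactional behaviour.
    """
    data = initial
    completed = []
    for name, activity in activities:
        try:
            data = activity(data)
        except Exception as exc:
            # A failure in one activity fails the whole perform document.
            return {
                "status": "ERROR",
                "failed": name,
                "completed": completed,
                "detail": repr(exc),
            }
        completed.append(name)
    return {"status": "COMPLETED", "result": data, "completed": completed}


ok = run_perform_document([
    ("query", lambda _: [3, 1, 2]),
    ("sort", sorted),
    ("count", len),
])

bad = run_perform_document([
    ("query", lambda _: [3, 1, 2]),
    ("transform", lambda rows: rows[10]),  # raises IndexError
    ("deliver", len),
])

print(ok)
print(bad["status"], bad["failed"], bad["completed"])
```

In the failing run the query has already executed when the transform fails, and nothing undoes it; the client simply receives an error status for the request as a whole.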
From this brief overview we can see that OGSA-DAI is a sophisticated piece
of middleware that provides a uniform access interface to various types of data.
It partially virtualises data: intrinsic connection mechanisms to the underlying
data resource are no longer a concern but a client still needs to know the under-
lying type of data model that is being used – for instance SQL queries need to
be targeted at a relational data resource and will make no sense when targeted
at an XML database. Moreover, query expressions targeted at a particular data
resource are not inspected so any vendor specific language extensions must also
be appropriate for the underlying data resource used. A client is able to deter-
mine the type of the underlying data resource via metadata available through
the service interface that then allows it to direct the appropriate type of queries
for that type of data resource. However, OGSA-DAI does provide a basis for
data model integration through transformation activities which, for example,
translate the results of queries on XML and relational data resources into
WebRowSet before aggregating them. This completes the outline of the basics of
the OGSA-DAI framework. The next section briefly outlines a couple of usage
scenarios.
3 Usage Scenarios
OGSA-DAI provides a versatile framework which can be used to provide data
access capabilities within Grid infrastructures. Many projects already use
OGSA-DAI, primarily in research areas such as GIS and bioinformatics. An
up-to-date list can be found on the OGSA-DAI website.
Five basic common usage patterns are illustrated in Figure 2.
Fig. 2. OGSA-DAI scenarios.
The simple intermediary is the simplest archetypal usage scenario supported
by OGSA-DAI, and is the basis for many of the higher-level scenarios. This
scenario consists of an OGSA-DAI service interposed between client applications
and a data resource providing a consistent interface for different kinds of data
and supporting a rich, extensible set of operations that can be performed on that
data. Using this base scenario one can envisage many discoverable OGSA-DAI
services listed in third party registries and used by clients to retrieve data for
their specific ends, all different types of data shared and made available through
a common interface. In addition, examination of this basic usage pattern has
also led to various optimisations being made for the 2.2 release of OGSA-DAI;
see [18] for more details.
The persistent intermediary scenario illustrates the use of mechanisms for
storing intermediate results which can then be used by subsequent requests.
These intermediate results could be stored transparently in memory, in a local
database, in the local file system or by some other suitable means on behalf of
the OGSA-DAI service clients. This scenario is currently partially supported by
the OGSA-DAI dataStore activity, which holds results in memory, although this
can be extended to hold results in more permanent storage. It can also be
implemented using OGSA-DAI by storing data temporarily in a scratch database
accessible by the service. This functionality allows a coordinating service to
hold temporary data to perform data joins from multiple data resources.
The redirector scenario allows data to be sent to a third party, including
the originator, as opposed to embedding it in the response. Moreover, the third
party delivery protocol does not have to be SOAP based: data can be delivered
using GridFTP, FTP, or some other delivery means. In this instance SOAP is
effectively being used as the control channel, while the data channel uses a
more efficient transport protocol. OGSA-DAI supports this scenario by allowing
a number of alternative data transport mechanisms that can also be used to
transfer data into OGSA-DAI services.
In the coordinator scenario, an OGSA-DAI service interfaces to an arbitrary
number of data resources and presents them as a composite resource to its clients,
producers, and consumers. This means that data can be routed between data
resources or combined from those resources within a single request or session
without routing data via the client. There is already some support for this type
of scenario in OGSA-DAI as multiple data resources can be configured per data
service and used with specialised query activities to provide resilient querying of
a set of data resources sharing a common schema. This presents the set of data
resources as a single virtualised data resource.
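One plausible reading of resilient querying over resources that share a schema is sketched below in Python, using in-memory SQLite databases as stand-ins for the data resources. The table, the resource names and the policy of skipping unavailable resources are assumptions made for illustration; the paper does not specify how the specialised query activities behave.

```python
import sqlite3


def make_resource(rows):
    """Create an in-memory database with the shared (hypothetical) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    return conn


def resilient_query(resources, sql):
    """Run the same query on every resource and merge the rows.

    Resources that fail are skipped and reported rather than aborting
    the whole request.
    """
    merged, failed = [], []
    for name, conn in resources.items():
        try:
            merged.extend(conn.execute(sql).fetchall())
        except sqlite3.Error:
            failed.append(name)
    return merged, failed


resources = {
    "site_a": make_resource([(1, "Ada")]),
    "site_b": make_resource([(2, "Grace")]),
    "site_c": make_resource([(3, "Edsger")]),
}
resources["site_c"].close()  # simulate an unavailable data resource

rows, failed = resilient_query(resources, "SELECT id, name FROM people")
print(rows)
print(failed)
```

The client issues one query and sees one result set, so the three databases appear as a single virtualised resource even though one of them could not be reached.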
In the network assembly scenario an OGSA-DAI service uses an arbitrary
number of other OGSA-DAI services as well as data resources already curated
by the service in order to collect together data. This type of service coordination
successively adds facilities that may be used in combination towards achieving
a data-oriented workflow. The invoked services in this workflow do not have to
be OGSA-DAI services. The multiple services may form a pipeline in order to
draw on additional computation facilities or a tree in order to place parts of
a total query close to the data sources. As this permits arbitrary fan out and
arbitrary recursive composition, many architectures are possible: a simple
example is shown in Figure 2. OGSA-DQP provides an instance of the network
assembly pattern using OGSA-DAI services as well as some of its own service types.
In general the documentation of scenarios like those described above is
beneficial as a means of providing best practice and guidelines for using the
features and components of OGSA-DAI. There is insufficient space here to go
into more depth, but best practice and guidelines are being documented in the
OGSA-DAI Web pages (see www.ogsadai.org.uk/documentation/scenarios for
details). These scenarios, as well as other inputs such as performance studies
[19], are being used to motivate the future directions being taken by OGSA-DAI,
which are briefly outlined in the next section.
4 Future Directions
A number of architectural changes are about to be introduced in the next
OGSA-DAI release and following releases. Some of the highlights are:
– Improved scalability by providing load balancing capabilities to dispatch
incoming requests to different JVMs, potentially running on different machines,
to execute perform documents. Initial policies will be simple, e.g. round-robin,
but additional, more complex policies will be enabled as well.
– Improved robustness by allowing requests to run on different JVMs, so that if
a request has aberrant behaviour it does not bring down the whole container
and compromise other jobs.
– Improved activity model to make it easier to develop and maintain activities
while at the same time providing more powerful mechanisms to dynamically
– Improved session handling will allow activities to store and retrieve data
from an existing session, which could span multiple requests.
– A new resource model will allow perform documents to contain activities that
can access more than one resource exposed by an OGSA-DAI service (currently
a perform document can only target a single data resource). This new
model will thus allow more powerful user-driven data integration scenarios to
be enacted by an OGSA-DAI service between multiple resources. It will also
allow other OGSA-DAI components to be treated as WSRF resources. For
example, sessions, requests, and data sinks/sources (input/output streams)
can be modelled as resources, which then allows these to be endowed with
mechanisms for lifetime management and authorisation as available in other
– New data integration activities: taking advantage of the new resource model,
a new set of data integration activities is being designed that should facilitate
the enactment of data integration scenarios.
– Distributed Query Processing capabilities are being introduced through the
absorption of the OGSA-DQP project, currently distributed separately from
OGSA-DAI, into the OGSA-DAI product itself.
– A new tuple intermediate data format for relational data, called an ODTuple
(for OGSA-DAI Tuple), will provide a common way for connected activities
to exchange data. This will minimise the amount of data conversion that is
required to take place between activities. This format is:
• able to stream well within and between processes,
• efficient for single types, elements, and tuples,
• able to support base types plus String, File, BLOB, and NULL,
• able to support warnings, errors and exceptions, and
• easily extensible.
This relational structure can be used to represent the majority of the
current data formats used within OGSA-DAI, such as WebRowSet and CSV
(Comma Separated Values).
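The simple round-robin policy mentioned in the scalability highlight of the list above can be sketched in a few lines of Python. The class name, the worker labels standing in for JVMs and the request strings are hypothetical; the sketch only shows the dispatch order, not how a request would actually be handed to another JVM.

```python
from itertools import cycle


class RoundRobinDispatcher:
    """Assign each incoming request to the next worker in a fixed rotation."""

    def __init__(self, workers):
        if not workers:
            raise ValueError("at least one worker is required")
        self._rotation = cycle(list(workers))

    def dispatch(self, request):
        """Return the (worker, request) pairing chosen for this request."""
        worker = next(self._rotation)
        return worker, request


dispatcher = RoundRobinDispatcher(["jvm-1", "jvm-2", "jvm-3"])
assignments = [dispatcher.dispatch("request-%d" % i)[0] for i in range(5)]
print(assignments)
```

Round-robin ignores how busy each worker is, which is presumably why more complex policies are planned to follow it.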
These additions to the next release will make OGSA-DAI a more powerful
framework and increase the support for data integration as well as making the
scenarios described in the previous section easier to implement and extend.
5 Conclusions
This paper has provided motivation for the production of middleware to
facilitate the sharing of data within established communities, enabling new
insights and discoveries. The provision of middleware that facilitates this
process is the underlying motivation for OGSA-DAI. OGSA-DAI is not targeted
directly at the end user; rather, it provides a framework that has to be
customised for a given user community by its own developers. Through the use of
OGSA-DAI the amount of effort required to produce these targeted data services
and applications should be greatly reduced. A snapshot overview of OGSA-DAI
has been given, together with some indicators of the future directions being
taken to enhance the product and provide additional capabilities for those that
rely on OGSA-DAI for their base data access and integration requirements. More
information about OGSA-DAI is available, and the software may be downloaded,
from the project Web site at www.ogsadai.org.uk.
Acknowledgements
This work is supported by the UK e-Science Grid Core Programme, through
the Open Middleware Infrastructure Institute UK, and by the Mathematical,
Information, and Computational Sciences Division subprogram of the Office of
Advanced Scientific Computing Research, Office of Science, U.S. Department of
Energy, under Contract W-31-109-ENG-38.
References
[1] S.G. Djorgovski. Virtual astronomy, information technology, and the new
scientific methodology. Proceedings of the Seventh International Workshop on
Computer Architecture for Machine Perception (CAMP 2005), pp. 125-132,
4-6 July 2005.
[2] R. Edgar, M. Domrachev and A.E. Lash. Gene Expression Omnibus: NCBI gene
expression and hybridization array data repository. Nucleic Acids Research,
2002, Vol. 30, No. 1, pp. 207-210.
[3] J.M. Brady, D.J. Gavaghan, A.C. Simpson, M. Mulet-Parada and R.P. Highnam.
eDiaMoND: a Grid-enabled federated database of annotated mammograms. In: F.
Berman, G.C. Fox and A.J.G. Hey, Editors, Grid Computing: Making the Global
Infrastructure a Reality, Wiley Series (2003), pp. 923-943.
[4] K.H. Buetow. Cyberinfrastructure: Empowering a "Third Way" in Biomedical
Research. Science, 6 May 2005, Vol. 308, No. 5723, pp. 821-824.
[5] Virtual observatory finds black holes in previous data. News in brief.
Nature 429, 494-495, June 2004.
[6] Astronomers Detect New Category of Elusive 'Brown Dwarf'. The New York
Times, Tuesday, June 1 1999.
[7] A. Wipat, Y. Sun, M. Pocock, P. Lee, P. Watson and K. Flanagan. Developing
Grid-based Systems for Microbial Genome Comparisons: The Microbase Project.
Proceedings of the UK e-Science All Hands Meeting 2004.
[8] A. Solomonides, R. McClatchey, M. Odeh, M. Brady, M. Mulet-Parada, D.
Schottlander and S.R. Amendolia. MammoGrid and eDiamond: Grids Applications in
Mammogram Analysis. Proceedings of the IADIS International Conference:
e-Society 2003. Lisbon, Portugal. June 2003. A. Palma dos Reis and P. Isaias,
Editors, pp. 1032-
[9] Storage Resource Broker (SRB), www.sdsc.edu/srb.
[10] R.O. Sinnott and D. Houghton. Comparison of Data Access and Integration
Technologies in the Life Science Domain. Proceedings of the UK e-Science All
Hands Meeting 2005, September 2005.
[11] Mobius, projectmobius.osu.edu.
[12] WebSphere Information Integrator
[13] OGSA-DAI-RDF project, www.dbgrid.org.
[14] OMII-Europe project, www.omii-europe.org.
[15] InteliGrid project, www.inteligrid.com.
[16] N. Alpdemir, A. Mukherjee, A. Gounaris, N.W. Paton, P. Watson, and A.A.A.
Fernandes. OGSA-DQP: A Grid service for distributed querying on the Grid. LNCS
Volume 2992, pp. 858-861, 2004.
[17] M. Gudgin, M. Hadley, T. Rogers. Web Services Addressing 1.0 - Core
(WS-Addressing). W3C Recommendation, 9 May 2006.
[18] B. Dobrzelecki, M. Antonioletti, J.M. Schopf, A.C. Hume, M. Atkinson, N.P.
Chue Hong, M. Jackson, K. Karasavvas, A. Krause, M. Parsons, T. Sugden, and
E. Theocharopoulos. Profiling OGSA-DAI Performance for Common Use Patterns.
Proceedings of the UK e-Science All Hands Meeting 2006.
[19] S. Kottha, K. Abhinav, R. Muller-Pfefferkorn, and H. Mix. Accessing
Bio-Databases with OGSA-DAI - A Performance Analysis. To appear in International
Workshop on Distributed, High Performance and Grid Computing in Computational