REVIEW Open Access
The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows
The DBCLS BioHackathon Consortium*
Toshiaki Katayama*, Kazuharu Arakawa, Mitsuteru Nakao, Keiichiro Ono, Kiyoko F Aoki-Kinoshita,
Yasunori Yamamoto, Atsuko Yamaguchi, Shuichi Kawashima, Hong-Woo Chun, Jan Aerts, Bruno Aranda,
Lord Hendrix Barboza, Raoul JP Bonnal, Richard Bruskiewich, Jan C Bryne, José M Fernández, Akira Funahashi,
Paul MK Gordon, Naohisa Goto, Andreas Groscurth, Alex Gutteridge, Richard Holland, Yoshinobu Kano,
Edward A Kawas, Arnaud Kerhornou, Eri Kibukawa, Akira R Kinjo, Michael Kuhn, Hilmar Lapp, Heikki Lehvaslaiho,
Hiroyuki Nakamura, Yasukazu Nakamura, Tatsuya Nishizawa, Chikashi Nobata, Tamotsu Noguchi, Thomas M Oinn,
Shinobu Okamoto, Stuart Owen, Evangelos Pafilis, Matthew Pocock, Pjotr Prins, René Ranzinger, Florian Reisinger,
Lukasz Salwinski, Mark Schreiber, Martin Senger, Yasumasa Shigemoto, Daron M Standley, Hideaki Sugawara,
Toshiyuki Tashiro, Oswaldo Trelles, Rutger A Vos, Mark D Wilkinson, William York, Christian M Zmasek, Kiyoshi Asai,
Toshihisa Takagi
* Correspondence: ktym@hgc.jp
Database Center for Life Science, Research Organization of Information and Systems, 2-11-16 Yayoi, Bunkyo-ku, Tokyo, 113-0032, Japan
Abstract
Web services have become a key technology for bioinformatics, since life science databases are globally decentralized and the exponential increase in the amount of available data demands efficient systems that do not require transferring entire databases for every step of an analysis. However, various incompatibilities among database resources and analysis services make it difficult to connect and integrate these into interoperable workflows. To resolve this situation, we invited domain specialists from web service providers, client software developers, Open Bio* projects, the BioMoby project and researchers of emerging areas where a standard exchange data format is not well established, for an intensive collaboration entitled the BioHackathon 2008. The meeting was hosted by the Database Center for Life Science (DBCLS) and Computational Biology Research Center (CBRC) and was held in Tokyo from February 11th to 15th, 2008. In this report we highlight the work accomplished and the common issues that arose from this event, including the standardization of data exchange formats and services in the emerging fields of glycoinformatics, biological interaction networks, text mining, and phyloinformatics. In addition, common shared object development based on BioSQL, as well as technical challenges in large data management, asynchronous services, and security are discussed. Consequently, we improved interoperability of web services in several fields; however, further cooperation among major database centers and continued collaborative efforts between service providers and software developers are still necessary for an effective advance in bioinformatics web service technologies.
Katayama et al. Journal of Biomedical Semantics 2010, 1:8
http://www.jbiomedsem.com/content/1/1/8
© 2010 Katayama et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Introduction
Web services are software systems designed to be manipulated remotely over a network, often through web-based application programming interfaces (APIs). Through web services, users can take advantage of the latest maintained data and computational resources of remote service providers via a thin client. Web services are increasingly being adopted in the field of bioinformatics as an effective means for data and software access, especially in light of the rapid accumulation of large amounts of information for the life sciences [1]. Most of the major bioinformatics centers, including the National Center for Biotechnology Information (NCBI) in the US [2], the European Bioinformatics Institute (EBI) in the UK [3], and the DNA Data Bank of Japan (DDBJ) [4]/Kyoto Encyclopedia of Genes and Genomes (KEGG) [5]/Protein Data Bank Japan (PDBj) [6] in Japan, provide web service interfaces to their databases and computational resources. Since the web service model is based on open standards, these services are designed and expected to be interoperable [7]. However, many of the services currently available use their own data type definitions and naming conventions, resulting in a lack of interoperability that makes it harder for end users and developers to utilize these services for the creation of biological analysis workflows [8]. Moreover, these services are often not easily usable from programs written in specific computer languages, despite the language-independent specification of web services themselves. The main reasons include the use of functionality not supported in a particular web service software implementation, and the lack of compliance with the SOAP/WSDL specification in a programming language’s web service libraries.
To overcome this situation and to assure interoperability between web services for biology, standardization of exchangeable data types and adoption of compatible interfaces to each service are essential. As a pilot study, the BioMoby project has tried to solve these problems by defining ontologies for data types and methods used in its services, and by providing a centralized repository for service discovery. Additionally, Moby client software exists to allow interconnections of multiple web services [9,10]. However, there are still many major service providers that are not yet covered by the BioMoby framework, and the Open Bio* libraries such as BioPerl [11], BioPython [12], BioRuby [13], and BioJava [14] have independently implemented access modules for some of these services [15].
To address these issues, we organized the BioHackathon 2008 [16], an international workshop sponsored by two Japanese bioinformatics centers, the Database Center for Life Science (DBCLS) [17] and the Computational Biology Research Center (CBRC) [18], focusing on the standardization and interoperability of web services. The meeting consisted of two parts: the first day was dedicated to keynote presentations and “open space” style discussions to identify current problems and to decide on strategies for possible solutions in each subgroup. The remaining four days were allotted for an intensive software coding event. Standardization and interoperability of web services were discussed by experts invited from four different domains: 1) web service providers, 2) Open Bio* developers, 3) workflow client developers, and 4) BioMoby project developers. Providers of independent web services were encouraged to address standardization and service integration, and were also asked to implement (and hence increase the number of) SOAP-compliant services for analysis tools and databases. Open Bio* developers focused on the utilization of as many bioinformatics web services as possible in four major computer languages (Perl, Python, Ruby, and Java), and collaborated to create compatible data models for common biological objects such as sequences and phylogenetic trees within the Open Bio* libraries. Workflow client developers were challenged to create and execute bioinformatics workflows combining various web service resources, and BioMoby project developers explored the best solution to define standard objects and ontologies in bioinformatics web services. In the following sections, we review the outcomes of standardization and interoperability discussions as well as the future challenges and directions of web services for bioinformatics that were highlighted in this workshop.
Web service technologies
Bioinformatics web services can be categorized into two major functional groups: data access and analysis. Access to public database repositories is obviously fundamental to bioinformatics research, and various systems have been developed for this purpose, such as Entrez at NCBI, Sequence Retrieval System (SRS) and EB-eye at EBI [19], Distributed Annotation System (DAS) [20], All-round Retrieval for Sequence and Annotation (ARSA) and getentry at DDBJ [21], DBGET at KEGG [22], and XML-based Protein Structure Search Service (xPSSS) at PDBj [6]. These services provide programmable means for text-based keyword search and entry retrieval from their backend databases, which mostly consist of static entries written either in semi-structured text or XML. As each entry has a unique identifier, it is generally assignable to a Uniform Resource Identifier (URI).
The other group of services provides a variety of methods that require a certain amount of computation by implementing various algorithms, and they sometimes have complex input or output data structures. A typical example is a BLAST search, which needs a nucleic or amino acid sequence, as well as numerous optional arguments in order to find homologous sequences from a specified database using a dynamic programming algorithm. Services in this group sometimes require a large amount of computation time, including those providing certain functionalities of the European Molecular Biology Open Software Suite (EMBOSS) [23], 3D structural analysis of proteins, and data mapping on biochemical pathways.
Historically, the term web services was associated with SOAP (Simple Object Access Protocol), a protocol that transfers messages in a SOAP XML envelope between a server and a client, usually over the Hypertext Transfer Protocol (HTTP) [24]. SOAP services have several accessibility advantages, including an open standard that is independent from computer programming languages, and the use of the HTTP protocol which is usually not filtered by firewalls (SOAP services can therefore be accessed even from institutions having very strict security policies for Internet access). Since all SOAP messages are XML documents and the format of the messages is known in advance from the service description (see below), it is possible to use XML binding to seamlessly convert the messages to language-specific objects and thus avoid any custom-programmed parsing. XML binding is often leveraged by SOAP libraries to provide a programmatic interface to a web service similar to an object oriented API. Operations provided by SOAP services can consume several arguments; thus, a service that requires a number of parameters can easily be utilized as an API, as if the method were a function call for a local library of a given programming language.
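The envelope structure described above can be illustrated with a minimal sketch that uses only the Python standard library. The service namespace, the operation name `searchSimple`, and its parameter names are hypothetical and do not correspond to any provider's actual interface; a real client would normally rely on a SOAP library generated from the service description rather than build messages by hand.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/blast"  # hypothetical service namespace (assumption)


def build_request(operation: str, params: dict) -> bytes:
    """Wrap an operation call and its named arguments in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="utf-8")


def parse_call(message: bytes):
    """Recover the operation name and its arguments from a SOAP message."""
    root = ET.fromstring(message)
    call = root.find(f"{{{SOAP_ENV}}}Body")[0]  # first child of Body is the call
    operation = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    arguments = {child.tag.split("}", 1)[1]: child.text or "" for child in call}
    return operation, arguments


# Hypothetical operation and parameters, for illustration only.
request = build_request(
    "searchSimple",
    {"program": "blastp", "database": "swissprot", "query": "MKTAYIAKQR"},
)
operation, arguments = parse_call(request)
```

Because both sides agree on the message layout in advance, the round trip from named arguments to XML and back needs no service-specific parsing code, which is the property that XML binding exploits.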
For the purpose of service description, SOAP services usually come with a Web Services Description Language (WSDL) [25] file. A WSDL file is an XML formatted document that is consumed by a SOAP/WSDL library to allow automatic construction of a set of functions for the client program. In addition to the list of methods, WSDL contains descriptions for each method, including the types and numbers of input arguments as well as those of output data. WSDL is also capable of describing complex data models that combine basic data types into nested data objects. In this way, SOAP services can accept various kinds of complex biological objects, such as a protein sequence entry accompanied by several annotation properties like the identifier, description, and source organism.
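As a rough illustration of how a client library can derive method signatures from such a description, the following sketch reads a heavily trimmed, hypothetical WSDL 1.1 document and lists each operation with its input and output parts. The `getEntry` operation, its message names, and the `example.org` namespace are assumptions made for this example; bindings, service endpoints, and complex types are omitted.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}  # WSDL 1.1 namespace

# Hypothetical, heavily trimmed WSDL 1.1 document (assumption, not a real service).
WSDL_TEXT = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:tns="http://example.org/seq"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema"
             targetNamespace="http://example.org/seq">
  <message name="getEntryRequest">
    <part name="database" type="xsd:string"/>
    <part name="identifier" type="xsd:string"/>
  </message>
  <message name="getEntryResponse">
    <part name="entry" type="xsd:string"/>
  </message>
  <portType name="SequencePort">
    <operation name="getEntry">
      <input message="tns:getEntryRequest"/>
      <output message="tns:getEntryResponse"/>
    </operation>
  </portType>
</definitions>
"""


def describe_operations(wsdl_text: str) -> dict:
    """Map each operation name to the (name, type) parts of its input and output."""
    root = ET.fromstring(wsdl_text)
    # Index every message by name so operations can be resolved against it.
    messages = {
        msg.get("name"): [
            (part.get("name"), part.get("type"))
            for part in msg.findall("wsdl:part", WSDL_NS)
        ]
        for msg in root.findall("wsdl:message", WSDL_NS)
    }
    operations = {}
    for op in root.findall("wsdl:portType/wsdl:operation", WSDL_NS):
        signature = {}
        for kind in ("input", "output"):
            reference = op.find(f"wsdl:{kind}", WSDL_NS).get("message")
            # Drop the namespace prefix ("tns:") to obtain the message name.
            signature[kind] = messages[reference.split(":", 1)[-1]]
        operations[op.get("name")] = signature
    return operations


operations = describe_operations(WSDL_TEXT)
```

The resulting dictionary contains everything needed to generate a local function stub, which is essentially what SOAP/WSDL libraries automate.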
Recently, another kind of web service model named REST (Representational State Transfer) has rapidly gained popularity as an effective alternative approach to SOAP-based web services [1]. REST is an approach whereby an online service is decomposed into uniquely identifiable, stateless resources that can be called through a URL and return the relevant data in any format. Typically, many bioinformatics database services return entries in a text-based flatfile format upon REST calls. The strength of REST is in its simplicity. Since REST is built on top of HTTP requests, there is no need for supporting libraries, unlike SOAP/WSDL services. RESTful URLs are also highly suitable for permanent resource mapping, such as that between a database entry and a unique URI; therefore, biological web services that provide data access should ideally be exposed as simple REST services. On the other hand, REST is less appropriate for services that require complex input with many parameters, or for time-consuming and therefore asynchronous and stateful services. For those, SOAP/WSDL-based services are still more suitable.
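The mapping between a database entry and a stable URL can be sketched in a few lines of standard-library Python. The base URL, the path layout, and the database and entry names below are hypothetical; actual providers define their own URL schemes. The `fetch_entry` helper shows that a plain HTTP GET is all a client needs, but it is not called here because the endpoint does not exist.

```python
from urllib.parse import quote, unquote, urlsplit
from urllib.request import urlopen

BASE_URL = "https://rest.example.org/entry"  # hypothetical endpoint (assumption)


def entry_url(database: str, identifier: str) -> str:
    """Map a (database, identifier) pair to a stable resource URL."""
    db = quote(database, safe="")
    entry = quote(identifier, safe="")  # percent-encode ":" and "/" in identifiers
    return f"{BASE_URL}/{db}/{entry}"


def parse_entry_url(url: str):
    """Invert entry_url: recover the database name and identifier from a URL."""
    database, identifier = urlsplit(url).path.split("/")[-2:]
    return unquote(database), unquote(identifier)


def fetch_entry(database: str, identifier: str, timeout: float = 30.0) -> str:
    """Retrieve an entry as text with a plain HTTP GET (requires a real server)."""
    with urlopen(entry_url(database, identifier), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical database and entry names, for illustration only.
url = entry_url("exampledb", "ENTRY:0001")
```

Since the URL is a pure function of the database name and the entry identifier, it can be bookmarked, cached, or embedded in other resources, which is what makes REST attractive for data access.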
A WSDL description per se is not enough for the immediate construction of biological workflows as multiple cascading web services, because of inconsistent data types defined by each service provider, sometimes even for essentially identical objects. Therefore, in most cases the output of one service cannot be passed to another service as its input without appropriate conversion of data types or formats. Furthermore, services should also be discoverable by the object models they share so they can be linked in the construction of workflows. To this end, a centralized registry to discover appropriate services according to a given set of data types has become essential for web service interoperability. The BioMoby project has pioneered this task by providing MobyCentral, which serves as a central repository for BioMoby compatible web services [9]. Service developers are encouraged to register their own service to the repository with a description of the service using the BioMoby ontologies that classify the semantic attributes of the method including the input and output data types. Metadata and ontologies for service description and discovery discussed during the BioHackathon are listed in Table 1.
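The idea of discovering services by the data type they consume can be illustrated with a toy in-memory registry. This sketch is not the MobyCentral API; the type names, the is-a relations between them, and the service names are all hypothetical and serve only to show how a small type ontology lets a lookup return services registered for more general types as well.

```python
from dataclasses import dataclass

# Hypothetical is-a relations between data types (child -> parent).
PARENT_TYPE = {
    "AminoAcidSequence": "Sequence",
    "NucleotideSequence": "Sequence",
    "Sequence": "Object",
}


@dataclass(frozen=True)
class ServiceRecord:
    name: str
    input_type: str
    output_type: str


class Registry:
    """Toy registry indexed by the data type each service consumes."""

    def __init__(self):
        self._by_input = {}

    def register(self, record: ServiceRecord) -> None:
        self._by_input.setdefault(record.input_type, []).append(record)

    def find_consumers(self, data_type: str) -> list:
        """Return services accepting data_type or any of its ancestor types."""
        matches = []
        current = data_type
        while current is not None:
            matches.extend(self._by_input.get(current, []))
            current = PARENT_TYPE.get(current)  # walk up the type hierarchy
        return matches


registry = Registry()
registry.register(ServiceRecord("runBlastp", "AminoAcidSequence", "BlastReport"))
registry.register(ServiceRecord("computeLength", "Sequence", "Integer"))
registry.register(ServiceRecord("translate", "NucleotideSequence", "AminoAcidSequence"))

protein_consumers = [r.name for r in registry.find_consumers("AminoAcidSequence")]
```

In this sketch a protein sequence matches both the service registered for its exact type and the one registered for the more general `Sequence` type, while a service that only accepts nucleotide sequences is excluded.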
To date, several applications that utilize BioMoby services have been developed, such as Taverna [26], Seahawk [27], MOWserv [8], and G-language Genome Analysis Environment (G-language GAE) [28]. Taverna is a software tool developed under the myGrid project [29], written in Java and equipped with a graphical user interface (GUI) for the construction of workflows by interconnecting existing web services. Users can start from an initial set of data pipelined to a service, where the input data is remotely analyzed, resulting in an output of different data types. This output becomes the input for the subsequent analysis step, for which appropriate services that consume this input data type can be looked up, for example, through MobyCentral. Iteration of this procedure leads to cascading services forming a bioinformatics workflow, which can be repeatedly utilized with different datasets. The strength of Taverna is in its support of many non-BioMoby services that can be utilized in concert with BioMoby-based services, and its customizability by enabling small Java plug-ins to be written, for example to connect two services requiring data format conversion.
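The cascading of services, with a small converter bridging two steps whose data formats differ, can be sketched as follows. This is not Taverna's execution model; the steps are local functions standing in for remote services, and the step names, type labels, and converter are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    input_type: str
    output_type: str
    run: Callable[[str], str]  # local stand-in for a remote service call


def run_workflow(steps, data, data_type, converters):
    """Feed each step's output into the next, converting formats when needed."""
    for step in steps:
        if data_type != step.input_type:
            convert = converters.get((data_type, step.input_type))
            if convert is None:
                raise TypeError(
                    f"no converter from {data_type} to {step.input_type} for {step.name}"
                )
            data = convert(data)
        data = step.run(data)
        data_type = step.output_type
    return data, data_type


COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    return sequence.translate(COMPLEMENT)[::-1]


def count_fasta_residues(fasta: str) -> str:
    body = "".join(line for line in fasta.splitlines() if not line.startswith(">"))
    return str(len(body))


steps = [
    Step("reverseComplement", "raw_dna", "raw_dna", reverse_complement),
    Step("countResidues", "fasta", "integer_text", count_fasta_residues),
]
# The converter plays the role of a small plug-in bridging two data formats.
converters = {("raw_dna", "fasta"): lambda seq: f">query\n{seq}\n"}

result = run_workflow(steps, "ATGCC", "raw_dna", converters)
```

Once defined, the same list of steps can be rerun on different input sequences; removing the converter makes the type mismatch between the two steps surface as an explicit error rather than as malformed input to the second service.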
Seahawk is another GUI software tool that invokes BioMoby services in a context-dependent manner, for example, by selecting an amino acid sequence in a website to use as input data, so that users can analyze data as they browse information on web pages.
MOWserv [30] is a web application that provides interactive analysis in a web browser. A web interface is dynamically generated for each BioMoby object and compatible service. MOWserv implements novel functionality to allow data persistence, user management, task scheduling and fault-tolerance capabilities. MOWserv therefore allows monitoring of long and CPU-intensive tasks and automating the execution of complex workflows. Invocation of services can be traced in the web interface, including for later reference. An interesting aspect of MOWserv is that it has extended the BioMoby ontologies for objects and services through manual curation. This keeps the ontologies clean, which greatly simplifies interoperability between services and helps in
Table 1 Required metadata for service description and discovery
Required metadata for service description
author contact
authority identification
service version
software title or nature of algorithm (myGrid Task ontology)
software version
bandwidth and/or number of requests per minute
example input
example output and/or REGEXP to test output
some description of error-handling capacity
sync/async
nature of underlying data
organism
biological nature of data (DNA/RNA/Protein, experimental methods or platform)
input parameters and purpose of each
output parameters and purpose of each
usage/license restrictions
authentication (whether required or not)
usage statistics (as per service provider)
usage statistics (as per third party commentary)
protocol (Moby, SOAP, REST, GET, POST, etc.)
mirror servers
Ontologies that could provide the above metadata
myGrid Ontology: provides many of the annotation information elements listed above
Moby Object: provides an ontology of data-types
Moby Service: similar to myGrid’s bioinformatics_task branch of the myGrid Ontology
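A subset of the metadata in Table 1 could be captured in a simple record, as in the sketch below. The field names, the example values, and the choice of a regular expression check are assumptions made for illustration; the table does not prescribe a concrete schema.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """A few of the Table 1 metadata items; field names are hypothetical."""
    author_contact: str
    authority: str
    service_version: str
    protocol: str          # e.g. Moby, SOAP, REST
    synchronous: bool      # sync/async
    example_input: str
    output_pattern: str    # REGEXP to test output


def output_looks_valid(description: ServiceDescription, output: str) -> bool:
    """Check a service response against the regular expression in its metadata."""
    return re.fullmatch(description.output_pattern, output) is not None


# Hypothetical values, for illustration only.
description = ServiceDescription(
    author_contact="curator@example.org",
    authority="example.org",
    service_version="1.0",
    protocol="REST",
    synchronous=True,
    example_input="ENTRY:0001",
    output_pattern=r">\S+.*\n[ACGT\n]+",  # a FASTA-like nucleotide record
)

valid = output_looks_valid(description, ">ENTRY:0001 example\nACGTACGT\n")
invalid = output_looks_valid(description, "error: entry not found")
```

Pairing an example input with a pattern for the expected output lets a registry or a client test whether a service is still responding as described.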