REVIEW Open Access
The 2nd DBCLS BioHackathon: interoperable
bioinformatics Web services for integrated
applications
Toshiaki Katayama*, Mark D Wilkinson, Rutger Vos, Takeshi Kawashima, Shuichi Kawashima, Mitsuteru Nakao,
Yasunori Yamamoto, Hong-Woo Chun, Atsuko Yamaguchi, Shin Kawano, Jan Aerts, Kiyoko F Aoki-Kinoshita,
Kazuharu Arakawa, Bruno Aranda, Raoul JP Bonnal, José M Fernández, Takatomo Fujisawa, Paul MK Gordon,
Naohisa Goto, Syed Haider, Todd Harris, Takashi Hatakeyama, Isaac Ho, Masumi Itoh, Arek Kasprzyk, Nobuhiro Kido,
Young-Joo Kim, Akira R Kinjo, Fumikazu Konishi, Yulia Kovarskaya, Greg von Kuster, Alberto Labarga,
Vachiranee Limviphuvadh, Luke McCarthy, Yasukazu Nakamura, Yunsun Nam, Kozo Nishida, Kunihiro Nishimura,
Tatsuya Nishizawa, Soichi Ogishima, Tom Oinn, Shinobu Okamoto, Shujiro Okuda, Keiichiro Ono, Kazuki Oshita,
Keun-Joon Park, Nicholas Putnam, Martin Senger, Jessica Severin, Yasumasa Shigemoto, Hideaki Sugawara,
James Taylor, Oswaldo Trelles, Chisato Yamasaki, Riu Yamashita, Noriyuki Satoh and Toshihisa Takagi
* Correspondence: ktym@hgc.jp
Database Center for Life Science,
Research Organization of
Information and Systems, 2-11-16
Yayoi, Bunkyo-ku, Tokyo, 113-0032,
Japan
Abstract
Background: The interaction between biological researchers and the bioinformatics
tools they use is still hampered by incomplete interoperability between such tools.
To ensure interoperability initiatives are effectively deployed, end-user applications
need to be aware of, and support, best practices and standards. Here, we report on
an initiative in which software developers and genome biologists came together to
explore and raise awareness of these issues: BioHackathon 2009.
Results: Developers in attendance came from diverse backgrounds, with experts in
Web services, workflow tools, text mining and visualization. Genome biologists
provided expertise and exemplar data from the domains of sequence and pathway
analysis and glyco-informatics. One goal of the meeting was to evaluate the ability
to address real world use cases in these domains using the tools that the developers
represented. This resulted in i) a workflow to annotate 100,000 sequences from an
invertebrate species; ii) an integrated system for analysis of the transcription factor
binding sites (TFBSs) enriched based on differential gene expression data obtained
from a microarray experiment; iii) a workflow to enumerate putative physical protein
interactions among enzymes in a metabolic pathway using protein structure data;
iv) a workflow to analyze glyco-gene-related diseases by searching for human
homologs of glyco-genes in other species, such as fruit flies, and retrieving their
phenotype-annotated SNPs.
Conclusions: Beyond deriving prototype solutions for each use-case, a second major
purpose of the BioHackathon was to highlight areas of insufficiency. We discuss the
issues raised by our exploration of the problem/solution space, concluding that there
are still problems with the way Web services are modeled and annotated, including:
i) the absence of several useful data or analysis functions in the Web service “space”;
ii) the lack of documentation of methods; iii) lack of compliance with the SOAP/
WSDL specification among and between various programming-language libraries;
and iv) incompatibility between various bioinformatics data formats. Although these
shortcomings still made it difficult to solve the real-world problems posed to the
developers by the biological researchers in attendance, we note the promise of
addressing these issues within a semantic framework.
Katayama et al. Journal of Biomedical Semantics 2011, 2:4
http://www.jbiomedsem.com/content/2/1/4
© 2011 Katayama et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Background
The life sciences are facing a new era in which unprecedented amounts of genomic-scale data
are produced daily. To handle the data from, for example, next-generation DNA
sequencers, each major genomics institute has been developing local data-management
tools and analytical pipelines. Even small laboratories are now able to plan large
sequencing projects due to significantly reduced sequencing costs and the availability
of commodity or contract-based sequencing centers. However, once in the hands of
biological researchers, these sequences must still undergo significant analyses to yield
novel discoveries, and research groups frequently create their own in-house analytical
pipelines. Such “boutique” analyses can, therefore, become a significant bottleneck for
genomics research when data sets are large. To overcome this, it is necessary to
improve the interaction between biological researchers and the bioinformatics tools
they require. To address these needs, researchers at the Database Center for Life
Science [1] have initiated a series of BioHackathons that bring together researchers
from the global bioinformatics and genomics communities to address specific problems
in a collaborative setting.
The first DBCLS BioHackathon [2] focused on standardization of bioinformatics Web
services, and in particular on standardizing data-exchange formats to increase
interoperability in support of simplified bioinformatics workflow construction. Nevertheless, to
make these interoperability initiatives operational for Life Science researchers - genome
biologists in particular - end-user applications need to be aware of, and support, these
best practices and standards. To this end, the second BioHackathon gathered developers
of mashup services and Web service providers together with genome biologists who
provided exemplar data to evaluate the ability of participating services and interfaces to
address real-world use cases.
The second DBCLS BioHackathon took place March 11-15, 2009 in Japan, jointly
hosted by the DBCLS and the Okinawa Institute of Science and Technology (OIST
[3]). DBCLS is a national project responsible for developing services to integrate
bioinformatics resources, while OIST hosts a research unit focusing on marine genomics.
The researchers and developers attending the second DBCLS BioHackathon
represented key resources and projects within a number of related domains (Figure 1):
Web services - Among participating Web service projects were a number of key
Japanese projects, including providers of database services (DDBJ WABI [4-6],
PDBj [7,8], KEGG API [9-12]) and providers of data integration and generic APIs.
Examples of the latter are the TogoDB [13] database services, which are exposed
through the TogoWS [14,15] architecture. There were also representatives of the
G-language Genome Analysis Environment [16], which is a set of Perl libraries for
genome sequence analysis that is compatible with BioPerl, and equipped with
several software interfaces (interactive Perl/UNIX shell with persistent data, AJAX
Web GUI, Perl API). In addition, there were representatives of projects developing
domain-specific Web services and standards for data formats and exchange
protocols (PSICQUIC [17], Glycoinformatics [18]). Lastly, there were representatives
from the Semantic Automated Discovery and Integration framework (SADI
[19,20]).
Workflows - While individual data and analytical tools are made available
through Web services, biological discovery generally requires executing a series of
data retrieval, integration, and analysis steps. Thus, it is important to have
environments within which Web services can be easily pipelined, and where
high-throughput data can be processed without manual intervention. Among the projects that
enable this were participants from ANNOTATOR [21], MOWserv [22]/jORCA
[23], IWWE&M [24], Taverna [25] and RIKEN Life Science Accelerator
(EdgeExpressDB [26]). In addition, there were representatives from BioMart [27], Galaxy
[28] and SHARE [29]. BioMart is a query-oriented data management system that is
particularly suited for providing ‘data mining’-like searches of complex descriptive
data. Galaxy is an interactive platform to obtain, process, and analyze biological
data using a variety of server-side tools. The SHARE client is designed specifically
for the SADI Web service framework, where it parses a SPARQL query and
automatically maps query clauses to appropriate Web services, thus automatically
creating a query-answering workflow.
Figure 1 Attendees of the DBCLS BioHackathon 2009. The BioHackathon 2009 was attended by
representatives from projects in Web services, Text Mining, Visualization and Workflow development, in
addition to genome biologists who provided real-world use cases from their research.
Text mining - Although a significant portion of our knowledge about life science
is stored in a vast number of papers, few linkages exist between the rich knowledge
“hidden” in the scientific literature and the rich data catalogued in our databases.
To bridge them automatically, we first need to annotate those papers manually.
Among the BioHackathon participants were researchers/developers from such
annotation projects, namely Kazusa Annotation [30] and Allie [31].
Visualization - Biological data visualization involves not only providing effective
abstractions of vast amounts of data, but also effective and facile ways to find,
retrieve and store data as needed by the biologist for a fast and complete visual
exploration. To achieve this, both data providers and tool developers need to work
collaboratively, as much of the data that we need to visualize is complex and
dispersed among a wide variety of non-coordinating providers. To complicate the
field even further, biologists work at a wide range of scales as they attempt to
discover new insights - from meta-genomic, multi-genome comparisons, to single
genomes, to interactome, to single gene or protein, to SNP information. Each scale
and type of data requires a different approach to visualization. At the
BioHackathon there were representatives from a number of visualization projects.
genoDive [32] is a genome browser for viewing schematic genome structures and
associated information in 3D space with the ability to execute a “semantic zoom”
(i.e. a zoom which is aware of its context). This novel representation of genomic
information provides an alternative to the more common 2D-track displays.
GenomeProjector [33] is a tool based on the Google Maps API that combines different
views of genomic data in the context of genome, plasmid, pathway and DNA
walk. In addition, there were representatives from Cytoscape [34,35] and GBrowse
[36,37]. GBrowse is a genome viewer and Cytoscape is a visualizer of
biomolecular interaction networks.
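The Workflows item above describes a client that maps the clauses of a query to appropriate Web services and chains them into a query-answering workflow. The following is a minimal sketch of that general idea only; it is not the SHARE or SADI implementation. The triple-pattern representation, the registry, the predicate names (`encodes`, `participatesIn`) and the two stand-in "services" with their toy data are all hypothetical, and the sketch assumes the patterns are ordered so that each subject variable is already bound when its pattern is reached.

```python
from typing import Callable, Dict, List, Tuple

# A query clause as a triple pattern: (subject variable, predicate, object variable).
Pattern = Tuple[str, str, str]


# Hypothetical stand-ins for remote Web services (toy data, no network calls).
def gene_to_protein(gene: str) -> List[str]:
    return {"geneA": ["protA1", "protA2"]}.get(gene, [])


def protein_to_pathway(protein: str) -> List[str]:
    return {"protA1": ["pathway1"], "protA2": ["pathway1", "pathway2"]}.get(protein, [])


# Hypothetical registry: which service can resolve which predicate.
REGISTRY: Dict[str, Callable[[str], List[str]]] = {
    "encodes": gene_to_protein,
    "participatesIn": protein_to_pathway,
}


def answer(patterns: List[Pattern], seed: Dict[str, str]) -> List[Dict[str, str]]:
    """Resolve each pattern in order by calling the service registered for its predicate."""
    rows = [dict(seed)]
    for subject, predicate, obj in patterns:
        service = REGISTRY[predicate]
        next_rows = []
        for row in rows:
            # Each value returned by the service extends the current partial answer.
            for value in service(row[subject]):
                next_rows.append({**row, obj: value})
        rows = next_rows
    return rows


query = [
    ("?gene", "encodes", "?protein"),
    ("?protein", "participatesIn", "?pathway"),
]
results = answer(query, {"?gene": "geneA"})
for row in results:
    print(row)
```

The workflow here is implicit in the order of the patterns: each clause selects one service, and the bindings produced by one service become the inputs of the next.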
The participants from these different domains collaboratively tackled real-world
issues in genome biology based on use cases described in the Methods section.
However, beyond deriving prototype solutions for each use case, another major purpose of
the BioHackathon series is to identify problems and weaknesses in current
technologies, such that “bio-hackers” can return to their respective groups with a clear focus on
areas of immediate need. As such, we conclude the paper with an extensive discussion
of the issues raised by our exploration of the problem/solution space; in particular,
issues related to data formats, the complexity of Web service interoperability, and the
need for semantics in bioinformatics data and tools.
Methods
The BioHackathon followed a use-case-driven model. First, genome biologists having
developmental, evolutionary, genetic and medical interests explained their data
retrieval, integration and analysis requirements. From these, four use-cases were developed
spanning three general domains of genomics data.
To address the use cases outlined in Table 1, developers of the end-user client
tools ANNOTATOR, Galaxy, BioMart, TogoDB, jORCA and Taverna presented the
features of their projects at the BioHackathon and how they might be utilized to solve
the use cases, and then collaboratively worked toward resolution for each. The
Table 1 Summary of technical problems and solutions for each use case
Use Case 1: Annotation of 100,000 invertebrate ESTs
Task: A researcher needs to annotate 100,000 sequences obtained from an invertebrate species and also needs to provide the result as a public database.
Strategy: Annotate sequences by similarity and complement these annotations for sequences showing no similarity by integrated analysis tools. Then, store the results into BioMart or TogoDB to make the database publicly available.
Problem: Needed to identify which tool was most suitable for each step. Some tools turned out to require a very long time to execute. The resulting annotations needed to be archived in a database and made accessible on the Web.
Solution: First, use relatively fast tools like Blast2GO and KAAS, then use ANNOTATOR for a limited number of sequences. BioMart is suitable for integration of remote BioMart resources like Ensembl, while TogoDB can be used to host databases without installation. Both database systems are accessible through the Web service interface for workflow tools like jORCA and Taverna.
Tools: Blast2GO, KAAS, ANNOTATOR, BioMart, TogoDB, TogoWS, jORCA, Taverna
Databases: Ensembl, BioMart, KEGG
Use Case 2: TFBS enrichment within differential microarray gene expression data
Task: Identify SNPs in transcription factor binding sites and visualize the result in a genome browser.
Strategy: Retrieve SNP and TSS datasets through the DAS protocol, then compute enrichment and export results for a DAS viewer.
Problem: Needed to integrate information from multiple databases and needed to customize the visualization.
Solution: Developed a custom-made prediction system for the data obtained from DAS sources, then customized the Ajax DAS viewer to show the result in a genomic view.
Tools: BioDAS, Ajax DAS viewer
Databases: FESD II, DBTSS
Use Case 3: Protein interactions among enzymes in a KEGG metabolic pathway
Task: Predict interacting pairs of proteins in a given metabolic pathway.
Strategy: Retrieve enzymes from a specified pathway and search pairs of homologous proteins forming complexes in a structure database.
Problem: Found version incompatibility between the server and client implementations of the SOAP protocol. Non-standard BLAST output format was returned by the PDBj Web service. There were no Web services to calculate a phylogenetic profile.
Solution: Switch programming languages according to the service in use. Programs were written to parse BLAST results and to generate a phylogenetic profile.
Tools: Java, OCaml, Perl, Ruby, BLAST, DDBJ WABI, PDBj Mine, KEGG API
Databases: DDBJ, KEGG, PDBj, UniProt
Use Case 4: Analyzing glyco-gene-related diseases
Task: Find human diseases which are potentially related to SNPs and glycans.
Strategy: Retrieve disease genes and search for homologs in other organisms for which glyco-gene interactions are recorded, then search for epitopes to identify glycans and retrieve their structures.
Problem: No Web service existed to query GlycoEpitopeDB or to convert a glycan structure in IUPAC format into KCF format. The output of the OMIM search was in XML and included entries which did not contain SNPs.
Solution: Implemented and registered BioMoby compliant Web services. Wrote a custom BeanShell script for a Taverna workflow.
Tools: Taverna, BioMoby, KEGG API
Databases: OMIM, H-InvDB, GlycoEpitopeDB, RINGS, Consortium for Functional Glycomics, GlycomeDB, GlycoGene DataBase, KEGG
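Use Case 3 in Table 1 mentions programs written to parse BLAST results and to generate a phylogenetic profile, without describing them. As an illustration only, the sketch below builds presence/absence profiles from BLAST hits. It assumes the standard 12-column tab-separated BLAST output (e-value in the eleventh column) rather than the non-standard format noted in the table. The e-value cutoff, the function names, the subject-to-organism mapping and the sample data are all hypothetical and are not taken from the BioHackathon code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_blast_tabular(lines: Iterable[str], max_evalue: float = 1e-5) -> Iterator[Tuple[str, str]]:
    """Yield (query_id, subject_id) for hits whose e-value is at or below max_evalue.

    Assumes 12 tab-separated columns per hit, with the e-value in column 11.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows instead of failing
        if float(fields[10]) <= max_evalue:
            yield fields[0], fields[1]


def phylogenetic_profiles(
    hits: Iterable[Tuple[str, str]],
    subject_to_organism: Dict[str, str],
    organisms: List[str],
) -> Dict[str, List[int]]:
    """Return, per query, a 0/1 vector marking organisms in which a homolog was found."""
    found = defaultdict(set)
    for query, subject in hits:
        organism = subject_to_organism.get(subject)
        if organism is not None:
            found[query].add(organism)
    return {
        query: [1 if organism in present else 0 for organism in organisms]
        for query, present in found.items()
    }


# Hypothetical sample data: two enzymes searched against three organisms.
blast_output = [
    "enzA\tsubj1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "enzA\tsubj2\t40.0\t100\t60\t0\t1\t100\t1\t100\t2.0\t30",
    "enzB\tsubj3\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-30\t150",
    "enzB\tsubj1\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-20\t120",
]
subject_to_organism = {"subj1": "org1", "subj2": "org2", "subj3": "org3"}
organisms = ["org1", "org2", "org3"]

profiles = phylogenetic_profiles(
    parse_blast_tabular(blast_output), subject_to_organism, organisms
)
print(profiles)
```

Queries with no hit passing the cutoff do not appear in the result. Enzymes with similar profiles across many organisms are candidates for functional association, which is how such profiles are typically used alongside structural evidence.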