The Firegoose: two-way integration of diverse data from different bioinformatics web resources with desktop applications

BioMed Central
BMC Bioinformatics
Open Access
Software
J Christopher Bare, Paul T Shannon, Amy K Schmid and Nitin S Baliga*
Address: Institute for Systems Biology, 1441 N 34th Street, Seattle, WA 98103, USA
Email: J Christopher Bare - cbare@systemsbiology.org; Paul T Shannon - pshannon@systemsbiology.org;
Amy K Schmid - aschmid@systemsbiology.org; Nitin S Baliga* - nbaliga@systemsbiology.org
* Corresponding author
Abstract
Background: Information resources on the World Wide Web play an indispensable role in
modern biology. But integrating data from multiple sources is often encumbered by the need to
reformat data files, convert between naming systems, or perform ongoing maintenance of local
copies of public databases. Opportunities for new ways of combining and re-using data are arising
as a result of the increasing use of web protocols to transmit structured data.
Results: The Firegoose, an extension to the Mozilla Firefox web browser, enables data transfer
between web sites and desktop tools. As a component of the Gaggle integration framework,
Firegoose can also exchange data with Cytoscape, the R statistical package, Multiexperiment
Viewer (MeV), and several other popular desktop software tools. Firegoose adds the capability to
easily use local data to query KEGG, EMBL STRING, DAVID, and other widely-used bioinformatics
web sites. Query results from these web sites can be transferred to desktop tools for further
analysis with a few clicks.
Firegoose acquires data from the web by screen scraping, microformats, embedded XML, or web
services. We define a microformat, which allows structured information compatible with the
Gaggle to be embedded in HTML documents.
We demonstrate the capabilities of this software by performing an analysis of the genes activated
in the microbe Halobacterium salinarum NRC-1 in response to anaerobic environments. Starting with
microarray data, we explore functions of differentially expressed genes by combining data from
several public web resources and construct an integrated view of the cellular processes involved.
Conclusion: The Firegoose incorporates Mozilla Firefox into the Gaggle environment and enables
interactive sharing of data between diverse web resources and desktop software tools without
maintaining local copies. Additional web sites can be incorporated easily into the framework using
the scripting platform of the Firefox browser. Performing data integration in the browser allows
the excellent search and navigation capabilities of the browser to be used in combination with
powerful desktop tools.
Published: 19 November 2007
BMC Bioinformatics 2007, 8:456 doi:10.1186/1471-2105-8-456
Received: 4 August 2007
Accepted: 19 November 2007
This article is available from: http://www.biomedcentral.com/1471-2105/8/456
© 2007 Bare et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Information resources on the World Wide Web play an
indispensable role in modern biology. Using a web
browser, the biologist can easily access a wealth of infor-
mation such as sequences, biochemical pathways, protein
interactions, functional domains, annotations, and gene
expression data. Yet, integrating data from diverse sources
remains challenging. Biologists wishing to analyze their
own experimental data in combination with publicly
available data face the cumbersome tasks of converting
file formats, reconciling incompatible schemas, and map-
ping between inconsistent naming systems. Several strate-
gies for data integration have been developed in the
context of biology [1]. Data warehousing and web services
are two general approaches. Semantic mapping between
different data sources has been accomplished by describ-
ing biological entities in terms of shared object models [2]
or curated ontologies [3]. Our team's previous work in
this area resulted in the Gaggle, a software environment
that integrates databases and software tools based on a
few simple data types and the principle of semantic flexi-
bility [4]. Previously, the Gaggle supported integration
with web resources through a special purpose browser of
limited capabilities. An improved solution was needed
due to the ubiquitous role of the browser and the increas-
ing relevance of the web to data integration.
The web is rapidly becoming a channel for structured data
as evidenced by the increasing adoption of web services.
Rich internet applications, microformats [5], and the
Semantic Web [6] are pushing the browser into a central
role as an information broker [7] working with data in
human and machine readable form side-by-side. Several
tools outside the domain of biology have augmented the
browser with these types of capabilities, including Grease-
monkey [8], Piggy Bank [9], and Operator [10]. These
technologies offer the ability to perform ad-hoc manipu-
lation of data from multiple sources on the web using
familiar browser based interfaces.
We developed the Firegoose to bring a full-featured
browser into the Gaggle environment and to allow easy
transfer of data between the web and the desktop. The
Firegoose is a toolbar for the Mozilla Firefox [11] browser
that makes use of a diverse array of web communication
protocols to seamlessly query and retrieve data from cus-
tom databases as well as popular bioinformatics resources
including KEGG [12], EMBL STRING [13,14], and DAVID
[15]. As a member of the Gaggle software environment,
the Firegoose can also exchange data with a growing col-
lection of desktop tools. Bringing the data integration
techniques of the Gaggle into the browser, the Firegoose
facilitates exploratory analysis of systems biology data
using web resources and desktop software tools.
Implementation
The Firegoose is implemented as an add-on extension to
Mozilla Firefox, a popular open-source web browser. Its
user interface is a toolbar defined in XUL, an XML dialect
for describing user interface components, and its actions
are scripted in JavaScript. The Firegoose communicates
with the Gaggle via the Java RMI protocol and supports
several modes of information exchange over the web.
The Gaggle approach
While it can be useful outside of the Gaggle, the Firegoose
is a component of the Gaggle software environment. The
goal of the Gaggle is to enable interactive exploration of
interrelated biological data by making data transfer
between analysis tools, databases, and now web
resources, quick and easy. Integration is achieved by sup-
porting the interchange of five simple data types, which
cover a wide variety of use-cases within the domain of sys-
tems biology.
Name list: a list of identifiers.
Associative array: a set of key/value pairs.
Data matrix: a two-dimensional grid of floating-point numbers,
with labels for each row and column.
Network: a graph in which attributes may be assigned to
nodes and edges.
Cluster: a set of identifiers of interest under a set of condi-
tions.
The central feature of the Gaggle is the broadcast. Applica-
tions become part of the Gaggle by implementing an
interface that allows them to accept broadcasts. A broad-
cast consists of sending a message holding one of the five
data types to a program called the Gaggle Boss, which
then relays the message to one or more receiving applica-
tions or targets.
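The broadcast mechanism can be illustrated with a minimal in-memory sketch. This is not the Gaggle's implementation: the real Boss is a separate Java program reached over RMI, and the class name, method names, and message fields below are assumptions chosen for illustration.

```javascript
// Hypothetical in-memory stand-in for the Gaggle Boss.
class Boss {
  constructor() {
    this.geese = new Map(); // goose name -> receiving application
  }

  // An application joins by registering an object that accepts broadcasts.
  register(name, goose) {
    this.geese.set(name, goose);
  }

  // Relay a message to one named target, or to every goose except the
  // sender when no target is given.
  broadcast(sender, message, target = null) {
    for (const [name, goose] of this.geese) {
      if (name === sender) continue;
      if (target !== null && name !== target) continue;
      goose.receive(message);
    }
  }
}

// One of the five data types: a name list (field names are illustrative).
const nameList = {
  type: "namelist",
  species: "Halobacterium sp. NRC-1",
  names: ["VNG0355G", "VNG0716G", "VNG1175G"],
};

// Each receiver interprets the message in its own context; this one
// only records how many names arrived.
const received = [];
const boss = new Boss();
boss.register("viewer", { receive: (msg) => received.push(msg.names.length) });
boss.register("browser", { receive: () => {} });
boss.broadcast("browser", nameList, "viewer");
```

Because the message is a plain data structure, the sender needs no knowledge of the receiver's object model.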
Using the simple Gaggle data types as an intermediary has
the advantage that applications can share data without
sharing common object models. Each program is free to
handle broadcasts in a way that makes sense in its own
context. This is semantic flexibility, the first of the Gaggle's
guiding principles. The second is to keep the set of data
types as small as possible, minimizing the programming
effort required to integrate a new application into the
Gaggle environment.
The Gaggle environment currently includes Cytoscape
[16] for viewing networks, the R statistical package [17],
Multiexperiment Viewer (MeV) [18] for microarray analysis,
the spreadsheet-like Data Matrix Viewer (DMV), and
several other programs. The term "gaggled" is used to indi-
cate that a software application or database has been con-
figured to exchange messages with the Boss; for example,
Cytoscape has been gaggled through the Cygoose plug-in.
The Firegoose uses the power of the browser to extend the
reach of the Gaggle to target web resources based on
widely supported protocols. The Firegoose communicates
with the Gaggle Boss via Java RMI. To communicate with
web resources, the Firegoose supports standard web pro-
tocols and data encoding schemes such as HTTP, HTML,
XML, and SOAP. A schematic view (Fig. 1) shows the flow
of messages over various protocols between desktop
applications, the Gaggle Boss, the Firegoose, and several
websites.
Exchanging data with web resources
When data is broadcast to the Firegoose, it can be rebroad-
cast, under user control, to any supported website. A
broadcast to a web resource is typically used to build a
query against an underlying database. For example, a set
of nodes representing genes may be selected in a Cyto-
scape network. Broadcasting this data from Cytoscape to
the Firegoose causes a list of gene identifiers to appear in
the Firegoose's broadcast data menu. To broadcast to the
KEGG Pathway database, the user would select the gene
list and then select the target "KEGG Pathway". Clicking
"Broadcast" submits a request to KEGG for the pathways
in which those genes participate (Fig. 2).
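As a rough sketch of how a broadcast name list might become a query, the function below encodes a species code and a list of identifiers into a URL. The base URL and the parameter names are placeholders and do not describe the interface of KEGG or any other real site.

```javascript
// Build a query URL from a broadcast name list.
// baseUrl, "species" and "ids" are hypothetical placeholders.
function buildQueryUrl(baseUrl, species, names) {
  const params = new URLSearchParams();
  params.set("species", species);
  params.set("ids", names.join(" "));
  return `${baseUrl}?${params.toString()}`;
}

const queryUrl = buildQueryUrl(
  "https://example.org/pathway-search",
  "eco",
  ["b0728", "b0729"]
);
```

Opening such a URL in a browser tab has the same effect as typing the identifiers into the site's search form by hand.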
More importantly, the Firegoose also brings data back
from the web in a usable form. Traditionally, a web server
responds to a request by sending an HTML document,
which the browser renders for display. The HTML format
in which web pages are encoded describes presentation
but is poorly suited for representing the underlying struc-
ture, or metadata, necessary for integration and re-use by
automated processes. In response to this shortcoming, a
variety of encodings have been developed including XML,
SOAP, and microformats. Firegoose supports several
means of data interchange, described in detail below, in
order to interoperate with a wide variety of data providers.
Screen Scraping
HTML does not represent structured data well. However,
by making some assumptions about how the data are for-
matted, it is possible to reconstruct the structure that is
lost in HTML. This process, known as screen scraping, has
a long history on the web.
The Firegoose scrapes pages generated by the KEGG Pathway
database to acquire a list of genes present in a given
pathway. The pathway diagrams generated by KEGG contain
links from each gene to its sequence and annotations.
The links take the following form:
href="/dbget-bin/www_bget?eco+b0728+b0729".
This allows a simple script to glean the information that
the two Escherichia coli genes b0728 and b0729 play a role
in the current pathway, in this case the citrate cycle. By
scanning all such links, a list can be compiled of all genes
that are members of the pathway in a selected organism,
which fits nicely into the name-list data type of the Gaggle.
A biologist may, for example, broadcast this list to a
visualization tool for microarray data and check whether
genes whose products play a role in the same metabolic
pathway are differentially expressed under specific conditions.
Figure 1. Communication in the Gaggle. Software and databases shown as red dots send and receive broadcasts via Java RMI. The
blue nodes are web resources connected to the Gaggle through the Firegoose and accessed using HTTP, with other protocols
and formats such as HTML, XML, and SOAP layered on top.
[Diagram labels. Desktop side, connected to the Boss by RMI: visualization (Cytoscape: interactions; DMV: matrices), databases (microarray, proteomics, ChIP-chip, annotations, protein CoIP), analysis (R statistical environment, MeV). Web side, connected through the Firegoose: KEGG (metabolic pathways), NCBI (genes, proteins), DAVID (functional annotations), EMBL STRING (functional associations), SBEAMS (Halobacterium functional annotations).]
While it can be effective, screen scraping is inelegant and
prone to breakage. It requires code to be written that is
specific for an individual site and will likely require main-
tenance whenever the site makes formatting changes. For
the Firegoose, screen scraping is one means of acquiring
data, but is not the preferred option. One simple solution
to the lack of structure inherent in web pages is found in
microformats.
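The link-scanning step described in this section can be sketched as follows. The link form is the one quoted above; the function name, the sample markup, and the use of a regular expression on an HTML string (rather than the browser's DOM) are assumptions made to keep the example self-contained.

```javascript
// Extract the organism code and gene identifiers from links of the form
// href="/dbget-bin/www_bget?eco+b0728+b0729".
function scrapeGenes(html) {
  const linkPattern = /href="\/dbget-bin\/www_bget\?([^"]+)"/g;
  const genes = new Set(); // a Set removes genes linked more than once
  let organism = null;
  for (const match of html.matchAll(linkPattern)) {
    const [org, ...ids] = match[1].split("+");
    organism = organism || org;
    ids.forEach((id) => genes.add(id));
  }
  return { organism, genes: [...genes] };
}

// Hypothetical fragment of a pathway page.
const samplePage = `
  <area href="/dbget-bin/www_bget?eco+b0728+b0729" />
  <area href="/dbget-bin/www_bget?eco+b0729" />
  <a href="/kegg/pathway.html">Pathway index</a>
`;
const scraped = scrapeGenes(samplePage);
```

The pattern is tied to one site's URL layout, which is exactly why this approach needs maintenance whenever the site changes its formatting.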
Microformats and Embedded XML
Microformats embed machine-readable data within valid
HTML. The structure missing in HTML is supplied in CSS
class attributes, giving the parser the necessary clues to
extract data systematically from the page. The same data
elements may do double duty accommodating both on-
screen display and machine-readability. A browser exten-
sion like the Firegoose is ideally positioned to augment
web browsing with new capabilities enabled by data
embedded in the web page.
We have defined a microformat to represent the Gaggle
data types, specifications for which can be found in the
Gaggle microformat reference [19]. This format allows
any of the Gaggle data types to be embedded directly into
a web page, either displayed in the page or hidden from
the user. The toolbar detects embedded data and makes
that data available for broadcast to other web sites or
desktop applications. An example of the Gaggle microformat
is shown (Fig. 3), which encodes a list of three genes
involved in signal transduction.
Figure 2. The Firegoose toolbar for Mozilla Firefox. A broadcast is sent in three steps: select data to be broadcast (i), select the
target of the broadcast (ii), and click 'Broadcast' to send (iii). In this example, a list of genes has been broadcast to KEGG. The
result shows that the genes are part of the oxidative phosphorylation pathway.
[Screenshot labels: administrative functions; select data to be broadcast (i); select target for next broadcast (ii); send broadcast (iii); desktop target geese; web site targets.]
As with microformats, XML can be embedded directly in
web pages. XML has the advantages of a well-defined syn-
tax and good support by parsers and tools. Adoption of
this technique has been hindered by the fact that web
pages containing embedded XML fail HTML and XHTML
validation. This may not be an obstacle in practice since
browsers do not perform validation and simply ignore
unrecognized tags. We implemented an XML format for
Gaggle name lists to communicate lists of genes from a
web view of a database of gene annotations for H. sali-
narum NRC-1 [20].
Both approaches offer the same benefit: they allow data to be
captured without requiring site-specific screen scraping code
to be written and maintained. By adding a small amount of standard
markup to their web pages, website providers make their
site accessible to any tool that recognizes that format. In
the case of the Gaggle microformat, websites can be gaggled
with no additional code in the Firegoose and only a
modest effort on the server side.
Embedding data directly in HTML works well for moder-
ate amounts of data. Where the size of the data is prohib-
itive, a microformat embedded in the page could provide
the parameters needed by tools like Firegoose to access
related data through external channels such as web serv-
ices or direct downloads.
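To illustrate how little code a microformat reader needs, the sketch below extracts a name list from markup that uses the class names shown in the Gaggle microformat example (gaggle-data, gaggle-name, gaggle-species, gaggle-namelist). It is not the Firegoose parser: a browser extension would normally walk the DOM, whereas this version uses regular expressions on a string so that it runs outside a browser, and the function names are hypothetical.

```javascript
// Return the text of the first element whose class attribute is exactly
// the given class name, or null if there is none.
function textOfClass(html, className) {
  const pattern = new RegExp(`class="${className}"[^>]*>([^<]*)<`);
  const match = html.match(pattern);
  return match ? match[1].trim() : null;
}

// Parse one embedded name list; returns null when no Gaggle data is present.
function parseGaggleNameList(html) {
  if (!html.includes('class="gaggle-data"')) return null;
  const names = [];
  const listStart = html.indexOf('class="gaggle-namelist"');
  if (listStart !== -1) {
    // Restrict the search for list items to the namelist block.
    const listEnd = html.indexOf("</div>", listStart);
    const block = html.slice(listStart, listEnd === -1 ? undefined : listEnd);
    for (const item of block.matchAll(/<li>([^<]*)<\/li>/g)) {
      names.push(item[1].trim());
    }
  }
  return {
    name: textOfClass(html, "gaggle-name"),
    species: textOfClass(html, "gaggle-species"),
    names,
  };
}

const embeddedHtml = `
<div class="gaggle-data">
  <p>name= <span class="gaggle-name">Signal transduction genes</span></p>
  <p>species= <span class="gaggle-species">Halobacterium sp. NRC-1</span></p>
  <div class="gaggle-namelist">
    <ol>
      <li>VNG0355G</li>
      <li>VNG0716G</li>
      <li>VNG1175G</li>
    </ol>
  </div>
</div>`;
const parsedList = parseGaggleNameList(embeddedHtml);
```

The same parser works on any page that carries this markup, regardless of which site produced it.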
Web Services
A web service is a programmatic interface allowing appli-
cations running on heterogeneous platforms to interoper-
ate over a network. Typically, the messages passed back
and forth are XML. The Firegoose can access web services
from within the browser, forging a direct link from the
presentation of a web site to its underlying data.
For example, KEGG offers extensive web services. The list
of genes in a pathway can be requested by calling the
get_genes_by_pathway SOAP method, which returns the
same list of genes acquired above by screen scraping, but
in an easily processed XML format. A parameter specifies
the pathway and organism in which we are interested. For
example, the string "path:eco00020" denotes the tricarbo-
xylic acid (TCA) cycle pathway in E. coli. In order to find
these parameters, a nominal amount of screen scraping
and case-specific scripting is still necessary.
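The small amount of scraping needed to find the web service parameter might look like the sketch below, which derives a string such as "path:eco00020" from the address of the pathway page being viewed. The URL layout and the shape of the request object are assumptions for illustration; only the method name get_genes_by_pathway and the parameter format come from the text, and no SOAP message is constructed here.

```javascript
// Derive the pathway parameter from a page URL that ends in an organism
// code followed by a five-digit pathway number (assumed layout).
function pathwayParamFromUrl(url) {
  const match = url.match(/[?&/]([a-z]{3,4})(\d{5})\b/);
  return match ? `path:${match[1]}${match[2]}` : null;
}

// Describe the web service call to be made; sending it is left out.
function buildPathwayRequest(url) {
  const pathway = pathwayParamFromUrl(url);
  if (pathway === null) return null;
  return { method: "get_genes_by_pathway", args: [pathway] };
}

const pathwayRequest = buildPathwayRequest(
  "https://example.org/kegg-bin/show_pathway?eco00020"
);
```

A handler built this way uses the page only to locate the record and relies on the web service for the structured data itself.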
Accessing web services from within the browser is not typ-
ical, but offers some compelling advantages. The familiar
user-friendly interface of a web application can be used to
navigate through a database and then the web service can
be invoked to acquire the desired records in structured
form amenable to further computation. The browser sup-
ports an interactive style of usage without the need to
build a customized client to the web service.
Currently, there is no standard way of linking data dis-
played in a web page with a query to a web service that
will return the same data in a structured form. The Fire-
goose uses screen scraping to fill in this gap. A standard-
ized microformat for this purpose could facilitate this
kind of interactive access to structured data.
Website Handlers
A website may provide data via a web service, an embedded
microformat, or other channels. If not, screen scraping
may be required. For the sake of modularity, we
encapsulate the details of interacting with each individual
site or specific format into a separate component called a
"handler".
Figure 3. Example of the Gaggle Microformat. Microformats embed structured data in web pages by using CSS class attributes as markup.
<html>
  <div class="gaggle-data">
    <p>name= <span class="gaggle-name">Signal transduction genes</span></p>
    <p>species= <span class="gaggle-species">Halobacterium sp. NRC-1</span></p>
    <p>(optional)size= <span class="gaggle-size">3</span></p>
    <div class="gaggle-namelist">
      <ol>
        <li>VNG0355G</li>
        <li>VNG0716G</li>
        <li>VNG1175G</li>
      </ol>
    </div>
  </div>
</html>
For each target website or format, a handler is written and
packaged in a separate JavaScript source file that contains
code specific to that target. The handler implements a
common interface (Fig. 4) and is responsible for recogniz-
ing web pages with which it can interact and transferring
data to and from its target. Recognition can be based
either on the URL or the contents of the page, which
means we can write a handler for a particular web site or
a handler that understands a data format embedded in
pages from several sources.
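A handler for a single site might be organized along the following lines. The method names follow the handler interface described in the paper (recognize, show, getPageData, handleNameList); the site address, the plain-object stand-in for a browser document, and the openTab callback are hypothetical and exist only so the sketch can run outside Firefox.

```javascript
// Create a handler for a hypothetical site. openTab is an injected
// function that would open a URL in a new browser tab.
function makeExampleHandler(openTab) {
  const baseUrl = "https://example.org/genes";
  return {
    // Recognition here is based on the URL; a format-oriented handler
    // would inspect the page contents instead.
    recognize(doc) {
      return doc.url.startsWith(baseUrl);
    },
    show() {
      openTab(baseUrl);
    },
    // Describe what can be acquired from the page.
    getPageData(doc) {
      return [{ name: doc.title, type: "namelist", names: doc.geneIds }];
    },
    // Submit a broadcast name list to the site as a query.
    handleNameList(species, names) {
      openTab(`${baseUrl}?ids=${encodeURIComponent(names.join(","))}`);
    },
  };
}

const openedTabs = [];
const exampleHandler = makeExampleHandler((url) => openedTabs.push(url));
const examplePage = {
  url: "https://example.org/genes/list",
  title: "Example gene list",
  geneIds: ["b0728", "b0729"],
};
```

Keeping all site-specific details inside one object means that a change to the site affects only its own handler file.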
When a document is loaded in the browser, the toolbar
calls the recognize method for each handler until a handler
recognizes the document. Then the getPageData method
is called for that handler. The handler then inspects the
document and constructs a list of GaggleData objects that
can be acquired from the page. This, in turn, is used to
populate the broadcast menu.
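The recognition loop described above can be sketched in a few lines. The handler objects and the plain-object document are stand-ins invented for this example; in the extension, doc would be the browser's document object.

```javascript
// Ask each handler in turn; the first one that recognizes the document
// supplies the data used to populate the broadcast menu.
function collectPageData(handlers, doc) {
  for (const handler of handlers) {
    if (handler.recognize(doc)) {
      return handler.getPageData(doc);
    }
  }
  return []; // no handler recognized the page
}

// Two hypothetical handlers: one keyed on the URL, one on page contents.
const urlHandler = {
  recognize: (doc) => doc.url.startsWith("https://example.org/pathway"),
  getPageData: () => [{ name: "Genes in pathway", type: "namelist" }],
};
const microformatHandler = {
  recognize: (doc) => doc.body.includes('class="gaggle-data"'),
  getPageData: () => [{ name: "Embedded name list", type: "namelist" }],
};

const loadedDoc = {
  url: "https://example.org/tutorial",
  body: '<div class="gaggle-data"></div>',
};
const menuEntries = collectPageData([urlHandler, microformatHandler], loadedDoc);
```

Handler order matters in this scheme, since the loop stops at the first match.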
The GaggleData object represents data from the website in
the form of one of the Gaggle data types. Support for lazy
instantiation and asynchronous access to web services can
be neatly hidden behind its simple interface. Typically,
getPageData performs only enough work to generate
descriptive information for each data object. The actual
parsing or issuing a request to a web service can be
deferred until the user requests a broadcast. This mini-
mizes processing overhead and prevents unnecessary net-
work traffic.
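One way to defer the expensive work is to wrap it in a function that runs only on first use, as in the sketch below. The factory name, the getData method, and the loader are assumptions; the paper does not show the actual GaggleData interface. An asynchronous variant would return a promise from getData instead of the value.

```javascript
// A data object that carries descriptive fields immediately and defers
// the costly work until the data is actually requested.
function lazyGaggleData(name, type, load) {
  let cached; // result of load(), filled on first use
  let loaded = false;
  return {
    name,
    type,
    getData() {
      if (!loaded) {
        cached = load();
        loaded = true;
      }
      return cached;
    },
  };
}

let parseCount = 0;
const pathwayGenes = lazyGaggleData("Genes in pathway", "namelist", () => {
  parseCount += 1; // stands in for parsing a page or calling a web service
  return ["b0728", "b0729"];
});
```

The description is available for the broadcast menu at once, while the loader runs at most one time, and only if the user broadcasts the data.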
Any of the Gaggle data types may be returned from the
getPageData method, but so far we have only imple-
mented broadcasting lists of names to web resources. Typ-
ically, a broadcast to a website is transformed into a query,
which retrieves information relevant to a list of genes, pro-
teins, or other identifiers. In principle, methods could eas-
ily be defined for each of the other data types:
handleNetwork, handleMatrix, etc. In keeping with the
nature of dynamic languages like JavaScript, all of the
"handle..." methods are optional.
Results and discussion
Case Study: synthesis of a model of anaerobic physiology
in H. salinarum NRC-1 using Firegoose
To demonstrate the effectiveness of the Firegoose in a typical
systems biology investigation, we explore gene
expression in Halobacterium salinarum NRC-1 in response
to fluctuating oxygen concentration [21]. Briefly, H. salinarum
NRC-1 is a halophilic archaeon with a small
genome of 2.6 Mb that contains ~2,400 protein-coding
genes [22]. This organism is most prolific in aerobic conditions
but switches facultatively to other modes of energy
production in anoxic environments. We will take as our
starting point a list of genes that were found in microarray
experiments to be actively expressed under low oxygen
conditions. To understand the physiological changes
associated with the anoxic state as completely as possible,
we need to understand the functions of individual pro-
teins, metabolic pathways encoded by genes of known
function, and functional associations among genes
through evolutionary and literature analysis. Not all of
this information is contained within one resource; for
instance, whereas KEGG specializes in information regarding
metabolic pathways, STRING calculates functional
associations among proteins through comparative analy-
sis of sequence, literature and publicly available experi-
mental data, and DAVID classifies proteins into enriched
functional clusters. Finally, using our own expertise we
have curated function assignments to many proteins in H.
salinarum NRC-1 using a combination of sequence and
structure-based approaches. In this example, we will dem-
onstrate how the Firegoose can significantly aid in func-
tionally characterizing a set of genes by enabling seamless
exploration and integration of several web-based tools
from separate providers including KEGG, EMBL STRING,
DAVID, and an in-house annotation database for H. sali-
narum NRC-1. First, we will briefly explain how these
genes were identified.
Figure 4. Interface of a handler script. The handler interface provides
an extension point for adding support for new websites,
web services, and protocols.
/**
 * check the given doc to see if we can parse it.
 * Return true if so, or false otherwise.
 */
handler.recognize = function(doc) {...};
/**
 * open the web site in a new browser tab.
 */
handler.show = function() {...};
/**
 * return a list of GaggleData objects representing the
 * data related to the page as one of the Gaggle data
 * types.
 */
handler.getPageData = function(doc) {...};
/**
 * takes a species and a list of names and submits them
 * for processing by the website. List can be either a
 * Java Array or a JavaScript Array.
 */
handler.handleNameList = function(species, names) {...};
Halobacterial cells were subjected to differing levels of
oxygen, and samples were collected at varying oxygen concentrations
and times. Total RNA from these samples was
analyzed using microarray analysis [23]. The DMV (Data
Matrix Viewer) and the R statistical package were used to
normalize and select genes whose expression was signifi-
cantly changed in response to the perturbations in oxy-
gen. MeV, a microarray data analysis tool, was then used
to cluster the differentially expressed genes by their
expression profiles. Two sets of genes emerged, one activated
in the presence of oxygen and another activated in
the absence of oxygen. Additional details about this part of
the analysis are included in the tutorial web page [24]. We
will concentrate on the anaerobically induced genes.
The remainder of the case study traces our inquiry into the
functional roles these genes may be playing. Using Fire-
goose, we consult multiple remote data sources and trans-
fer data between them and local desktop tools. For
demonstration purposes, the lists of genes in both aerobic
and anaerobic clusters were encoded in the Gaggle micro-
format and embedded in the tutorial [24], which docu-
ments the steps of the analysis. The reader can install the
Firegoose, browse to the tutorial web page, and easily
reproduce the analysis that follows and is encouraged to
do so. The analysis is broken into numbered steps to cor-
respond with data transfers labeled in red in the diagram
(Fig. 5).
Step 1
We first consult KEGG, a database of curated biochemical
pathways. Starting at the tutorial web page, we broadcast
the embedded anaerobic gene list to the KEGG Pathway
target. This has the same result as cutting and pasting into
the KEGG interface; KEGG performs a query for biochem-
ical pathways in which these genes participate. Thirty-one
of the 222 genes that were queried matched KEGG path-
ways including amino acid metabolism, ABC transporters,
and active potassium transport. Some of these pathway
matches are consistent with known physiological properties
of H. salinarum NRC-1, such as its ability to derive
energy facultatively by fermenting arginine under anaerobic
conditions. This analysis also suggests that uptake systems
for several alternate nutritional sources (e.g.
glycerol) may be utilized under anaerobic conditions.
Interestingly, KEGG also finds two transcription factors in
this anaerobic set, tfbA and tfbE, providing clues to possi-
ble regulatory mechanisms.
Although the information gathered from this first line of
analysis has yielded considerable insight into the physio-
logical adjustment to an anoxic environment, 191 genes
did not match any enzymes within the KEGG catalog of
pathways. At this point, we could narrow our investigation
to genes in a particular pathway or to the unmatched
genes, which Firegoose can capture and broadcast
onward. For the next step, however, we will continue with
the full set of anaerobically induced genes.
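Separating matched from unmatched genes is a set difference; in the case study, 222 queried genes minus 31 matches leaves the 191 unmatched genes mentioned above. The sketch below shows the operation on placeholder identifiers, which are not genes from the study.

```javascript
// Return the queried identifiers that are absent from the matched list.
function unmatchedGenes(queried, matched) {
  const matchedSet = new Set(matched);
  return queried.filter((gene) => !matchedSet.has(gene));
}

// Placeholder identifiers for illustration only.
const queriedGenes = ["geneA", "geneB", "geneC", "geneD"];
const matchedGenes = ["geneB", "geneD"];
const remainingGenes = unmatchedGenes(queriedGenes, matchedGenes);
```

Either list can then be broadcast onward as an ordinary name list.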
Step 2
To continue our analysis, we use EMBL STRING, a power-
ful tool focused on protein interactions computed from
supporting evidence such as sequence homology, journal
abstracts, and protein domains. Specifically, we are inter-
ested in finding functional associations among genes
already classified into metabolic pathways by KEGG and
genes of unknown function. We broadcast our list of
anaerobically induced genes from the tutorial page to
STRING, which responds by displaying a network of
nodes (proteins) connected by colored edges representing
functional relationships.
We find strongly interconnected components of the net-
work that are associated with specialized metabolism and
do not have corresponding pathways in KEGG. For
instance, a cluster of six genes appears in the network con-
nected by edges indicating chromosomal proximity, co-
occurrence across genomes, and text mining results.
STRING gives annotations and protein domains that
show these genes to be involved in the use of dimethyl
sulfoxide (DMSO) as an alternative electron acceptor [25].
A second gene cluster, connected through chromosomal
proximity associations, encodes gas vesicles used by the
organism to vertically orient itself in the water column
[26].
Step 3
Another interesting grouping links four genes of unknown
function: VNG1183H, VNG1184G, VNG1185G, and
VNG1187G. VNG1187G contains two multicopper
oxidase domains. VNG1185G is annotated as a coenzyme
PQQ synthesis protein and VNG1184G is annotated as a
heme biosynthesis protein. Recentering the network on
VNG1187G and expanding twice (to reveal second neigh-
bors in the network) shows functional relationships to
several other proteins including more heme biosynthesis
proteins. Because not all of these genes were differentially
regulated in response to oxygen, the initial seed generated from
expression analysis may have captured some pathways only
partially, including this one. However, by
combining, filtering, and expanding that seed using a variety
of resources, we can navigate towards a comprehensive
picture.
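Expanding a network view twice from a central node corresponds to collecting all nodes within two edges of it. The sketch below does this with a breadth-first search over an undirected edge list. It illustrates the idea only and says nothing about how STRING computes its views; the node names are placeholders.

```javascript
// Return all nodes reachable from start in at most maxDepth steps,
// excluding start itself. Edges are undirected [a, b] pairs.
function neighborsWithin(edges, start, maxDepth) {
  const adjacency = new Map();
  for (const [a, b] of edges) {
    if (!adjacency.has(a)) adjacency.set(a, new Set());
    if (!adjacency.has(b)) adjacency.set(b, new Set());
    adjacency.get(a).add(b);
    adjacency.get(b).add(a);
  }
  const depth = new Map([[start, 0]]);
  const queue = [start];
  while (queue.length > 0) {
    const node = queue.shift();
    if (depth.get(node) === maxDepth) continue; // do not expand further
    for (const next of adjacency.get(node) || []) {
      if (!depth.has(next)) {
        depth.set(next, depth.get(node) + 1);
        queue.push(next);
      }
    }
  }
  depth.delete(start);
  return [...depth.keys()];
}

// Placeholder chain: seed - n1 - n2 - n3.
const toyEdges = [["seed", "n1"], ["n1", "n2"], ["n2", "n3"]];
const secondNeighbors = neighborsWithin(toyEdges, "seed", 2);
```

In the toy chain, n3 lies three edges from the seed and is therefore left out of the two-step expansion.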
Step 4
STRING makes its networks available in an XML format,
which the Firegoose can parse and broadcast to other Gag-
gle tools. Broadcasting the network to Cytoscape allows
the user to work with the network more interactively. For
example, subsets of nodes can be selected and broadcast,
or highlighted in response to broadcasts from other tools.
We select the four-gene cluster noted above and broadcast
to the Firegoose for further investigation.
Figure 5. Using Firegoose to investigate transcription response of H. salinarum to anoxic conditions. A list of genes with
similar expression profiles is found by microarray analysis. (1) We broadcast these genes to KEGG to query for known biochemical
pathways, for example, viewing the Arginine and proline metabolism pathway. (2) We broadcast the genes to EMBL
STRING, a protein interaction database, where we can (3) navigate to functionally related genes. (4) In order to manipulate the
network, we broadcast it to Cytoscape. (5) A local database provides additional information. (6) DAVID performs functional
clustering. (7) For one cluster containing signal transduction genes, we use STRING's links to protein domains.
Step 5
The H. salinarum NRC-1 proteome has recently been re-annotated
using a combination of sequence and structure-based
approaches. These functional annotations are
curated and made publicly available on the web [27]
through SBEAMS [20], an open source data management
system. In this database, VNG1187G is annotated as a
putative copper containing nitrite reductase. Drilling
down using the links provided within the SBEAMS data-
base we see that the supporting evidence is a match to
entry 1KBV (the membrane protein AniA, a copper-con-
taining nitrite reductase from Neisseria Gonorrhoeae; e-
value < 10
-166
) in the Protein Data Bank. This function in
conjunction with the functional associations with puta-
tive heme biosynthesis proteins and a putative coenzyme
PQQ synthesis protein form the basis of a hypothesis that
H. salinarum NRC-1 may possess a nitrite reducing path-
way.
Step 6
DAVID is a functional annotation tool that integrates
many of the same primary sources as KEGG and STRING
but displays its results in tabular format rather than as a
network.
DAVID differs from the other resources we have used in that it does
not work well with the VNG naming system used to identify genes in
H. salinarum NRC-1. We use a Gaggle-integrated translator utility to
translate gene identifiers between the VNG nomenclature and GI
accession numbers suitable to be broadcast to DAVID. The issue of
multiple naming schemes is a major bottleneck for data integration in
biology. Passing broadcasts through a simple synonym-mapping translator
within the Gaggle helps overcome this hurdle.
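A synonym-mapping translator of the kind described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Gaggle translator itself: the `translate` function is hypothetical, and the GI values in the table are made-up placeholders rather than real accession numbers.

```python
def translate(names, synonyms):
    """Map each name through the synonym table.

    Returns (translated, unmapped) so that identifiers without a synonym
    stay visible instead of being silently dropped.
    """
    translated, unmapped = [], []
    for name in names:
        if name in synonyms:
            translated.append(synonyms[name])
        else:
            unmapped.append(name)
    return translated, unmapped

# Placeholder table: these GI values are invented for the example.
VNG_TO_GI = {
    "VNG1187G": "GI:0000001",
    "VNG1185G": "GI:0000002",
}

translated, unmapped = translate(
    ["VNG1187G", "VNG1185G", "VNG9999X"], VNG_TO_GI
)
```

Reporting the unmapped names separately is a design choice for the sketch: when a broadcast passes through a translator, it is useful to know which genes the target web site will never see.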
DAVID also clusters genes with related annotations, conveniently
summarizing the kinds of cellular processes for which a set of genes is
enriched. Our genes of interest divide into 17 clusters, several of
which are consistent with the results from previous steps. For example,
seven genes, including one member of the DMSO cluster, have products
annotated as being involved in electron transport.
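To make the notion of enrichment concrete, the sketch below computes a one-sided hypergeometric tail probability in Python. This is a generic stand-in: the statistic DAVID itself uses is not specified in this text, the function name is hypothetical, and the counts in the example call are invented rather than taken from the analysis.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Invented counts: 2000 genes in the genome, 40 annotated with the term,
# 150 genes in the query list, 7 of which carry the term.
p_value = hypergeom_enrichment_p(population=2000, annotated=40, sample=150, hits=7)
```

A small value indicates that the term appears in the query list more often than random sampling from the genome would suggest.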
Step 7
DAVID produces another cluster containing signal trans-
duction proteins. Back in STRING, we can click on these
proteins to view functional domains provided by the
SMART database [28,29]. VNG0355G (Htr14) contains
sensory domains associated with chemotaxis. Both
VNG0716G (AfsQ2) and VNG1175G (PhoR) contain
domains implicated in signal transduction and light sensitivity. This
is biologically meaningful because light and oxygen physiology in
H. salinarum NRC-1 are tightly coupled with one another and with
physical relocation (taxis).
Summary of results
In this example we have combined four web-based
resources with several desktop applications to build a uni-
fied understanding of the physiology of H. salinarum
NRC-1 under anaerobic conditions. The depletion of oxy-
gen seems to activate certain aspects of amino acid metab-
olism, and alternate energy transduction pathways such as
phototrophy, arginine fermentation and DMSO respira-
tion. The simultaneous induction of gas vesicle synthesis
is consistent with the anaerobic physiological behavior of
H. salinarum NRC-1 to move towards the surface in search
of alternate energy sources including light. The data also
suggest that H. salinarum NRC-1 may have some specific
nutritional requirements in this anoxic environment that
cause the induction of an array of membrane transport
systems and possibly previously uncharacterized func-
tions such as dissimilatory nitrite reduction. In addition,
putative components of the signaling and regulatory
mechanisms that may mediate the transition to this met-
abolic state were also detected.
It is important to recognize that the information
resources, while overlapping to some degree, provide
complementary perspectives. KEGG provides information
regarding biochemical pathways but does not draw asso-
ciations between these pathways and genes of unknown
function. This information can be obtained from STRING,
which on its own is not cognizant of the metabolic pathways. Likewise,
DAVID includes information on functional domains within proteins and
integrates many of the same primary sources as KEGG and STRING.
However, DAVID presents the information differently: its clustering
feature provides a statistical evaluation of the enrichment of
particular functions among the queried genes. Finally, our local
database provides a
source of data curated manually by experts through years
of careful literature surveys and experimentation. Integra-
tion of these web-based resources with desktop applica-
tions through the Gaggle and the Firegoose enables the
type of analysis that is necessary to understand the com-
plex dynamic regulation of cellular responses (Fig. 6).
Complete data from each of our four sources can be found
in the supplementary table (Additional file 1).
Comparison with workflow software
Firegoose, along with the Gaggle framework, shares several features
with workflow tools such as Taverna [30,31]. They share the strategy of
composing distinct programs and data sources to build larger systems
with rich capabilities. To this end, both Firegoose and workflow tools
benefit greatly from the availability of programmatic access to
structured data and computational services over common web protocols.
Workflow tools enable a user to automate a well-defined and repeatable
process, often using web services or message queues for interprocess
communication. But before a well-defined analysis process exists, there
is a need for exploratory analysis, which is necessarily ad hoc. Gaggle
and the Firegoose seek to enable this kind of interactive exploration
by exploiting the flexibility of web-based tools in combination with
desktop analysis and visualization tools. Scripting an interaction with
a web site is, of course, possible using a browser extension, but the
emphasis in the Firegoose is on automating the exchange of data,
leaving the direction of the analysis up to the user.
The difference in emphasis does not rule out using work-
flow engines and Firegoose together. Invoking workflows
on a remote server from within Firegoose is one poten-
tially valuable example.
Figure 6
Integrated visualization of H. salinarum response to anoxic
conditions. The transcriptional response to anoxia was characterized
using several data sources. Edges represent several types of evidence
for functional association provided by STRING. Yellow filled nodes
indicate genes classified by KEGG. Blue outline nodes indicate genes
classified by DAVID. Other nodes were characterized by an in-house
annotation database or other sources, including PFAM, BLAST, and PDB.
102 genes of unknown function were omitted.
Functional groups labeled in the figure: DMSO respiration; gas vesicle
proteins; phototrophy; ABC transporters; transporters; amino acid
metabolism; transcription factors; signal transduction; electron
transport; putative chitinase; cobalamin binding; cell division /
nucleic acid metabolism; putative nitrite reduction pathway; archaeal
DNA polymerase.
Edge legend (STRING evidence types): Neighborhood, Gene Fusion,
Co-occurrence, Coexpression, Databases, Textmining.
Future Work
Additional functionality could be added to the Firegoose
in a number of ways, most easily through the addition of
more handlers for biological websites. We also considered
that users might want to develop their own handler
scripts. We prototyped code for dynamically importing
custom scripts into the Firegoose. Other projects such as
Greasemonkey [8] have had success with similar capabili-
ties. If further developed, this feature would allow a
straightforward mechanism for users or data providers to
contribute scripts.
Supporting the RMI communications protocol requires Java. Using Java
within a Firefox extension is something of a challenge, and code from
MIT's Simile project [9] was extremely helpful in this area. An
alternative under consideration is to communicate with the Boss using
an XML-based protocol over sockets, eliminating the need to run a Java
virtual machine in the browser's process.
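As a rough illustration of what an XML-based protocol over sockets could involve, the Python sketch below serializes a list of names to XML, sends it across a socket, and decodes it on the other side. Everything about the message is an assumption: the `namelist` and `name` elements, the newline terminator, and the helper functions are hypothetical and do not describe the Gaggle's actual or planned wire format. A local socket pair stands in for the connection between browser and Boss.

```python
import socket
import xml.etree.ElementTree as ET

def encode_namelist(species, names):
    """Serialize a name list to XML bytes, terminated by a newline."""
    root = ET.Element("namelist", species=species)
    for name in names:
        ET.SubElement(root, "name").text = name
    return ET.tostring(root, encoding="unicode").encode("utf-8") + b"\n"

def decode_namelist(payload):
    """Parse XML bytes back into (species, names)."""
    root = ET.fromstring(payload)
    return root.get("species"), [n.text for n in root.findall("name")]

def read_message(sock):
    """Read from the socket until the newline terminator arrives."""
    buffer = b""
    while not buffer.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            break  # peer closed the connection
        buffer += chunk
    return buffer

# A connected pair of local sockets keeps the example self-contained.
sender, receiver = socket.socketpair()
try:
    sender.sendall(
        encode_namelist("Halobacterium salinarum NRC-1", ["VNG1187G", "VNG1185G"])
    )
    species, names = decode_namelist(read_message(receiver))
finally:
    sender.close()
    receiver.close()
```

Because the message is plain text, either end could be written in any language with a socket library and an XML parser, which is the property that would remove the dependence on a Java virtual machine inside the browser.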
We plan to extend the Gaggle microformat to express links to data in
addition to embedding data directly in the page. This would allow large
data structures to be transmitted independently of the page while
preserving the linkage between presentation in the browser and the
underlying structured data. A standard format would decrease the need
for customized coding for each web site.
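The difference between embedding data in a page and linking to it can be illustrated with a small Python parser. The markup below is a sketch under stated assumptions: the class names `example-data`, `example-name`, and `example-data-link` are placeholders and are not the class names defined by the Gaggle microformat [19], and the URL is fictitious.

```python
from html.parser import HTMLParser

class DataBlockParser(HTMLParser):
    """Collect embedded names and linked data URLs from marked-up HTML."""

    def __init__(self):
        super().__init__()
        self.names = []   # data embedded directly in the page
        self.links = []   # data referenced by URL instead
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "li" and "example-name" in classes:
            self._in_item = True
        elif tag == "a" and "example-data-link" in classes:
            self.links.append(attributes.get("href"))

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.names.append(data.strip())

# Hypothetical page fragment with both embedded and linked data.
PAGE = """
<div class="example-data">
  <ol>
    <li class="example-name">VNG1187G</li>
    <li class="example-name">VNG1185G</li>
  </ol>
  <a class="example-data-link" href="http://example.org/network.xml">full network</a>
</div>
"""

parser = DataBlockParser()
parser.feed(PAGE)
```

Small lists travel with the page in `parser.names`, while a large structure is only referenced in `parser.links` and can be fetched separately when the user chooses to broadcast it.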
RDF (Resource Description Framework) is a data model
designed to represent meta-data for Semantic Web appli-
cations. Incorporating support for RDF into the Firegoose
would allow the Gaggle to exchange data with the seman-
tically rich resources envisioned by proponents of the
Semantic Web project.
Conclusion
The Firegoose incorporates Mozilla Firefox into the Gaggle environment,
providing coordinated access to web applications and programmatic data
sources. Performing data integration in the browser has several
advantages and is perhaps the most interesting feature of the
Firegoose.
Browsers excel at search and navigation. Using the Fire-
goose, a biologist can search and navigate web resources
using familiar browser-based interfaces with the addi-
tional capability of easily moving data from one web-
based resource to another as well as between the web and
the desktop. Interactively integrating specific information
as needed replaces the cumbersome process of maintain-
ing local copies of large databases and manually coercing
data from diverse sources into a compatible format. Using
the Gaggle data types as intermediaries lowers the barrier
between web resources and desktop tools, allowing the
scientist to creatively combine and re-use data in ways that
go beyond those provided by the curators of individual
data sources.
The Firegoose positions the Gaggle to take advantage of the increasing
use of web protocols to transmit structured data. The Firegoose
provides a framework in which new web resources can be integrated into
the Gaggle in a straightforward manner, accommodating a variety of
protocols. In supporting a number of protocols, we hope to encourage
data providers to make structured data available in the format of their
choice and to provide the information necessary to link web interfaces
with the underlying data, allowing browsing and programmatic access to
become seamlessly integrated.
If the web is becoming a channel for structured data,
applications that share data between diverse web
resources and software tools will be of increasing impor-
tance. The Firegoose aims to fill this role for the systems
biology domain.
Availability and requirements
Source code for the Firegoose, along with that of the other
components of the Gaggle, is available at the Gaggle web-
site [32]. Also available are instructions for installing and
uninstalling the Firegoose toolbar [33] and documenta-
tion [34]. Most of the desktop components of the Gaggle
are deployed as Java webstarts, which can be launched by
clicking a link in the browser.
The toolbar is compatible with versions 1.5.x and 2.0.x of Mozilla
Firefox. We anticipate maintaining compatibility with Firefox 3.x when
it is released.
A Java runtime environment, version 5 [35] or higher, is required, and
the Java browser plug-in for Firefox must be installed. Extra attention
is often required to install the Java browser plug-in on Linux.
Specific instructions for most distributions are available on the web.
The source code is distributed under the GNU Lesser General Public
License, the text of which is available at:
http://www.gnu.org/copyleft/lesser.html.
Authors' contributions
JCB Wrote the manuscript and implemented software.
PTS Conceived and initiated the project, implemented
prototype and provided feedback on the written manu-
script.
AKS Assisted in the conception and implementation of
the case study and provided feedback on the written man-
uscript.
NSB Conceived and initiated the project, wrote the man-
uscript and provided direction and feedback on quality of
results and software design.
All authors read and approved the final manuscript.
Additional material
Acknowledgements
We thank Dan Tenenbaum and Ricardo Vencio for critical reading of the
manuscript and helpful suggestions. We also thank John Boyle, Chris
Cavnor, David Shteynberg, and Neils Gelenburg for thoughtful discussions.
This work was supported by the following grants: NSF: DBI-0640950; DoE:
DE-FG02-07ER64327; and NIH: P50 GM076547.
References
1. Stein LD: Integrating biological databases. Nature reviews 2003, 4(5):337-345.
2. Covitz PA, Hartel F, Schaefer C, De Coronado S, Fragoso G, Sahni H, Gustafson S, Buetow KH: caCORE: a common infrastructure for cancer informatics. Bioinformatics 2003, 19(18):2404-2412.
3. Wilkinson MD, Links M: BioMOBY: an open source biological web services proposal. Briefings in bioinformatics 2002, 3(4):331-341.
4. Shannon PT, Reiss DJ, Bonneau R, Baliga NS: The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC bioinformatics 2006, 7:176.
5. Microformats.org [http://microformats.org/]
6. Semantic Web [http://www.w3.org/2001/sw/]
7. Microformats [http://blog.mozilla.com/faaborg/2006/12/11/microformats-part-0-introduction/]
8. Greasemonkey [http://www.greasespot.net/]
9. Huynh D, Mazzocchi S, Karger D: Piggy Bank: Experience the Semantic Web Inside Your Web Browser. International Semantic Web Conference: 2005.
10. Operator [https://addons.mozilla.org/en-US/firefox/addon/4106]
11. Mozilla Firefox [http://www.mozilla.com]
12. Kanehisa M: The KEGG database. Novartis Foundation symposium 2002, 247:91-101. discussion 101–103, 119–128, 244–152
13. von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Kruger B, Snel B, Bork P: STRING 7 – recent developments in the integration and prediction of protein interactions. Nucleic acids research 2007:D358-362.
14. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic acids research 2005:D433-437.
15. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome biology 2003, 4(5):P3.
16. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research 2003, 13(11):2498-2504.
17. R Statistical Package [http://www.r-project.org]
18. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, et al.: TM4: a free, open-source system for microarray data management and analysis. BioTechniques 2003, 34(2):374-378.
19. Gaggle Microformat [http://gaggle.systemsbiology.net/docs/geese/firegoose/microformat/]
20. SBEAMS [http://www.sbeams.org/]
21. Schmid A, Reiss DJ, Kaur A, Pan M, King N, Van PT, Hohmann L, Martin DB, Baliga NS: The anatomy of microbial cell state transitions in response to oxygen. Genome research 2007, 17(10):1399-1413.
22. Ng WV, Kennedy SP, Mahairas GG, Berquist B, Pan M, Shukla HD, Lasky SR, Baliga NS, Thorsson V, Sbrogna J, et al.: Genome sequence of Halobacterium species NRC-1. Proceedings of the National Academy of Sciences of the United States of America 2000, 97(22):12176-12181.
23. Ideker T, Thorsson V, Siegel AF, Hood LE: Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. J Comput Biol 2000, 7(6):805-817.
24. Gaggle and Firegoose Oxygen Demo [http://gaggle.systemsbiology.net/projects/demos/halo_oxygen_analysis/]
25. Muller JA, DasSarma S: Genomic analysis of anaerobic respiration in the archaeon Halobacterium sp. strain NRC-1: dimethyl sulfoxide and trimethylamine N-oxide as terminal electron acceptors. Journal of bacteriology 2005, 187(5):1659-1667.
26. Robb FT, Place AR, Sowers KR, Schreier HJ, DasSarma S, Fleischmann EM: Archaea: A laboratory manual. Cold Spring Harbor, New York: Cold Spring Harbor Laboratory Press; 1995.
27. Halobacterium genome annotations [http://baliga.systemsbiology.net/halobacterium/]
28. Ponting CP, Schultz J, Milpetz F, Bork P: SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic acids research 1999, 27(1):229-232.
29. Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains. Proceedings of the National Academy of Sciences of the United States of America 1998, 95(11):5857-5864.
30. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic acids research 2006:W729-732.
31. Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, et al.: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 2004, 20(17):3045-3054.
32. The Gaggle website [http://gaggle.systemsbiology.net]
33. Firegoose Installation Help [http://gaggle.systemsbiology.net/docs/geese/firegoose/install/]
34. Firegoose [http://gaggle.systemsbiology.net/docs/geese/firegoose/]
35. Download Java [http://www.java.com/download/]
Additional file 1
Characterization of H. salinarum anoxic transcription response by four
data sources. Data about our genes of interest from KEGG, STRING,
DAVID, and the Halobacterium genome annotations database are presented
here in tabular form.
[http://www.biomedcentral.com/content/supplementary/1471-2105-8-456-S1.xls]

Supplementary resource (1)

... [90]) enable GRN construction, visualization and analysis. FireGoose [91] and ChromeGoose (gaggle. systemsbiology.net/docs/geese/chromegoose) ...
... In the next stage of analysis, we focus on the use of Cytoscape to build, visualize, and interpret the network. Other tools described in Section 2.3.4 are also useful for the task of network analysis in archaea, with case studies published elsewhere [91,92]. In archaea, TFs resembling those of bacteria and eukaryotes work in concert to regulate gene expression [93]. ...
Article
Full-text available
To survive complex and changing environmental conditions, microorganisms use gene regulatory networks (GRNs) composed of interacting regulatory transcription factors (TFs) to control the timing and magnitude of gene expression. Genome-wide datasets; such as transcriptomics and protein-DNA interactions; and experiments such as high throughput growth curves; facilitate the construction of GRNs and provide insight into TF interactions occurring under stress. Systems biology approaches integrate these datasets into models of GRN architecture as well as statistical and/or dynamical models to understand the function of networks occurring in cells. Previously, these types of studies have focused on traditional model organisms (e.g. Escherichia coli, yeast). However, recent advances in archaeal genetics and other tools have enabled a systems approach to understanding GRNs in these relatively less studied archaeal model organisms. In this report, we outline a systems biology workflow for generating and integrating data focusing on the TF regulator. We discuss experimental design, outline the process of data collection, and provide the tools required to produce high confidence regulons for the TFs of interest. We provide a case study as an example of this workflow, describing the construction of a GRN centered on multi-TF coordinate control of gene expression governing the oxidative stress response in the hypersaline-adapted archaeon Halobacterium salinarum. Copyright © 2015. Published by Elsevier Inc.
... Bare et al. (2010) This permits interactive curation of the peak list and analysis of the ChIP-Seq data in the context of other Gaggle-enabled resources. Bare et al. (2007); Shannon et al. (2006) Interactive curation of a microbial ChIP-Seq data set can typically be completed in a few minutes. ...
Preprint
Full-text available
While numerous effective peak finders have been developed for eukaryotic systems, we have found that the approaches used can be error prone when run on high coverage bacterial and archaeal ChIP-Seq datasets. We have developed Pique, an easy to use ChIP-Seq peak finding application for bacterial and archaeal ChIP-Seq experiments. The software is cross-platform and Open Source, and based on only freely licensed dependencies. Output is provided in standardized file formats, and may be easily imported by the Gaggle Genome Browser (Bare et al. 2010) for manual curation and data exploration, or into statistical and graphics software such as R (R Core Team 2013) for further analysis. The software is available under the BSD-3 license, and tutorial and test data are included with the documentation. http://github.com/ryneches/pique.
... Bare et al. (2010) This permits interactive curation of the peak list and analysis of the ChIP-Seq data in the context of other Gaggle-enabled resources. Bare et al. (2007); Shannon et al. (2006) Interactive curation of a microbial ChIP-Seq data set can typically be completed in a few minutes. ...
Preprint
Full-text available
While numerous effective peak finders have been developed for eukaryotic systems, we have found that the approaches used can be error prone when run on high coverage bacterial and archaeal ChIP-Seq datasets. We have developed Pique, an easy to use ChIP-Seq peak finding application for bacterial and archaeal ChIP-Seq experiments. The software is cross-platform and Open Source, and based on only freely licensed dependencies. Output is provided in standardized file formats, and may be easily imported by the Gaggle Genome Browser (Bare et al. 2010) for manual curation and data exploration, or into statistical and graphics software such as R (R Core Team 2013) for further analysis. The software is available under the BSD-3 license, and tutorial and test data are included with the documentation. http://github.com/ryneches/pique.
... Another way of information retrieval is the parsing of rendered user interfaces [48] to extract content types from similar visual features in the synchronized views. Examples of toolkits for scraping data from different sources are the combinations of Firegoose [52], and the Gaggle Tool Creator [53] or SideCache [54] and SideKick [55], which are used in the biomedical domain. ...
Conference Paper
Full-text available
Over the past years, the visualization of large and complex data sets brought up various Visual Analytics (VA) tools in order to solve domain-specific tasks. These VA tools are typically implemented as individual software components in data-flow-oriented models, meaning that data is transferred from one component to the next. While most VA frameworks rely on a monolithic architecture with features for the integration of specialized analysis methods, we consider a loose coupling of independent applications, where autonomous VA tools are used in predefined analysis sequences. To this end, we provide a characterization of the data exchange process among individual VA tools in the form of a taxonomy. This taxonomy can be used as a checklist to identify characteristics and improve the data flow of one's own multi-tool VA setup. For this purpose, we conducted a systematic investigation of the individual aspects of data exchange that are commonly found across different usage scenarios. We apply our taxonomy to three existing multi-tool frameworks, the open-source library ReVize, the toolchain editor AnyProc, and the visualization and monitoring framework Plant@Hand3D.
... database and served by a Solr/Django-powered web interface [https://www.djangoproject.com]. The analysis and visualization layer allow users to query and explore networks at different levels, create and save workspaces for in-depth analysis of networks, and broadcast data via the Gaggle (24)/Firegoose (25) framework to third-party desktop and web applications (Figure 1). ...
Article
Full-text available
The ease of generating high-throughput data has enabled investigations into organismal complexity at the systems level through the inference of networks of interactions among the various cellular components (genes, RNAs, proteins and metabolites). The wider scientific community, however, currently has limited access to tools for network inference, visualization and analysis because these tasks often require advanced computational knowledge and expensive computing resources. We have designed the network portal (http://networks.systemsbiology.net) to serve as a modular database for the integration of user uploaded and public data, with inference algorithms and tools for the storage, visualization and analysis of biological networks. The portal is fully integrated into the Gaggle framework to seamlessly exchange data with desktop and web applications and to allow the user to create, save and modify workspaces, and it includes social networking capabilities for collaborative projects. While the current release of the database contains networks for 13 prokaryotic organisms from diverse phylogenetic clades (4678 co-regulated gene modules, 3466 regulators and 9291 cis-regulatory motifs), it will be rapidly populated with prokaryotic and eukaryotic organisms as relevant data become available in public repositories and through user input. The modular architecture, simple data formats and open API support community development of the portal.
... Just for illustrative purposes, bioinformatics frameworks such as Firegoose [11] and tools such as Protein Information Crawler [12], DrugBank [13], ChemSpider [14], BioSpider [15], OReFil [16] and MEDPIE [17] acknowledge (or had acknowledged at some point) the use of Web data scraping. Moreover, different examples of Web data scrapping can be found in recent domain-specific applications across Life Sciences, such as in Biotechnology and Bioengineering [18][19][20][21], Genetics [22][23][24], Molecular Biology [25,26], Crystallography [27] and Medicine [28][29][30][31]. ...
Article
Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover for all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is set on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well-known in clinical microbiology and similar domains, do not offer programmatic interfaces yet. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as means to cope with gene set enrichment analysis.
Article
Full-text available
Backgrounds Hypoxia‐responsive miRs have been frequently reported in the growth of various malignant tumors. The present study was aimed to investigate whether hypoxia‐responsive miR‐141‐3p was implicated in the pathogenesis of breast cancer via mediating HMGB1/HIF‐1α signaling pathway. Material and methods miRs expression profiling was filtrated by miR microarray assays. Gene and protein expression levels were respectively examined by RT‐qPCR and western blotting. Cell migration and invasion were analyzed using a transwell assay. Cell growth was determined using nude‐mouse transplanted tumor experiments. Results miR‐141‐3p was observed as a hypoxia‐responsive miR in breast cancer. miR‐141‐3p was down‐regulated in breast cancer specimens and could serve as an independent prognostic factor for predicting overall survival in breast cancer patients. In addition, overexpression of miR‐141‐3p could inhibit hypoxia‐induced cell migration and impede human breast cancer MDA‐MB‐231 cell growth in vivo. Mechanistically, hypoxia‐related HMGB1/HIF‐1α signaling pathway might be a possible target of miR‐141‐3p to prevent the development of breast cancer. Conclusions Our finding provides a new mechanism that miR‐141‐3p could prevent hypoxia‐induced breast tumorigenesis by post‐transcriptional repression of HMGB1/HIF‐1α signaling pathway.
Article
IntroductionMicroarray-Based RNA MeasurementFrom Chip-Based Transcriptomics to Sequencing-Based TranscriptomicsMicrorna Profiling in Stem CellsSome Examples of Tools/Software Suites for Data Integration, Network Analysis, and Data VisualizationReferences
Article
Full-text available
The cMonkey integrated biclustering algorithm identifies conditionally co-regulated modules of genes (biclusters). cMonkey integrates various orthogonal pieces of information which support evidence of gene co-regulation, and optimizes biclusters to be supported simultaneously by one or more of these prior constraints. The algorithm served as the cornerstone for constructing the first global, predictive Environmental Gene Regulatory Influence Network (EGRIN) model for a free-living cell, and has now been applied to many more organisms. However, due to its computational inefficiencies, long run-time and complexity of various input data types, cMonkey was not readily usable by the wider community. To address these primary concerns, we have significantly updated the cMonkey algorithm and refactored its implementation, improving its usability and extendibility. These improvements provide a fully functioning and user-friendly platform for building co-regulated gene modules and the tools necessary for their exploration and interpretation. We show, via three separate analyses of data for E. coli, M. tuberculosis and H. sapiens, that the updated algorithm and inclusion of novel scoring functions for new data types (e.g. ChIP-seq and transcription factor over-expression [TFOE]) improve discovery of biologically informative co-regulated modules. The complete cMonkey2 software package, including source code, is available at https://github.com/baliga-lab/cmonkey2. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Article
Background: Functional annotation of differentially expressed genes is a necessary and critical step in the analysis of microarray data. The distributed nature of biological knowledge frequently requires researchers to navigate through numerous web-accessible databases gathering information one gene at a time. A more judicious approach is to provide query-based access to an integrated database that disseminates biologically rich information across large datasets and displays graphic summaries of functional information. Results: Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://www.david.niaid.nih.gov) addresses this need via four web-based analysis modules: 1) Annotation Tool - rapidly appends descriptive data from several public databases to lists of genes; 2) GoCharts - assigns genes to Gene Ontology functional categories based on user selected classifications and term specificity level; 3) KeggCharts - assigns genes to KEGG metabolic processes and enables users to view genes in the context of biochemical pathway maps; and 4) DomainCharts - groups genes according to PFAM conserved protein domains. Conclusions: Analysis results and graphical displays remain dynamically linked to primary data and external data repositories, thereby furnishing in-depth as well as broad-based data coverage. The functionality provided by DAVID accelerates the analysis of genome-scale datasets by facilitating the transition from data collection to biological meaning.
Article
We report the complete sequence of an extreme halophile, Halobacterium sp. NRC-1, harboring a dynamic 2,571,010-bp genome containing 91 insertion sequences representing 12 families and organized into a large chromosome and 2 related minichromosomes. The Halobacterium NRC-1 genome codes for 2,630 predicted proteins, 36% of which are unrelated to any previously reported. Analysis of the genome sequence shows the presence of pathways for uptake and utilization of amino acids, active sodium-proton antiporter and potassium uptake systems, sophisticated photosensory and signal transduction pathways, and DNA replication, transcription, and translation systems resembling more complex eukaryotic organisms. Whole proteome comparisons show the definite archaeal nature of this halophile with additional similarities to the Gram-positive Bacillus subtilis and other bacteria. The ease of culturing Halobacterium and the availability of methods for its genetic manipulation in the laboratory, including construction of gene knockouts and replacements, indicate this halophile can serve as an excellent model system among the archaea.
Article
Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 greater than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline 1; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.
Article
SMART is a simple modular architecture research tool and database that provides domain identification and annotation on the WWW (http://coot.embl-heidelberg.de/SMART). The tool compares query sequences with its databases of domain sequences and multiple alignments whilst concurrently identifying compositionally biased regions such as signal peptide, transmembrane and coiled coil segments. Annotated and unannotated regions of the sequence can be used as queries in searches of sequence databases. The SMART alignment collection represents more than 250 signalling and extracellular domains. Each alignment is curated to assign appropriate domain boundaries and to ensure its quality. In addition, each domain is annotated extensively with respect to cellular localisation, species distribution, functional class, tertiary structure and functionally important residues.
Conference Paper
Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 greater than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline 1; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.
Article
The Semantic Web Initiative envisions a Web wherein information is offered free of presentation, allowing more effective exchange and mixing across web sites and across web pages. But without substantial Semantic Web content, few tools will be written to consume it; without many such tools, there is little appeal to publish Semantic Web content. To break this chicken-and-egg problem, thus enabling more flexible information access, we have created a web browser extension called Piggy Bank that lets users make use of Semantic Web content within Web content as users browse the Web. Wherever Semantic Web content is not available, Piggy Bank can invoke screenscrapers to restructure information within web pages into Semantic Web format. Through the use of Semantic Web technologies, Piggy Bank provides direct, immediate benefits to users in their use of the existing Web. Thus, the existence of even just a few Semantic Web-enabled sites or a few scrapers already benefits users. Piggy Bank thereby offers an easy, incremental upgrade path to users without requiring a wholesale adoption of the Semantic Web’s vision. To further improve this Semantic Web experience, we have created Semantic Bank, a web server application that lets Piggy Bank users share the Semantic Web information they have collected, enabling collaborative efforts to build sophisticated Semantic Web information repositories through simple, everyday use of Piggy Bank.
Article
Although two-color fluorescent DNA microarrays are now standard equipment in many molecular biology laboratories, methods for identifying differentially expressed genes in microarray data are still evolving. Here, we report a refined test for differentially expressed genes which does not rely on gene expression ratios but directly compares a series of repeated measurements of the two dye intensities for each gene. This test uses a statistical model to describe multiplicative and additive errors influencing an array experiment, where model parameters are estimated from observed intensities for all genes using the method of maximum likelihood. A generalized likelihood ratio test is performed for each gene to determine whether, under the model, these intensities are significantly different. We use this method to identify significant differences in gene expression among yeast cells growing in galactose-stimulating versus non-stimulating conditions and compare our results with current approaches for identifying differentially-expressed genes. The effect of sample size on parameter optimization is also explored, as is the use of the error model to compare the within- and between-slide intensity variation intrinsic to an array experiment.