
The Firegoose: two-way integration of diverse data from different bioinformatics web resources with desktop applications

November 2007
BMC Bioinformatics 8(1):456

DOI:10.1186/1471-2105-8-456

Source
PubMed

License
CC BY 2.0

Authors:

J Christopher Bare

Sage Bionetworks

Paul Shannon

Institute for Systems Biology

Amy K Schmid

Duke University

Nitin Baliga

Institute for Systems Biology



Figure 5. Using Firegoose to investigate transcription response of H. salinarum to anoxic conditions. A list of genes with similar expression profiles is found by microarray analysis. (1) We broadcast these genes to KEGG to query for known biochemical pathways, for example, viewing the Arginine and proline metabolism pathway. (2) We broadcast the genes to EMBL STRING, a protein interaction database, where we can (3) navigate to functionally related genes. (4) In order to manipulate the network, we broadcast it to Cytoscape. (5) A local database provides additional information. (6) DAVID performs functional clustering. (7) For one cluster containing signal transduction genes, we use STRING's links to protein domains.


BioMed Central


BMC Bioinformatics

Open Access

Software

The Firegoose: two-way integration of diverse data from different bioinformatics web resources with desktop applications

J Christopher Bare, Paul T Shannon, Amy K Schmid and Nitin S Baliga*

Address: Institute for Systems Biology, 1441 N 34th Street, Seattle, WA 98103, USA

Email: J Christopher Bare - cbare@systemsbiology.org; Paul T Shannon - pshannon@systemsbiology.org; Amy K Schmid - aschmid@systemsbiology.org; Nitin S Baliga* - nbaliga@systemsbiology.org

* Corresponding author

Abstract

Background: Information resources on the World Wide Web play an indispensable role in modern biology. But integrating data from multiple sources is often encumbered by the need to reformat data files, convert between naming systems, or perform ongoing maintenance of local copies of public databases. Opportunities for new ways of combining and re-using data are arising as a result of the increasing use of web protocols to transmit structured data.

Results: The Firegoose, an extension to the Mozilla Firefox web browser, enables data transfer between web sites and desktop tools. As a component of the Gaggle integration framework, Firegoose can also exchange data with Cytoscape, the R statistical package, Multiexperiment Viewer (MeV), and several other popular desktop software tools. Firegoose adds the capability to easily use local data to query KEGG, EMBL STRING, DAVID, and other widely-used bioinformatics web sites. Query results from these web sites can be transferred to desktop tools for further analysis with a few clicks.

Firegoose acquires data from the web by screen scraping, microformats, embedded XML, or web services. We define a microformat, which allows structured information compatible with the Gaggle to be embedded in HTML documents.

We demonstrate the capabilities of this software by performing an analysis of the genes activated in the microbe Halobacterium salinarum NRC-1 in response to anaerobic environments. Starting with microarray data, we explore functions of differentially expressed genes by combining data from several public web resources and construct an integrated view of the cellular processes involved.

Conclusion: The Firegoose incorporates Mozilla Firefox into the Gaggle environment and enables interactive sharing of data between diverse web resources and desktop software tools without maintaining local copies. Additional web sites can be incorporated easily into the framework using the scripting platform of the Firefox browser. Performing data integration in the browser allows the excellent search and navigation capabilities of the browser to be used in combination with powerful desktop tools.

Published: 19 November 2007

BMC Bioinformatics 2007, 8:456 doi:10.1186/1471-2105-8-456

Received: 4 August 2007

Accepted: 19 November 2007

This article is available from: http://www.biomedcentral.com/1471-2105/8/456

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BMC Bioinformatics 2007, 8:456 http://www.biomedcentral.com/1471-2105/8/456


Background

Information resources on the World Wide Web play an indispensable role in modern biology. Using a web browser, the biologist can easily access a wealth of information such as sequences, biochemical pathways, protein interactions, functional domains, annotations, and gene expression data. Yet, integrating data from diverse sources remains challenging. Biologists wishing to analyze their own experimental data in combination with publicly available data face the cumbersome tasks of converting file formats, reconciling incompatible schemas, and mapping between inconsistent naming systems. Several strategies for data integration have been developed in the context of biology [1]. Data warehousing and web services are two general approaches. Semantic mapping between different data sources has been accomplished by describing biological entities in terms of shared object models [2] or curated ontologies [3]. Our team's previous work in this area resulted in the Gaggle, a software environment that integrates databases and software tools based on a few simple data types and the principle of semantic flexibility [4]. Previously, the Gaggle supported integration with web resources through a special-purpose browser of limited capabilities. An improved solution was needed due to the ubiquitous role of the browser and the increasing relevance of the web to data integration.

The web is rapidly becoming a channel for structured data as evidenced by the increasing adoption of web services. Rich internet applications, microformats [5], and the Semantic Web [6] are pushing the browser into a central role as an information broker [7] working with data in human and machine readable form side-by-side. Several tools outside the domain of biology have augmented the browser with these types of capabilities, including Greasemonkey [8], Piggy Bank [9], and Operator [10]. These technologies offer the ability to perform ad-hoc manipulation of data from multiple sources on the web using familiar browser-based interfaces.

We developed the Firegoose to bring a full-featured browser into the Gaggle environment and to allow easy transfer of data between the web and the desktop. The Firegoose is a toolbar for the Mozilla Firefox [11] browser that makes use of a diverse array of web communication protocols to seamlessly query and retrieve data from custom databases as well as popular bioinformatics resources including KEGG [12], EMBL STRING [13,14], and DAVID [15]. As a member of the Gaggle software environment, the Firegoose can also exchange data with a growing collection of desktop tools. Bringing the data integration techniques of the Gaggle into the browser, the Firegoose facilitates exploratory analysis of systems biology data using web resources and desktop software tools.

Implementation

The Firegoose is implemented as an add-on extension to Mozilla Firefox, a popular open-source web browser. Its user interface is a toolbar defined in XUL, an XML dialect for describing user interface components, and its actions are scripted in Javascript. The Firegoose communicates with the Gaggle via the Java RMI protocol and supports several modes of information exchange over the web.

The Gaggle approach

While it can be useful outside of the Gaggle, the Firegoose is a component of the Gaggle software environment. The goal of the Gaggle is to enable interactive exploration of interrelated biological data by making data transfer between analysis tools, databases, and now web resources, quick and easy. Integration is achieved by supporting the interchange of five simple data types, which cover a wide variety of use-cases within the domain of systems biology.

Name list: a list of identifiers.

Associative array: a set of key/value pairs.

Data matrix: a two-dimensional grid of floating-point numbers, with labels for each row and column.

Network: a graph in which attributes may be assigned to nodes and edges.

Cluster: a set of identifiers of interest under a set of conditions.
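To make the five data types concrete, the following sketch shows one way they might be written as plain Javascript objects. The field names (type, names, entries, values, and so on) and the gene identifiers are illustrative assumptions; they are not taken from the Gaggle API.

```javascript
// Illustrative shapes only: field names such as `type`, `names`, and
// `values` are assumptions, not the Gaggle API. Gene IDs are placeholders.
const nameList = {
  type: "NameList",
  species: "Halobacterium sp. NRC-1",
  names: ["geneA", "geneB", "geneC"],
};

const associativeArray = {
  type: "AssociativeArray",
  entries: { geneA: "transporter", geneB: "kinase" },
};

const dataMatrix = {
  type: "DataMatrix",
  rowLabels: ["geneA", "geneB"],
  columnLabels: ["lowOxygen", "highOxygen"],
  values: [
    [1.8, -0.4],
    [0.2, 0.9],
  ],
};

const network = {
  type: "Network",
  nodes: [
    { id: "geneA", attributes: { annotation: "transporter" } },
    { id: "geneB", attributes: {} },
  ],
  edges: [
    { source: "geneA", target: "geneB", attributes: { evidence: "proximity" } },
  ],
};

const cluster = {
  type: "Cluster",
  names: ["geneA", "geneB"],
  conditions: ["lowOxygen"],
};

// A matrix is well formed when it has one row per row label and every row
// has one value per column label.
function isWellFormedMatrix(matrix) {
  return (
    matrix.values.length === matrix.rowLabels.length &&
    matrix.values.every((row) => row.length === matrix.columnLabels.length)
  );
}
```

Because each type is a small, self-describing structure, a receiving application only needs to understand these shapes, not the sender's internal object model.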

The central feature of the Gaggle is the broadcast. Applications become part of the Gaggle by implementing an interface that allows them to accept broadcasts. A broadcast consists of sending a message holding one of the five data types to a program called the Gaggle Boss, which then relays the message to one or more receiving applications or targets.
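The relay pattern described above can be sketched in a few lines. This is an in-process stand-in only: the real Boss relays messages between separate programs over Java RMI, and the class name, method names, and message shape used here are assumptions made for illustration.

```javascript
// In-process stand-in for the Gaggle Boss. The real Boss relays messages
// between separate programs over Java RMI; the class and method names here
// are assumptions made for illustration.
class Boss {
  constructor() {
    this.geese = new Map(); // goose name -> object with a handle(message) method
  }

  register(name, goose) {
    this.geese.set(name, goose);
  }

  // Relay a message from `source` to one named target, or to every other
  // registered goose when `target` is "all". Returns the names reached.
  broadcast(source, target, message) {
    const reached = [];
    for (const [name, goose] of this.geese) {
      if (name === source) continue;
      if (target !== "all" && name !== target) continue;
      goose.handle(message);
      reached.push(name);
    }
    return reached;
  }
}

// A goose that simply records what it receives.
function recordingGoose() {
  const received = [];
  return { received, handle: (message) => received.push(message) };
}

const boss = new Boss();
const cytoscape = recordingGoose();
const mev = recordingGoose();
boss.register("Firegoose", recordingGoose());
boss.register("Cytoscape", cytoscape);
boss.register("MeV", mev);

const message = { type: "NameList", names: ["geneA", "geneB"] };
boss.broadcast("Firegoose", "Cytoscape", message); // relayed to Cytoscape only
```

Each receiver decides for itself what to do with the message, which is the sense in which the sender and receiver share data without sharing an object model.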

Using the simple Gaggle data types as an intermediary has the advantage that applications can share data without sharing common object models. Each program is free to handle broadcasts in a way that makes sense in its own context. This is semantic flexibility, the first of the Gaggle's guiding principles. The second is to keep the set of data types as small as possible, minimizing the programming effort required to integrate a new application into the Gaggle environment.

The Gaggle environment currently includes Cytoscape [16] for viewing networks, the R statistical package [17], Multi-experiment Viewer (MeV) [18] for microarray analysis, the spreadsheet-like Data Matrix Viewer (DMV), and several other programs. The term "gaggled" is used to indicate that a software application or database has been configured to exchange messages with the Boss; for example, Cytoscape has been gaggled through the Cygoose plug-in.

The Firegoose uses the power of the browser to extend the reach of the Gaggle to target web resources based on widely supported protocols. The Firegoose communicates with the Gaggle Boss via Java RMI. To communicate with web resources, the Firegoose supports standard web protocols and data encoding schemes such as HTTP, HTML, XML, and SOAP. A schematic view (Fig. 1) shows the flow of messages over various protocols between desktop applications, the Gaggle Boss, the Firegoose, and several websites.

Exchanging data with web resources

When data is broadcast to the Firegoose, it can be rebroadcast, under user control, to any supported website. A broadcast to a web resource is typically used to build a query against an underlying database. For example, a set of nodes representing genes may be selected in a Cytoscape network. Broadcasting this data from Cytoscape to the Firegoose causes a list of gene identifiers to appear in the Firegoose's broadcast data menu. To broadcast to the KEGG Pathway database, the user would select the gene list and then select the target "KEGG Pathway". Clicking "Broadcast" submits a request to KEGG for the pathways in which those genes participate (Fig. 2).

More importantly, the Firegoose also brings data back from the web in a usable form. Traditionally, a web server responds to a request by sending an HTML document, which the browser renders for display. The HTML format in which web pages are encoded describes presentation but is poorly suited for representing the underlying structure, or metadata, necessary for integration and re-use by automated processes. In response to this shortcoming, a variety of encodings have been developed including XML, SOAP, and microformats. Firegoose supports several means of data interchange, described in detail below, in order to interoperate with a wide variety of data providers.

Screen Scraping

HTML does not represent structured data well. However, by making some assumptions about how the data are formatted, it is possible to reconstruct the structure that is lost in HTML. This process, known as screen scraping, has a long history on the web.

Figure 1. Communication in the Gaggle. Software and databases shown as red dots send and receive broadcasts via Java RMI. The blue nodes are web resources connected to the Gaggle through the Firegoose and accessed using HTTP with other protocols and formats such as HTML, XML, and SOAP layered over top. Diagram labels: visualization tools (Cytoscape for interactions, DMV for matrices), databases (microarray, proteomics, ChIP-chip, annotations, protein CoIP), and analysis tools (R statistical environment, MeV) connect to the Boss via RMI, as does the Firegoose; the web resources are KEGG (metabolic pathways), NCBI (genes, proteins), DAVID (functional annotations), EMBL STRING (functional associations), and SBEAMS (Halobacterium functional annotations).

The Firegoose scrapes pages generated by the KEGG Pathways database to acquire a list of genes present in a given pathway. The pathway diagrams generated by KEGG contain links from each gene to its sequence and annotations. The links take the following form: href="/dbget-bin/www_bget?eco+b0728+b0729".

This allows a simple script to glean the information that the two Escherichia coli genes b0728 and b0729 play a role in the current pathway, in this case the citrate cycle. By scanning all such links, a list can be compiled of all genes that are members of the pathway in a selected organism, which fits nicely into the name-list data type of the Gaggle. A biologist may, for example, broadcast this list to a visualization tool for microarray data and check for evidence of differential expression under specific conditions of genes whose products play a role in the same metabolic pathway.
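A minimal sketch of this kind of scraping is given below. It is not the Firegoose's own code: the function name and the returned object shape are assumptions, and the only detail taken from the text is the link form href="/dbget-bin/www_bget?eco+b0728+b0729".

```javascript
// Collect gene identifiers from KEGG-style links of the form
//   href="/dbget-bin/www_bget?eco+b0728+b0729"
// The function name and the returned object shape are illustrative.
function genesFromKeggLinks(html) {
  const linkPattern = /href="\/dbget-bin\/www_bget\?([a-z]+)\+([^"]+)"/g;
  const genes = new Set(); // a Set drops genes that appear in several links
  let organism = null;
  for (const match of html.matchAll(linkPattern)) {
    organism = organism || match[1]; // organism code, e.g. "eco"
    for (const id of match[2].split("+")) {
      genes.add(id);
    }
  }
  return { organism, names: [...genes] };
}

// Two links as they might appear in a pathway image map.
const sampleHtml =
  '<area shape="rect" href="/dbget-bin/www_bget?eco+b0728+b0729">' +
  '<area shape="rect" href="/dbget-bin/www_bget?eco+b0728">';

const scraped = genesFromKeggLinks(sampleHtml);
// scraped.organism is "eco"; scraped.names is ["b0728", "b0729"]
```

The dependence on one exact URL pattern is what makes the approach fragile: if the site changes its link format, the pattern silently stops matching.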

While it can be effective, screen scraping is inelegant and prone to breakage. It requires code to be written that is specific for an individual site and will likely require maintenance whenever the site makes formatting changes. For the Firegoose, screen scraping is one means of acquiring data, but is not the preferred option. One simple solution to the lack of structure inherent in web pages is found in microformats.

Microformats and Embedded XML

Microformats embed machine-readable data within valid HTML. The structure missing in HTML is supplied in CSS class attributes, giving the parser the necessary clues to extract data systematically from the page. The same data elements may do double duty accommodating both on-screen display and machine-readability. A browser extension like the Firegoose is ideally positioned to augment web browsing with new capabilities enabled by data embedded in the web page.

Figure 2. The Firegoose toolbar for Mozilla Firefox. A broadcast is sent in three steps: select data to be broadcast (i), select the target of the broadcast (ii), and click 'Broadcast' to send (iii). In this example, a list of genes has been broadcast to KEGG. The result shows that the genes are part of the oxidative phosphorylation pathway. Screenshot labels: administrative functions, select data to be broadcast, select target for next broadcast, send broadcast, desktop target geese, web site targets.

We have defined a microformat to represent the Gaggle data types, specifications for which can be found in the Gaggle microformat reference [19]. This format allows any of the Gaggle data types to be embedded directly into a web page, either displayed in the page or hidden from the user. The toolbar detects embedded data and makes that data available for broadcast to other web sites or desktop applications. An example of the Gaggle microformat is shown (Fig. 3) which encodes a list of three genes involved in signal transduction.
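As a rough illustration of how embedded data of this kind might be detected, the sketch below pulls a name list out of an HTML string using the class names that appear in Fig. 3 (gaggle-name, gaggle-species, gaggle-size). Everything else is an assumption: the function names are hypothetical, the list items are placeholder identifiers, and regular expressions are used only so that the example runs outside a browser. Inside Firefox, a handler would more naturally walk the DOM.

```javascript
// Return the text of the first element carrying the given class attribute.
function textOfClass(html, className) {
  const pattern = new RegExp('class="' + className + '"[^>]*>([^<]*)<');
  const match = html.match(pattern);
  return match ? match[1].trim() : null;
}

// Build a name-list object from microformat-style markup. The object shape
// is an assumption; only the class names come from the figure.
function parseGaggleNameList(html) {
  const names = [];
  for (const item of html.matchAll(/<li(?:\s[^>]*)?>([^<]*)<\/li>/g)) {
    names.push(item[1].trim());
  }
  const size = textOfClass(html, "gaggle-size");
  return {
    name: textOfClass(html, "gaggle-name"),
    species: textOfClass(html, "gaggle-species"),
    size: size === null ? names.length : Number(size), // size is optional
    names,
  };
}

// Sample markup; geneA, geneB, and geneC are placeholder identifiers.
const samplePage = `
  <p>name= <span class="gaggle-name">Signal transduction genes</span></p>
  <p>species= <span class="gaggle-species">Halobacterium sp. NRC-1</span></p>
  <p>size= <span class="gaggle-size">3</span></p>
  <ol>
    <li>geneA</li>
    <li>geneB</li>
    <li>geneC</li>
  </ol>`;

const embedded = parseGaggleNameList(samplePage);
```

No site-specific knowledge is needed here: any page that uses the same class names can be read by the same parser.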

As with microformats, XML can be embedded directly in web pages. XML has the advantages of a well-defined syntax and good support by parsers and tools. Adoption of this technique has been hindered by the fact that web pages containing embedded XML fail HTML and XHTML validation. This may not be an obstacle in practice since browsers do not perform validation and simply ignore unrecognized tags. We implemented an XML format for Gaggle name lists to communicate lists of genes from a web view of a database of gene annotations for H. salinarum NRC-1 [20].

These two approaches share the same benefit: they allow data to be captured without requiring site-specific screen scraping code to be written and maintained. By adding a small amount of standard markup to their web pages, website providers make their site accessible to any tools that recognize that format. In the case of the Gaggle microformat, websites can be gaggled with no additional code in the Firegoose and only a very modest effort on the server side.

Embedding data directly in HTML works well for moderate amounts of data. Where the size of the data is prohibitive, a microformat embedded in the page could provide the parameters needed by tools like Firegoose to access related data through external channels such as web services or direct downloads.

Web Services

A web service is a programmatic interface allowing applications running on heterogeneous platforms to interoperate over a network. Typically, the messages passed back and forth are XML. The Firegoose can access web services from within the browser, forging a direct link from the presentation of a web site to its underlying data.

For example, KEGG offers extensive web services. The list of genes in a pathway can be requested by calling the get_genes_by_pathway SOAP method, which returns the same list of genes acquired above by screen scraping, but in an easily processed XML format. A parameter specifies the pathway and organism in which we are interested. For example, the string "path:eco00020" denotes the tricarboxylic acid (TCA) cycle pathway in E. coli. In order to find these parameters, a nominal amount of screen scraping and case-specific scripting is still necessary.
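The small amount of scripting needed to find such a parameter might look like the sketch below. The function name and the page URL are illustrative assumptions, as is the premise that the pathway identifier appears at the start of the query string. Only the resulting parameter form, "path:eco00020", comes from the text.

```javascript
// Derive a pathway parameter such as "path:eco00020" from the URL of a
// pathway page. Assumes (for illustration) that the query string begins
// with a 3-4 letter organism code followed by a 5-digit pathway number.
function pathwayParameterFromUrl(url) {
  const query = url.split("?")[1] || "";
  const match = query.match(/^([a-z]{3,4})(\d{5})/);
  return match ? "path:" + match[1] + match[2] : null;
}

// Hypothetical page URL used only to exercise the function.
const pathwayParam = pathwayParameterFromUrl(
  "http://www.genome.jp/kegg-bin/show_pathway?eco00020"
);
// pathwayParam is "path:eco00020"
```

The resulting string would then be passed as the argument to the get_genes_by_pathway call. The call itself is omitted here because its message format is not described in the text.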

Accessing web services from within the browser is not typical, but offers some compelling advantages. The familiar user-friendly interface of a web application can be used to navigate through a database and then the web service can be invoked to acquire the desired records in structured form amenable to further computation. The browser supports an interactive style of usage without the need to build a customized client to the web service.

Currently, there is no standard way of linking data displayed in a web page with a query to a web service that will return the same data in a structured form. The Firegoose uses screen scraping to fill in this gap. A standardized microformat for this purpose could facilitate this kind of interactive access to structured data.

Website Handlers

A website may provide data via a web service, an embedded microformat, or other channels. If not, screen scraping may be required. For the sake of modularity, we encapsulate the details of interacting with each individual site or specific format into a separate component called a "handler".

Figure 3. Example of the Gaggle Microformat. Microformats embed structured data in web pages by using CSS class attributes as markup.

<html>
…
<p>name= <span class="gaggle-name">Signal transduction genes</span></p>
<p>species= <span class="gaggle-species">Halobacterium sp. NRC-1</span></p>
<p>(optional)size= <span class="gaggle-size">3</span></p>
<ol>
</ol>
</div>
…
</html>

For each target website or format, a handler is written and packaged in a separate Javascript source file that contains code specific to that target. The handler implements a common interface (Fig. 4) and is responsible for recognizing web pages with which it can interact and transferring data to and from its target. Recognition can be based either on the URL or the contents of the page, which means we can write a handler for a particular web site or a handler that understands a data format embedded in pages from several sources.

When a document is loaded in the browser, the toolbar calls the recognize method for each handler until a handler recognizes the document. Then the getPageData method is called for that handler. The handler then inspects the document and constructs a list of GaggleData objects that can be acquired from the page. This, in turn, is used to populate the broadcast menu.
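The dispatch loop described above can be sketched as follows. The recognize and getPageData method names come from the handler interface (Fig. 4). The two handlers, the pageDataFor function, and the use of plain objects with url and body fields in place of browser documents are assumptions made so that the sketch runs on its own.

```javascript
// Documents are modeled as plain objects with `url` and `body` fields so the
// sketch runs outside a browser; both handlers are hypothetical.
const keggHandler = {
  name: "KEGG Pathway",
  // Recognition based on the URL of the page.
  recognize: (doc) => doc.url.includes("show_pathway"),
  getPageData: (doc) => [{ name: "Genes in pathway", type: "NameList" }],
};

const microformatHandler = {
  name: "Gaggle Microformat",
  // Recognition based on the contents of the page.
  recognize: (doc) => doc.body.includes('class="gaggle-'),
  getPageData: (doc) => [{ name: "Embedded name list", type: "NameList" }],
};

// Ask each handler in turn; the first one that recognizes the document
// supplies the entries for the broadcast menu.
function pageDataFor(doc, handlers) {
  for (const handler of handlers) {
    if (handler.recognize(doc)) {
      return handler.getPageData(doc);
    }
  }
  return []; // no handler recognized the page
}

const handlers = [keggHandler, microformatHandler];
const tutorialPage = {
  url: "http://example.org/tutorial.html",
  body: '<span class="gaggle-name">Anaerobic genes</span>',
};
const menuEntries = pageDataFor(tutorialPage, handlers);
```

Because recognition stops at the first match, the order of the handler list matters when a page could be claimed by both a site-specific handler and a format-specific one.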

The GaggleData object represents data from the website in the form of one of the Gaggle data types. Support for lazy instantiation and asynchronous access to web services can be neatly hidden behind its simple interface. Typically, getPageData performs only enough work to generate descriptive information for each data object. The actual parsing or issuing a request to a web service can be deferred until the user requests a broadcast. This minimizes processing overhead and prevents unnecessary network traffic.
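Lazy instantiation of this kind can be sketched with a closure that postpones the expensive step until the data is first requested. The accessor names (getName, getType, getData) and the factory function are assumptions; the text does not give the actual GaggleData interface.

```javascript
// Wrap an expensive `load` function so that it runs at most once, and only
// when the data is actually requested. Accessor names are illustrative.
function lazyGaggleData(name, type, load) {
  let cached;
  let loaded = false;
  return {
    getName: () => name, // cheap descriptive information for the menu
    getType: () => type,
    getData() {
      if (!loaded) {
        cached = load(); // parsing or a web service call would happen here
        loaded = true;
      }
      return cached;
    },
  };
}

// Count how often the expensive step runs.
let loadCount = 0;
const pathwayGenes = lazyGaggleData("Genes in pathway", "NameList", () => {
  loadCount += 1;
  return ["b0728", "b0729"];
});

const label = pathwayGenes.getName(); // populating the menu costs nothing
const firstRead = pathwayGenes.getData(); // the user requests a broadcast
const secondRead = pathwayGenes.getData(); // served from the cached result
```

An asynchronous variant could return a Promise from getData in the same way, keeping the caller unaware of whether the data came from the page or from a remote service.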

Any of the Gaggle data types may be returned from the getPageData method, but so far we have only implemented broadcasting lists of names to web resources. Typically, a broadcast to a website is transformed into a query, which retrieves information relevant to a list of genes, proteins, or other identifiers. In principle, methods could easily be defined for each of the other data types: handleNetwork, handleMatrix, etc. In keeping with the nature of dynamic languages like Javascript, all of the "handle..." methods are optional.

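Optional "handle..." methods lend themselves to a simple feature test at broadcast time, sketched below. The handleNameList(species, names) signature follows the handler interface (Fig. 4). The dispatch function, the message shape, the type labels, and the argument lists assumed for handleNetwork and handleMatrix are illustrative only.

```javascript
// Map each data type to the handler method that would accept it. Only
// handleNameList is defined in the handler interface; the type labels and
// the other signatures are assumptions.
const METHOD_FOR_TYPE = {
  NameList: "handleNameList",
  Network: "handleNetwork",
  DataMatrix: "handleMatrix",
};

// Deliver a message if, and only if, the handler defines the matching
// method. Returns true when the broadcast was accepted.
function dispatchBroadcast(handler, message) {
  const methodName = METHOD_FOR_TYPE[message.type];
  if (methodName && typeof handler[methodName] === "function") {
    handler[methodName](message.species, message.data);
    return true;
  }
  return false;
}

// A hypothetical handler that only accepts name lists.
const queries = [];
const nameListOnlyHandler = {
  handleNameList(species, names) {
    queries.push(species + ": " + names.join("+"));
  },
};

const accepted = dispatchBroadcast(nameListOnlyHandler, {
  type: "NameList",
  species: "eco",
  data: ["b0728", "b0729"],
});
const rejected = dispatchBroadcast(nameListOnlyHandler, {
  type: "Network",
  species: "eco",
  data: { nodes: [], edges: [] },
});
```

A toolbar built this way could grey out targets whose handlers lack the method for the currently selected data type, rather than failing at send time.
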
Results and discussion

Case Study: synthesis of a model of anaerobic physiology in H. salinarum NRC-1 using Firegoose

To demonstrate the effectiveness of the Firegoose in a typical systems biology investigation, we explore expression in Halobacterium salinarum NRC-1 in response to fluctuating oxygen concentration [21]. Briefly, H. salinarum NRC-1 is a halophilic archaeon with a small genome of 2.6 Mb that contains ~2,400 protein-coding genes [22]. This organism is most prolific in aerobic conditions but switches facultatively to other modes of energy production in anoxic environments. We will take as our starting point a list of genes that were found in microarray experiments to be actively expressed under low oxygen conditions. To understand the physiological changes associated with the anoxic state as completely as possible, we need to understand the functions of individual proteins, metabolic pathways encoded by genes of known function, and functional associations among genes through evolutionary and literature analysis. Not all of this information is contained within one resource; for instance, KEGG specializes in information regarding metabolic pathways, STRING calculates functional associations among proteins through comparative analysis of sequence, literature and publicly available experimental data, and DAVID classifies proteins into enriched functional clusters. Finally, using our own expertise we have curated function assignments to many proteins in H. salinarum NRC-1 using a combination of sequence- and structure-based approaches. In this example, we will demonstrate how the Firegoose can significantly aid in functionally characterizing a set of genes by enabling seamless exploration and integration of several web-based tools from separate providers including KEGG, EMBL STRING, DAVID, and an in-house annotation database for H. salinarum NRC-1. First, we will briefly explain how these genes were identified.

Halobacterial cells were subjected to differing levels of oxygen and samples were collected at varying oxygen concentrations and times. Total RNA from these samples was analyzed using microarray analysis [23]. The DMV (Data Matrix Viewer) and the R statistical package were used to normalize the data and select genes whose expression was significantly changed in response to the perturbations in oxygen. MeV, a microarray data analysis tool, was then used to cluster the differentially expressed genes by their expression profiles. Two sets of genes emerged, one activated in the presence of oxygen and another activated in the absence of oxygen. Additional details about this part of the analysis are included in the tutorial web page [24]. We will concentrate on the anaerobically induced genes.

Figure 4. Interface of a handler script. The handler interface provides an extension point for adding support for new websites, web services, and protocols.

/**
 * check the given doc to see if we can parse it.
 * Return true if so, or false otherwise.
 */
handler.recognize = function(doc) {...};

/**
 * open the web site in a new browser tab.
 */
handler.show = function() {...};

/**
 * return a list of GaggleData objects representing the
 * data related to the page as one of the Gaggle data
 * types.
 */
handler.getPageData = function(doc) {...};

/**
 * takes a species and a list of names and submits them
 * for processing by the website. List can be either a
 * Java Array or a Javascript Array.
 */
handler.handleNameList = function(species, names) {...};

The remainder of the case study traces our inquiry into the functional roles these genes may be playing. Using Firegoose, we consult multiple remote data sources and transfer data between them and local desktop tools. For demonstration purposes, the lists of genes in both aerobic and anaerobic clusters were encoded in the Gaggle microformat and embedded in the tutorial [24], which documents the steps of the analysis. The reader is encouraged to install the Firegoose, browse to the tutorial web page, and reproduce the analysis that follows. The analysis is broken into numbered steps to correspond with data transfers labeled in red in the diagram (Fig. 5).

Step 1

We first consult KEGG, a database of curated biochemical pathways. Starting at the tutorial web page, we broadcast the embedded anaerobic gene list to the KEGG Pathway target. This has the same result as cutting and pasting into the KEGG interface; KEGG performs a query for biochemical pathways in which these genes participate. Thirty-one of the 222 genes that were queried matched KEGG pathways including amino acid metabolism, ABC transporters, and active potassium transport. Some of these pathway matches are consistent with known physiological properties of H. salinarum NRC-1, such as its ability to facultatively derive energy by fermenting arginine under anaerobic conditions. Further, this analysis also suggests that uptake systems for several alternate nutritional sources (e.g. glycerol) may also be utilized under anaerobic conditions. Interestingly, KEGG also finds two transcription factors in this anaerobic set, tfbA and tfbE, providing clues to possible regulatory mechanisms.

Although the information gathered from this first line of analysis has yielded considerable insight into the physiological adjustment to an anoxic environment, 191 genes did not match any enzymes within the KEGG catalog of pathways. At this point, we could narrow our investigation to genes in a particular pathway or to the unmatched genes, which Firegoose can capture and broadcast onward. But, for the next step, we will stick with the full set of anaerobically induced genes.

Step 2

To continue our analysis, we use EMBL STRING, a powerful tool focused on protein interactions computed from supporting evidence such as sequence homology, journal abstracts, and protein domains. Specifically, we are interested in finding functional associations among genes already classified into metabolic pathways by KEGG and genes of unknown function. We broadcast our list of anaerobically induced genes from the tutorial page to STRING, which responds by displaying a network of nodes (proteins) connected by colored edges representing functional relationships.

We find strongly interconnected components of the net-

work that are associated with specialized metabolism and

do not have corresponding pathways in KEGG. For

instance, a cluster of six genes appears in the network con-

nected by edges indicating chromosomal proximity, co-

occurrence across genomes, and text mining results.

STRING gives annotations and protein domains that

show these genes to be involved in the use of dimethyl

sulfoxide (DMSO) as an alternative electron acceptor [25].

A second gene cluster, connected through chromosomal

proximity associations, encodes gas vesicles used by the

organism to vertically orient itself in the water column

[26].

Step 3

Another interesting grouping links four genes of unknown function: VNG1183H, VNG1184G, VNG1185G, and VNG1187G. VNG1187G contains two multicopper oxidase domains. VNG1185G is annotated as a coenzyme PQQ synthesis protein and VNG1184G as a heme biosynthesis protein. Recentering the network on VNG1187G and expanding twice (to reveal second neighbors in the network) shows functional relationships to several other proteins, including more heme biosynthesis proteins. Because not all of these genes were differentially regulated in response to oxygen, the initial seed generated from expression analysis may have captured some pathways only partially, including this one. However, by combining, filtering, and expanding that seed using a variety of resources, we can navigate towards a comprehensive picture.
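Expanding a network twice around a chosen protein corresponds to collecting every node within two edges of it. This can be sketched as a bounded breadth-first search; the adjacency-list representation, the function name, and the toy network are assumptions made for illustration and do not describe how STRING computes its display.

```python
from collections import deque


def neighbors_within(graph, start, max_hops):
    """Return the nodes reachable from `start` in at most `max_hops` edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for nxt in graph.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return {n for n in distance if n != start}


# Toy undirected network with placeholder node names.
toy = {
    "center": ["n1", "n2"],
    "n1": ["center", "n3"],
    "n2": ["center"],
    "n3": ["n1", "n4"],
    "n4": ["n3"],
}
first = neighbors_within(toy, "center", 1)
second = neighbors_within(toy, "center", 2)
```

Here `first` holds the direct neighbors and `second` adds the nodes one step further out, which is the sense in which a seed list can be grown to recover the rest of a partially captured pathway.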

Step 4

STRING makes its networks available in an XML format,

which the Firegoose can parse and broadcast to other Gag-

gle tools. Broadcasting the network to Cytoscape allows

the user to work with the network more interactively. For

example, subsets of nodes can be selected and broadcast,

or highlighted in response to broadcasts from other tools.

BMC Bioinformatics 2007, 8:456 http://www.biomedcentral.com/1471-2105/8/456


We select the four-gene cluster noted above and broadcast

to the Firegoose for further investigation.
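As a rough illustration of Step 4, the sketch below parses a network from XML into nodes and edges and then keeps only the edges among a selected subset of nodes, as one might before broadcasting a selection. The XML layout, element names, and protein names are hypothetical; this is not the schema STRING publishes, nor the parsing code used by the Firegoose.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for illustration; not the schema STRING publishes.
SAMPLE_XML = """
<network>
  <edge source="proteinA" target="proteinB" evidence="neighborhood"/>
  <edge source="proteinB" target="proteinC" evidence="textmining"/>
  <edge source="proteinC" target="proteinD" evidence="cooccurrence"/>
</network>
"""


def parse_network(xml_text):
    """Return (sorted node names, list of (source, target, evidence) edges)."""
    root = ET.fromstring(xml_text.strip())
    edges = [
        (e.get("source"), e.get("target"), e.get("evidence"))
        for e in root.iter("edge")
    ]
    nodes = sorted({name for s, t, _ in edges for name in (s, t)})
    return nodes, edges


def induced_edges(edges, selected):
    """Keep only the edges whose two endpoints are both selected."""
    chosen = set(selected)
    return [edge for edge in edges if edge[0] in chosen and edge[1] in chosen]


nodes, edges = parse_network(SAMPLE_XML)
subnetwork = induced_edges(edges, ["proteinA", "proteinB", "proteinC"])
```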

Step 5

The H. salinarum NRC-1 proteome has recently been re-annotated using a combination of sequence- and structure-based approaches. These functional annotations are curated and made publicly available on the web [27] through SBEAMS [20], an open source data management system. In this database, VNG1187G is annotated as a putative copper-containing nitrite reductase. Drilling down using the links provided within the SBEAMS database, we see that the supporting evidence is a match to entry 1KBV in the Protein Data Bank (the membrane protein AniA, a copper-containing nitrite reductase from Neisseria gonorrhoeae; e-value < 10^-166). This function, in conjunction with the functional associations with putative heme biosynthesis proteins and a putative coenzyme PQQ synthesis protein, forms the basis of a hypothesis that H. salinarum NRC-1 may possess a nitrite-reducing pathway.

Figure 5. Using Firegoose to investigate transcription response of H. salinarum to anoxic conditions. A list of genes with similar expression profiles is found by microarray analysis. (1) We broadcast these genes to KEGG to query for known biochemical pathways, for example, viewing the Arginine and proline metabolism pathway. (2) We broadcast the genes to EMBL STRING, a protein interaction database, where we can (3) navigate to functionally related genes. (4) In order to manipulate the network, we broadcast it to Cytoscape. (5) A local database provides additional information. (6) DAVID performs functional clustering. (7) For one cluster containing signal transduction genes, we use STRING's links to protein domains.

[Diagram: differentially expressed genes identified in microarray experiments are passed through KEGG, STRING, Cytoscape, the Halobacterium annotations database (SBEAMS), and DAVID.]

Step 6

DAVID is a functional annotation tool that integrates

many of the same primary sources as KEGG and STRING

but displays its results in tabular format rather than as a

network.

DAVID differs from the other resources we have used in

that it doesn't work well with the VNG naming system

used to identify genes in H. salinarum NRC-1. We use a

Gaggle integrated translator utility to translate gene iden-

tifiers between the VNG nomenclature and GI accession

numbers suitable to be broadcast to DAVID. The issue of

multiple naming schemes is a major bottleneck for data

integration in biology. Passing broadcasts through a sim-

ple synonym-mapping translator within the Gaggle helps

overcome this hurdle.
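A synonym-mapping translator of the kind described can be sketched as a dictionary lookup that also reports the names it could not translate. The function name and every identifier below are placeholders chosen for illustration; they are not real VNG or GI identifiers and this is not the Gaggle translator's code.

```python
def translate(names, synonyms):
    """Map each name through `synonyms`; return (translated, untranslated)."""
    translated, missing = [], []
    for name in names:
        target = synonyms.get(name)
        if target is None:
            missing.append(name)  # keep track of names with no known synonym
        else:
            translated.append(target)
    return translated, missing


# Placeholder identifiers only; a real table would be loaded from a file.
synonym_table = {
    "VNG_EXAMPLE_1": "GI_PLACEHOLDER_1",
    "VNG_EXAMPLE_2": "GI_PLACEHOLDER_2",
}
broadcast, untranslated = translate(
    ["VNG_EXAMPLE_1", "VNG_EXAMPLE_3", "VNG_EXAMPLE_2"], synonym_table
)
```

Returning the untranslated names separately, rather than silently dropping them, makes it visible when a broadcast reaches the target site with fewer genes than were sent.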

DAVID also clusters genes with related annotations, con-

veniently summarizing the kinds of cellular processes for

which a set of genes is enriched. Our genes of interest

divide into 17 clusters, several of which were consistent

with the results from previous steps. For example, seven

genes, including one member of the DMSO cluster, have

products annotated as being involved in electron trans-

port.

Step 7

DAVID produces another cluster containing signal trans-

duction proteins. Back in STRING, we can click on these

proteins to view functional domains provided by the

SMART database [28,29]. VNG0355G (Htr14) contains

sensory domains associated with chemotaxis. Both

VNG0716G (AfsQ2) and VNG1175G (PhoR) contain

domains implicated in signal transduction and light sensitivity. This is biologically meaningful because light and oxygen physiology in H. salinarum NRC-1 are tightly coupled with one another and also with physical relocation (taxis).

Summary of results

In this example we have combined four web-based

resources with several desktop applications to build a uni-

fied understanding of the physiology of H. salinarum

NRC-1 under anaerobic conditions. The depletion of oxy-

gen seems to activate certain aspects of amino acid metab-

olism, and alternate energy transduction pathways such as

phototrophy, arginine fermentation and DMSO respira-

tion. The simultaneous induction of gas vesicle synthesis

is consistent with the anaerobic physiological behavior of

H. salinarum NRC-1 to move towards the surface in search

of alternate energy sources including light. The data also

suggest that H. salinarum NRC-1 may have some specific

nutritional requirements in this anoxic environment that

cause the induction of an array of membrane transport

systems and possibly previously uncharacterized func-

tions such as dissimilatory nitrite reduction. In addition,

putative components of the signaling and regulatory

mechanisms that may mediate the transition to this met-

abolic state were also detected.

It is important to recognize that the information

resources, while overlapping to some degree, provide

complementary perspectives. KEGG provides information

regarding biochemical pathways but does not draw asso-

ciations between these pathways and genes of unknown

function. This information can be obtained from STRING,

which on its own is not cognizant of the metabolic path-

ways. Likewise, DAVID includes information on func-

tional domains within proteins and integrates many of

the same primary sources as KEGG and STRING. But,

DAVID presents the information differently with its

unique clustering feature to provide a statistical evalua-

tion of the enrichment of particular functions among the

queried genes. Finally, our local database provides a

source of data curated manually by experts through years

of careful literature surveys and experimentation. Integra-

tion of these web-based resources with desktop applica-

tions through the Gaggle and the Firegoose enables the

type of analysis that is necessary to understand the com-

plex dynamic regulation of cellular responses (Fig. 6).

Complete data from each of our four sources can be found

in the supplementary table (Additional file 1).
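To make the idea of a statistical evaluation of enrichment concrete, the sketch below computes a plain hypergeometric tail probability: the chance of seeing at least a given number of annotated genes in a query list drawn at random from the genome. DAVID's own scoring is not reproduced here; the function name and the small numbers are assumptions chosen so the arithmetic can be checked by hand.

```python
from math import comb


def hypergeom_tail(population, annotated, sample, hits):
    """P(X >= hits) for X = number of annotated genes among `sample` genes
    drawn without replacement from `population` genes, of which `annotated`
    carry the annotation."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favorable / total


# Toy numbers: 10 genes in total, 3 annotated, 4 queried, at least 1 hit.
p_value = hypergeom_tail(population=10, annotated=3, sample=4, hits=1)
```

A small value indicates that the annotation is over-represented in the query list relative to chance, which is the kind of evidence a functional cluster summarizes.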

Comparison with workflow software

Firegoose, along with the Gaggle framework, shares several features with workflow tools such as Taverna [30,31]. They share the strategy of composing distinct

programs and data sources to build larger systems with

rich capabilities. To this end, both Firegoose and work-

flow tools benefit greatly from the availability of program-

matic access to structured data and computational services

over common web protocols.


Workflow tools enable a user to automate a well-defined

and repeatable process, often using web services or mes-

sage queues for interprocess communication. But, before

a well-defined analysis process exists, there is a need for

exploratory analysis, which is necessarily ad-hoc. Gaggle

and the Firegoose seek to enable this kind of interactive

exploration by exploiting the flexibility of web-based

tools in combination with desktop analysis and visualiza-

tion tools. Scripting an interaction with a web site is, of

course, possible using a browser extension, but the

emphasis in the Firegoose is on automating the exchange

of data, leaving the direction of the analysis up to the user.

The difference in emphasis does not rule out using work-

flow engines and Firegoose together. Invoking workflows

on a remote server from within Firegoose is one poten-

tially valuable example.

Figure 6. Integrated visualization of H. salinarum response to anoxic conditions. The transcriptional response to anoxia was characterized using several data sources. Edges represent several types of evidence for functional association provided by STRING. Yellow filled nodes indicate genes classified by KEGG. Blue outline nodes indicate genes classified by DAVID. Other nodes were characterized by an in-house annotation database or other sources, including PFAM, BLAST, and PDB. 102 genes of unknown function were omitted.

[Network diagram. Functional groups labeled in the figure: DMSO respiration; gas vesicle proteins; phototrophy; ABC transporters; transporters; amino acid metabolism; transcription factors; signal transduction; electron transport; putative chitinase; cobalamin binding; cell division / nucleic acid metabolism; putative nitrite reduction pathway; archaeal DNA polymerase. Edge evidence types in the legend: neighborhood, gene fusion, co-occurrence, coexpression, databases, text mining.]


Future Work

Additional functionality could be added to the Firegoose

in a number of ways, most easily through the addition of

more handlers for biological websites. We also considered

that users might want to develop their own handler

scripts. We prototyped code for dynamically importing

custom scripts into the Firegoose. Other projects such as

Greasemonkey [8] have had success with similar capabili-

ties. If further developed, this feature would allow a

straightforward mechanism for users or data providers to

contribute scripts.

Supporting the RMI communications protocol requires Java. Using Java within a Firefox extension is something of a challenge, and code from MIT's Simile project [9] was extremely helpful in this area. An alternative under consideration is to communicate with the Boss using an XML-based protocol over sockets, eliminating the need to run a Java virtual machine in the browser's process.
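One way to picture an XML-based protocol over sockets is a newline-delimited XML message sent from one endpoint and decoded at the other. The sketch below round-trips such a message over a local socket pair. The message layout (a `broadcast` element with a `source` attribute and `name` children) is invented for illustration and is not a protocol defined by the Gaggle.

```python
import socket
import xml.etree.ElementTree as ET


def encode_broadcast(source, names):
    """Serialize a list of names as one line of XML (hypothetical layout)."""
    root = ET.Element("broadcast", {"source": source})
    for name in names:
        ET.SubElement(root, "name").text = name
    # encoding="unicode" yields a str without an XML declaration.
    return ET.tostring(root, encoding="unicode").encode("utf-8") + b"\n"


def decode_broadcast(line):
    """Parse one message back into (source, list of names)."""
    root = ET.fromstring(line)
    return root.get("source"), [n.text for n in root.findall("name")]


def send_and_receive(source, names):
    """Round-trip one message over a connected local socket pair."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(encode_broadcast(source, names))
        sender.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
        with receiver.makefile("rb") as stream:
            line = stream.readline()
    finally:
        sender.close()
        receiver.close()
    return decode_broadcast(line)
```

Calling `send_and_receive("Firegoose", ["geneA", "geneB"])` returns the same source and names that were sent. Because the exchange is plain text over a socket, neither endpoint would need a Java virtual machine.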

We plan to extend the Gaggle microformat to express links

to data in addition to embedding data directly in the page.

This would allow large data structures to be transmitted independently of the page while preserving the linkage

between presentation in the browser and the underlying

structured data. A standard format would decrease the

need for customized coding for each web site.
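The difference between embedding data and linking to it can be sketched with a small HTML parser that collects both. The class names `example-namelist` and `example-datalink` are hypothetical stand-ins and are not the class names defined by the Gaggle microformat [19]; the page fragment is likewise invented.

```python
from html.parser import HTMLParser


class DataBlockParser(HTMLParser):
    """Collect names embedded in a page and links to externally hosted data."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.links = []
        self._in_names = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "example-namelist" in classes:
            self._in_names = True
        elif tag == "a" and "example-datalink" in classes and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Simplification: the name list is assumed to contain only text,
        # so the first closing tag ends it.
        self._in_names = False

    def handle_data(self, data):
        if self._in_names:
            self.names.extend(data.split())


PAGE = (
    '<div class="example-data">'
    '<span class="example-namelist">geneA geneB</span>'
    '<a class="example-datalink" href="data/network.xml">network</a>'
    "</div>"
)
parser = DataBlockParser()
parser.feed(PAGE)
```

After parsing, `parser.names` holds the embedded identifiers and `parser.links` holds the address from which a larger data structure could be fetched separately.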

RDF (Resource Description Framework) is a data model

designed to represent meta-data for Semantic Web appli-

cations. Incorporating support for RDF into the Firegoose

would allow the Gaggle to exchange data with the seman-

tically rich resources envisioned by proponents of the

Semantic Web project.

Conclusion

The Firegoose incorporates Mozilla Firefox into the Gag-

gle environment providing coordinated access to web

applications and programmatic data sources. Performing

data integration in the browser has several advantages and

is perhaps the most interesting feature of the Firegoose.

Browsers excel at search and navigation. Using the Fire-

goose, a biologist can search and navigate web resources

using familiar browser-based interfaces with the addi-

tional capability of easily moving data from one web-

based resource to another as well as between the web and

the desktop. Interactively integrating specific information

as needed replaces the cumbersome process of maintain-

ing local copies of large databases and manually coercing

data from diverse sources into a compatible format. Using

the Gaggle data types as intermediaries lowers the barrier

between web resources and desktop tools, allowing the

scientist to creatively combine and re-use data in ways that

go beyond those provided by the curators of individual

data sources.

The Firegoose positions the Gaggle to take advantage of

increasing use of web protocols to transmit structured

data. The Firegoose provides a framework in which new

web resources can be integrated into the Gaggle in a

straightforward and easily implemented manner, accom-

modating a variety of protocols. In supporting a number

of protocols, we hope to encourage data providers to

make available structured data in the format of their

choice and to provide the necessary information to link

web interfaces with the underlying data allowing brows-

ing and programmatic access to become seamlessly inte-

grated.

If the web is becoming a channel for structured data,

applications that share data between diverse web

resources and software tools will be of increasing impor-

tance. The Firegoose aims to fill this role for the systems

biology domain.

Availability and requirements

Source code for the Firegoose, along with that of the other

components of the Gaggle, is available at the Gaggle web-

site [32]. Also available are instructions for installing and

uninstalling the Firegoose toolbar [33] and documenta-

tion [34]. Most of the desktop components of the Gaggle

are deployed as Java webstarts, which can be launched by

clicking a link in the browser.

The toolbar is compatible with versions 1.5.x and 2.0.x of

Mozilla Firefox. We anticipate maintaining compatibility

with Firefox 3.x when released.

A Java runtime environment, version 5 [35] or higher, is required, and the Java browser plug-in for Firefox must be installed. Extra attention is often required to install the Java browser plug-in on Linux. Specific instructions for most distributions are available on the web.

The source code is distributed under the GNU Lesser General Public License, the text of which is available at: http://www.gnu.org/copyleft/lesser.html.

Authors' contributions

JCB Wrote the manuscript and implemented software.

PTS Conceived and initiated the project, implemented

prototype and provided feedback on the written manu-

script.

AKS Assisted in the conception and implementation of

the case study and provided feedback on the written man-

uscript.


NSB Conceived and initiated the project, wrote the man-

uscript and provided direction and feedback on quality of

results and software design.

All authors read and approved the final manuscript.

Additional material

Acknowledgements

We thank Dan Tenenbaum and Ricardo Vencio for critical reading of the

manuscript and helpful suggestions. We also thank John Boyle, Chris

Cavnor, David Shteynberg, and Neils Gelenburg for thoughtful discussions.

This work was supported by the following grants: NSF: DBI-0640950; DoE:

DE-FG02-07ER64327; and NIH: P50 GM076547.

References

1. Stein LD: Integrating biological databases. Nature reviews 2003, 4(5):337-345.

2. Covitz PA, Hartel F, Schaefer C, De Coronado S, Fragoso G, Sahni H, Gustafson S, Buetow KH: caCORE: a common infrastructure for cancer informatics. Bioinformatics (Oxford, England) 2003, 19(18):2404-2412.

3. Wilkinson MD, Links M: BioMOBY: an open source biological web services proposal. Briefings in bioinformatics 2002, 3(4):331-341.

4. Shannon PT, Reiss DJ, Bonneau R, Baliga NS: The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC bioinformatics 2006, 7:176.

5. Microformats.org [http://microformats.org/]

6. Semantic Web [http://www.w3.org/2001/sw/]

7. Microformats [http://blog.mozilla.com/faaborg/2006/12/11/microformats-part-0-introduction/]

8. Greasemonkey [http://www.greasespot.net/]

9. Huynh D, Mazzocchi S, Karger D: Piggy Bank: Experience the Semantic Web Inside Your Web Browser. International Semantic Web Conference 2005.

10. Operator [https://addons.mozilla.org/en-US/firefox/addon/4106]

11. Mozilla Firefox [http://www.mozilla.com]

12. Kanehisa M: The KEGG database. Novartis Foundation symposium 2002, 247:91-101. discussion 101–103, 119–128, 244–152

13. von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Kruger B, Snel B, Bork P: STRING 7 – recent developments in the integration and prediction of protein interactions. Nucleic acids research 2007:D358-362.

14. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic acids research 2005:D433-437.

15. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome biology 2003, 4(5):P3.

16. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research 2003, 13(11):2498-2504.

17. R Statistical Package [http://www.r-project.org]

18. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, et al.: TM4: a free, open-source system for microarray data management and analysis. BioTechniques 2003, 34(2):374-378.

19. Gaggle Microformat [http://gaggle.systemsbiology.net/docs/geese/firegoose/microformat/]

20. SBEAMS [http://www.sbeams.org/]

21. Schmid A, Reiss DJ, Kaur A, Pan M, King N, Van PT, Hohmann L, Martin DB, Baliga NS: The anatomy of microbial cell state transitions in response to oxygen. Genome research 2007, 17(10):1399-1413.

22. Ng WV, Kennedy SP, Mahairas GG, Berquist B, Pan M, Shukla HD, Lasky SR, Baliga NS, Thorsson V, Sbrogna J, et al.: Genome sequence of Halobacterium species NRC-1. Proceedings of the National Academy of Sciences of the United States of America 2000, 97(22):12176-12181.

23. Ideker T, Thorsson V, Siegel AF, Hood LE: Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. J Comput Biol 2000, 7(6):805-817.

24. Gaggle and Firegoose Oxygen Demo [http://gaggle.systemsbiology.net/projects/demos/halo_oxygen_analysis/]

25. Muller JA, DasSarma S: Genomic analysis of anaerobic respiration in the archaeon Halobacterium sp. strain NRC-1: dimethyl sulfoxide and trimethylamine N-oxide as terminal electron acceptors. Journal of bacteriology 2005, 187(5):1659-1667.

26. Robb FT, Place AR, Sowers KR, Schreier HJ, DasSarma S, Fleischmann EM: Archaea: A laboratory manual. Cold Spring Harbor, New York: Cold Spring Harbor Laboratory Press; 1995.

27. Halobacterium genome annotations [http://baliga.systemsbiology.net/halobacterium/]

28. Ponting CP, Schultz J, Milpetz F, Bork P: SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic acids research 1999, 27(1):229-232.

29. Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains. Proceedings of the National Academy of Sciences of the United States of America 1998, 95(11):5857-5864.

30. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic acids research 2006:W729-732.

31. Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, et al.: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics (Oxford, England) 2004, 20(17):3045-3054.

32. The Gaggle website [http://gaggle.systemsbiology.net]

33. Firegoose Installation Help [http://gaggle.systemsbiology.net/docs/geese/firegoose/install/]

34. Firegoose [http://gaggle.systemsbiology.net/docs/geese/firegoose/]

35. Download Java [http://www.java.com/download/]

Additional file 1

Characterization of H. salinarum anoxic transcription response by four data sources. Data about our genes of interest from KEGG, STRING, DAVID, and the Halobacterium genome annotations database are presented here in tabular form.

[http://www.biomedcentral.com/content/supplementary/1471-2105-8-456-S1.xls]

Additional file 1

Data

November 2007

J Christopher Bare · Paul Shannon · Amy K Schmid · Nitin Baliga

Systems biology approaches to defining transcription regulatory networks in halophilic archaea

Article

Full-text available

May 2015
METHODS

To survive complex and changing environmental conditions, microorganisms use gene regulatory networks (GRNs) composed of interacting regulatory transcription factors (TFs) to control the timing and magnitude of gene expression. Genome-wide datasets; such as transcriptomics and protein-DNA interactions; and experiments such as high throughput growth curves; facilitate the construction of GRNs and provide insight into TF interactions occurring under stress. Systems biology approaches integrate these datasets into models of GRN architecture as well as statistical and/or dynamical models to understand the function of networks occurring in cells. Previously, these types of studies have focused on traditional model organisms (e.g. Escherichia coli, yeast). However, recent advances in archaeal genetics and other tools have enabled a systems approach to understanding GRNs in these relatively less studied archaeal model organisms. In this report, we outline a systems biology workflow for generating and integrating data focusing on the TF regulator. We discuss experimental design, outline the process of data collection, and provide the tools required to produce high confidence regulons for the TFs of interest. We provide a case study as an example of this workflow, describing the construction of a GRN centered on multi-TF coordinate control of gene expression governing the oxidative stress response in the hypersaline-adapted archaeon Halobacterium salinarum. Copyright © 2015. Published by Elsevier Inc.

In a fit of pique: Analyzing microbial ChIP-Seq data with Pique

Preprint

Full-text available

Mar 2014

While numerous effective peak finders have been developed for eukaryotic systems, we have found that the approaches used can be error prone when run on high coverage bacterial and archaeal ChIP-Seq datasets. We have developed Pique, an easy to use ChIP-Seq peak finding application for bacterial and archaeal ChIP-Seq experiments. The software is cross-platform and Open Source, and based on only freely licensed dependencies. Output is provided in standardized file formats, and may be easily imported by the Gaggle Genome Browser (Bare et al. 2010) for manual curation and data exploration, or into statistical and graphics software such as R (R Core Team 2013) for further analysis. The software is available under the BSD-3 license, and tutorial and test data are included with the documentation. http://github.com/ryneches/pique.

In a fit of pique: Analyzing microbial ChIP-Seq data with Pique

Preprint

Full-text available

Mar 2014

A Characterization of Data Exchange between Visual Analytics Tools

Conference Paper

Full-text available

Sep 2020

Over the past years, the visualization of large and complex data sets brought up various Visual Analytics (VA) tools in order to solve domain-specific tasks. These VA tools are typically implemented as individual software components in data-flow-oriented models, meaning that data is transferred from one component to the next. While most VA frameworks rely on a monolithic architecture with features for the integration of specialized analysis methods, we consider a loose coupling of independent applications, where autonomous VA tools are used in predefined analysis sequences. To this end, we provide a characterization of the data exchange process among individual VA tools in the form of a taxonomy. This taxonomy can be used as a checklist to identify characteristics and improve the data flow of one's own multi-tool VA setup. For this purpose, we conducted a systematic investigation of the individual aspects of data exchange that are commonly found across different usage scenarios. We apply our taxonomy to three existing multi-tool frameworks, the open-source library ReVize, the toolchain editor AnyProc, and the visualization and monitoring framework Plant@Hand3D.

Network portal: A database for storage, analysis and visualization of biological networks

Article

Full-text available

Nov 2013
NUCLEIC ACIDS RES

The ease of generating high-throughput data has enabled investigations into organismal complexity at the systems level through the inference of networks of interactions among the various cellular components (genes, RNAs, proteins and metabolites). The wider scientific community, however, currently has limited access to tools for network inference, visualization and analysis because these tasks often require advanced computational knowledge and expensive computing resources. We have designed the network portal (http://networks.systemsbiology.net) to serve as a modular database for the integration of user uploaded and public data, with inference algorithms and tools for the storage, visualization and analysis of biological networks. The portal is fully integrated into the Gaggle framework to seamlessly exchange data with desktop and web applications and to allow the user to create, save and modify workspaces, and it includes social networking capabilities for collaborative projects. While the current release of the database contains networks for 13 prokaryotic organisms from diverse phylogenetic clades (4678 co-regulated gene modules, 3466 regulators and 9291 cis-regulatory motifs), it will be rapidly populated with prokaryotic and eukaryotic organisms as relevant data become available in public repositories and through user input. The modular architecture, simple data formats and open API support community development of the portal.

Web scraping technologies in an API world

Article

Apr 2013
BRIEF BIOINFORM

Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover for all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is set on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well-known in clinical microbiology and similar domains, do not offer programmatic interfaces yet. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as means to cope with gene set enrichment analysis.

Hypoxia‐responsive miR‐141‐3p is involved in the progression of breast cancer via mediating HMGB1/HIF‐1α signaling pathway

Article

Full-text available

May 2020

Backgrounds Hypoxia‐responsive miRs have been frequently reported in the growth of various malignant tumors. The present study was aimed to investigate whether hypoxia‐responsive miR‐141‐3p was implicated in the pathogenesis of breast cancer via mediating HMGB1/HIF‐1α signaling pathway. Material and methods miRs expression profiling was filtrated by miR microarray assays. Gene and protein expression levels were respectively examined by RT‐qPCR and western blotting. Cell migration and invasion were analyzed using a transwell assay. Cell growth was determined using nude‐mouse transplanted tumor experiments. Results miR‐141‐3p was observed as a hypoxia‐responsive miR in breast cancer. miR‐141‐3p was down‐regulated in breast cancer specimens and could serve as an independent prognostic factor for predicting overall survival in breast cancer patients. In addition, overexpression of miR‐141‐3p could inhibit hypoxia‐induced cell migration and impede human breast cancer MDA‐MB‐231 cell growth in vivo. Mechanistically, hypoxia‐related HMGB1/HIF‐1α signaling pathway might be a possible target of miR‐141‐3p to prevent the development of breast cancer. Conclusions Our finding provides a new mechanism that miR‐141‐3p could prevent hypoxia‐induced breast tumorigenesis by post‐transcriptional repression of HMGB1/HIF‐1α signaling pathway.

SUPPLEMENTARY DATA

Data

Full-text available

Apr 2015

Bioinformatics Strategies for Understanding Gene Expression in Human Pluripotent Cells

Article

Nov 2010

Introduction; Microarray-Based RNA Measurement; From Chip-Based Transcriptomics to Sequencing-Based Transcriptomics; MicroRNA Profiling in Stem Cells; Some Examples of Tools/Software Suites for Data Integration, Network Analysis, and Data Visualization; References

cMonkey2: Automated, systematic, integrated detection of co-regulated gene modules for any organism

Article

Full-text available

Apr 2015
NUCLEIC ACIDS RES

The cMonkey integrated biclustering algorithm identifies conditionally co-regulated modules of genes (biclusters). cMonkey integrates various orthogonal pieces of information which support evidence of gene co-regulation, and optimizes biclusters to be supported simultaneously by one or more of these prior constraints. The algorithm served as the cornerstone for constructing the first global, predictive Environmental Gene Regulatory Influence Network (EGRIN) model for a free-living cell, and has now been applied to many more organisms. However, due to its computational inefficiencies, long run-time and complexity of various input data types, cMonkey was not readily usable by the wider community. To address these primary concerns, we have significantly updated the cMonkey algorithm and refactored its implementation, improving its usability and extendibility. These improvements provide a fully functioning and user-friendly platform for building co-regulated gene modules and the tools necessary for their exploration and interpretation. We show, via three separate analyses of data for E. coli, M. tuberculosis and H. sapiens, that the updated algorithm and inclusion of novel scoring functions for new data types (e.g. ChIP-seq and transcription factor over-expression [TFOE]) improve discovery of biologically informative co-regulated modules. The complete cMonkey2 software package, including source code, is available at https://github.com/baliga-lab/cmonkey2. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

DAVID: Database for Annotation, Visualization, and Integrated Discovery

Article

Full-text available

Apr 2003

Background: Functional annotation of differentially expressed genes is a necessary and critical step in the analysis of microarray data. The distributed nature of biological knowledge frequently requires researchers to navigate through numerous web-accessible databases gathering information one gene at a time. A more judicious approach is to provide query-based access to an integrated database that disseminates biologically rich information across large datasets and displays graphic summaries of functional information. Results: Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://www.david.niaid.nih.gov) addresses this need via four web-based analysis modules: 1) Annotation Tool - rapidly appends descriptive data from several public databases to lists of genes; 2) GoCharts - assigns genes to Gene Ontology functional categories based on user selected classifications and term specificity level; 3) KeggCharts - assigns genes to KEGG metabolic processes and enables users to view genes in the context of biochemical pathway maps; and 4) DomainCharts - groups genes according to PFAM conserved protein domains. Conclusions: Analysis results and graphical displays remain dynamically linked to primary data and external data repositories, thereby furnishing in-depth as well as broad-based data coverage. The functionality provided by DAVID accelerates the analysis of genome-scale datasets by facilitating the transition from data collection to biological meaning.

Genome Sequence of Halobacterium Species NRC-1

Article

Full-text available

Nov 2000

We report the complete sequence of an extreme halophile, Halobacterium sp. NRC-1, harboring a dynamic 2,571,010-bp genome containing 91 insertion sequences representing 12 families and organized into a large chromosome and 2 related minichromosomes. The Halobacterium NRC-1 genome codes for 2,630 predicted proteins, 36% of which are unrelated to any previously reported. Analysis of the genome sequence shows the presence of pathways for uptake and utilization of amino acids, active sodium-proton antiporter and potassium uptake systems, sophisticated photosensory and signal transduction pathways, and DNA replication, transcription, and translation systems resembling more complex eukaryotic organisms. Whole proteome comparisons show the definite archaeal nature of this halophile with additional similarities to the Gram-positive Bacillus subtilis and other bacteria. The ease of culturing Halobacterium and the availability of methods for its genetic manipulation in the laboratory, including construction of gene knockouts and replacements, indicate this halophile can serve as an excellent model system among the archaea.

Archaea: a laboratory manual - halophiles

Article

Full-text available

Jan 1995

Schultz J, Milpetz F, Bork P & Ponting CP. SMART, a simple modular architecture research tool: Identification of signaling domains. Proc Natl Acad Sci USA 95: 5857-5864

Article

Full-text available

Jun 1998

Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 greater than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline 1; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.

SMART: Identification and annotation of domains from signalling and extracellular protein sequences

Article

Full-text available

Feb 1999

SMART is a simple modular architecture research tool and database that provides domain identification and annotation on the WWW (http://coot.embl-heidelberg.de/SMART). The tool compares query sequences with its databases of domain sequences and multiple alignments whilst concurrently identifying compositionally biased regions such as signal peptide, transmembrane and coiled coil segments. Annotated and unannotated regions of the sequence can be used as queries in searches of sequence databases. The SMART alignment collection represents more than 250 signalling and extracellular domains. Each alignment is curated to assign appropriate domain boundaries and to ensure its quality. In addition, each domain is annotated extensively with respect to cellular localisation, species distribution, functional class, tertiary structure and functionally important residues.

SMART, a simple modular architecture research tool: Identification of signaling domains

Conference Paper

May 1998

Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 greater than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline I; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.

Ciba Foundation Symposium

Article

Jan 1983

M. Pereira

Piggy Bank: Experience the Semantic Web Inside Your Web Browser

Article

Mar 2007
J WEB SEMANT

The Semantic Web Initiative envisions a Web wherein information is offered free of presentation, allowing more effective exchange and mixing across web sites and across web pages. But without substantial Semantic Web content, few tools will be written to consume it; without many such tools, there is little appeal to publish Semantic Web content. To break this chicken-and-egg problem, thus enabling more flexible information access, we have created a web browser extension called Piggy Bank that lets users make use of Semantic Web content within Web content as users browse the Web. Wherever Semantic Web content is not available, Piggy Bank can invoke screenscrapers to restructure information within web pages into Semantic Web format. Through the use of Semantic Web technologies, Piggy Bank provides direct, immediate benefits to users in their use of the existing Web. Thus, the existence of even just a few Semantic Web-enabled sites or a few scrapers already benefits users. Piggy Bank thereby offers an easy, incremental upgrade path to users without requiring a wholesale adoption of the Semantic Web’s vision. To further improve this Semantic Web experience, we have created Semantic Bank, a web server application that lets Piggy Bank users share the Semantic Web information they have collected, enabling collaborative efforts to build sophisticated Semantic Web information repositories through simple, everyday use of Piggy Bank.

Integrating Biological Databases.

Conference Paper

Jan 2003

Testing for Differentially-Expressed Genes by Maximum-Likelihood Analysis of Microarray Data

Article

Feb 2000

Although two-color fluorescent DNA microarrays are now standard equipment in many molecular biology laboratories, methods for identifying differentially expressed genes in microarray data are still evolving. Here, we report a refined test for differentially expressed genes which does not rely on gene expression ratios but directly compares a series of repeated measurements of the two dye intensities for each gene. This test uses a statistical model to describe multiplicative and additive errors influencing an array experiment, where model parameters are estimated from observed intensities for all genes using the method of maximum likelihood. A generalized likelihood ratio test is performed for each gene to determine whether, under the model, these intensities are significantly different. We use this method to identify significant differences in gene expression among yeast cells growing in galactose-stimulating versus non-stimulating conditions and compare our results with current approaches for identifying differentially-expressed genes. The effect of sample size on parameter optimization is also explored, as is the use of the error model to compare the within- and between-slide intensity variation intrinsic to an array experiment.
