BioWarehouse: a bioinformatics database warehouse toolkit.

Bioinformatics Research Group, SRI International, Menlo Park, USA.
BMC Bioinformatics (Impact Factor: 2.67). 02/2006; 7:170. DOI: 10.1186/1471-2105-7-170
Source: PubMed

ABSTRACT This article addresses the problem of interoperation of heterogeneous bioinformatics databases.
We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, enabling not only multi-database queries using the Structured Query Language (SQL) but also a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and Java languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier to metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research.
BioWarehouse embodies significant progress on the database integration problem for bioinformatics.
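
The application example in the abstract (finding enzyme activities with no known sequence) is the kind of question a single-schema warehouse can answer with one SQL query. The following is a minimal sketch of that idea, not the authors' query: it uses Python's built-in sqlite3 module as a stand-in for MySQL or Oracle, and the table names (enzyme_activity, protein_sequence), column names, and all rows are hypothetical placeholders rather than the actual BioWarehouse schema or data.

```python
import sqlite3


def build_toy_warehouse():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical tables; NOT the real BioWarehouse schema.
        CREATE TABLE enzyme_activity (ec_number TEXT PRIMARY KEY);
        CREATE TABLE protein_sequence (
            accession TEXT PRIMARY KEY,
            ec_number TEXT,      -- activity the sequence is annotated with
            source_db TEXT       -- which loaded database the row came from
        );
        """
    )
    # Placeholder activities: four EC-like identifiers.
    conn.executemany(
        "INSERT INTO enzyme_activity VALUES (?)",
        [("EC-A",), ("EC-B",), ("EC-C",), ("EC-D",)],
    )
    # Placeholder sequences from several source databases; EC-D has none.
    conn.executemany(
        "INSERT INTO protein_sequence VALUES (?, ?, ?)",
        [
            ("SEQ-1", "EC-A", "UniProt"),
            ("SEQ-2", "EC-A", "GenBank"),
            ("SEQ-3", "EC-B", "UniProt"),
            ("SEQ-4", "EC-C", "CMR"),
        ],
    )
    conn.commit()
    return conn


def fraction_without_sequence(conn):
    """Return the fraction of enzyme activities with no sequence in any source."""
    total, missing = conn.execute(
        """
        SELECT COUNT(*),
               SUM(CASE WHEN NOT EXISTS (
                       SELECT 1 FROM protein_sequence AS p
                       WHERE p.ec_number = e.ec_number
                   ) THEN 1 ELSE 0 END)
        FROM enzyme_activity AS e
        """
    ).fetchone()
    if not total:
        return 0.0  # empty table: avoid dividing by zero
    return (missing or 0) / total


if __name__ == "__main__":
    warehouse = build_toy_warehouse()
    print(f"Fraction without sequence: {fraction_without_sequence(warehouse):.2f}")
```

Because sequences from every source database sit in one table, the NOT EXISTS check covers all of them at once; with separate, unintegrated databases the same question would require querying each source and reconciling identifiers by hand.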

  • ABSTRACT: We describe the potential of current Web 2.0 technologies to achieve data mashup in the health care and life sciences (HCLS) domains, and compare that potential to the nascent trend of performing semantic mashup. After providing an overview of Web 2.0, we demonstrate two scenarios of data mashup, facilitated by the following Web 2.0 tools and sites: Yahoo! Pipes, Dapper, Google Maps and GeoCommons. In the first scenario, we exploited Dapper and Yahoo! Pipes to implement a challenging data integration task in the context of DNA microarray research. In the second scenario, we exploited Yahoo! Pipes, Google Maps, and GeoCommons to create a geographic information system (GIS) interface that allows visualization and integration of diverse categories of public health data, including cancer incidence and pollution prevalence data. Based on these two scenarios, we discuss the strengths and weaknesses of these Web 2.0 mashup technologies. We then describe the Semantic Web, the mainstream Web 3.0 technology that enables more powerful data integration over the Web. We discuss the areas of intersection of Web 2.0 and the Semantic Web, and describe the potential benefits that can be brought to HCLS research by combining these two sets of technologies.
    Journal of Biomedical Informatics 05/2008; 41(5):694-705. DOI:10.1016/j.jbi.2008.04.001 · 2.48 Impact Factor
  • ABSTRACT: Data integration is a perennial issue in bioinformatics, with many systems being developed and many technologies offered as a panacea for its resolution. The fact that it is still a problem indicates a persistence of underlying issues. Progress has been made, but we should ask "what lessons have been learnt?", and "what still needs to be done?" Semantic Web and Web 2.0 technologies are the latest to find traction within bioinformatics data integration. Now we can ask whether the Semantic Web, mashups, or their combination, have the potential to help. This paper is based on the opening invited talk by Carole Goble given at the Health Care and Life Sciences Data Integration for the Semantic Web Workshop collocated with WWW2007. The paper expands on that talk. We attempt to place some perspective on past efforts, highlight the reasons for success and failure, and indicate some pointers to the future.
    Journal of Biomedical Informatics 03/2008; 41(5):687-93. DOI:10.1016/j.jbi.2008.01.008 · 2.48 Impact Factor
  • ABSTRACT: Ontologies play an important role in biomedical research. The Gene Ontology (GO) is the most widely accepted and used ontology. It is the result of a collaboration among model organism databases to generate structured vocabularies for annotation purposes. While GO was designed as a vocabulary for standardizing gene product annotations, many other applications also use it as a tool for semantic computation. This paper focuses on a general description of the constituent parts of GO and on its relationship with other cutting-edge technologies, such as those known jointly as Semantic Web technologies. Furthermore, we show the usefulness of GO by providing some examples of its applications in functional genomics, biomedical text mining and protein function prediction. Finally, we consider some current trends in GO development.

