BioWarehouse: A bioinformatics database warehouse toolkit

Bioinformatics Research Group, SRI International, Menlo Park, USA.
BMC Bioinformatics (Impact Factor: 2.58). 02/2006; 7(1):170. DOI: 10.1186/1471-2105-7-170
Source: PubMed


This article addresses the problem of interoperation of heterogeneous bioinformatics databases.

We introduce BioWarehouse, an open-source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database management systems. BioWarehouse integrates its component databases into a common representational framework within a single database management system, enabling not only multi-database queries using the Structured Query Language (SQL) but also a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, as well as the UniProt, GenBank, NCBI Taxonomy, and CMR databases and the Gene Ontology. Loader tools, written in the C and Java languages, parse these databases and load them into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic-acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies.

As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases: no sequence exists for 36% of enzyme activities to which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier to metabolic engineering. Complex queries of this type illustrate the value of the data warehousing approach to bioinformatics research.

BioWarehouse embodies significant progress on the database integration problem for bioinformatics.
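The kind of multi-database SQL query described above, such as the orphan-enzyme analysis, can be sketched in miniature. The table and column names below (`dataset`, `reaction`, `protein`) are simplified, hypothetical stand-ins for the actual BioWarehouse schema, and SQLite stands in for MySQL or Oracle; this is an illustration of the warehousing idea, not the toolkit's real schema.

```python
import sqlite3

# Hypothetical, simplified warehouse schema: each loaded source database
# gets a row in `dataset`, and every entity records which dataset it
# came from, so one SQL query can span several source databases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE reaction (id INTEGER PRIMARY KEY, ec_number TEXT, dataset_id INTEGER);
CREATE TABLE protein  (id INTEGER PRIMARY KEY, ec_number TEXT, dataset_id INTEGER);
""")
conn.executemany("INSERT INTO dataset VALUES (?, ?)",
                 [(1, "ENZYME"), (2, "UniProt")])
# EC activities defined in the ENZYME dataset.
conn.executemany("INSERT INTO reaction VALUES (?, ?, ?)",
                 [(1, "1.1.1.1", 1), (2, "4.2.1.20", 1), (3, "2.7.1.91", 1)])
# Sequenced proteins from the UniProt dataset, annotated with EC numbers.
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)",
                 [(1, "1.1.1.1", 2), (2, "4.2.1.20", 2)])

# Multi-database query: EC numbers with a defined activity but no
# associated sequence in any loaded sequence database ("orphan" activities).
orphans = conn.execute("""
    SELECT DISTINCT r.ec_number
    FROM reaction r
    WHERE NOT EXISTS (SELECT 1 FROM protein p WHERE p.ec_number = r.ec_number)
""").fetchall()
print(orphans)  # -> [('2.7.1.91',)]
```

Because all sources share one relational schema, the cross-database comparison reduces to a single join-style query rather than ad hoc parsing of each source's native format.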

    • "A draft putative orphan list was assembled by searching for EC numbers that lacked associated amino acid or protein sequence data in each of Enzyme DB, SwissProt, TrEMBL, BioCyc proteins, BioCyc reactions, and Orenza. In March 2009 we performed a number of SQL queries against bioinformatics databases using the BioWarehouse biological database warehousing system [34]. The objectives of the queries were to create a definitive list of all defined EC numbers and create a list of EC numbers with known associated sequence (which were thus ruled out from being orphan enzymes). "
    ABSTRACT: Despite advances in sequencing technology, there are still significant numbers of well-characterized enzymatic activities for which there are no known associated sequences. These 'orphan enzymes' represent glaring holes in our biological understanding, and it is a top priority to reunite them with their coding sequences. Here we report a methodology for resolving orphan enzymes through a combination of database search and literature review. Using this method we were able to reconnect over 270 orphan enzymes with their corresponding sequence. This success points toward how we can systematically eliminate the remaining orphan enzymes and prevent the introduction of future orphan enzymes.
    PLoS ONE 05/2014; 9(5):e97250. DOI:10.1371/journal.pone.0097250 · 3.23 Impact Factor
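The methodology quoted above boils down to a set difference: the orphan candidate list is the set of all defined EC numbers minus those with an associated sequence in at least one source database. A minimal sketch, with made-up EC numbers and source names standing in for the real query results:

```python
# Hypothetical illustration of the quoted orphan-list methodology.
all_ec = {"1.1.1.1", "2.7.1.91", "4.2.1.20", "5.3.1.9"}  # all defined EC numbers

# EC numbers with an associated sequence, per source database (toy data).
sequenced_by_source = {
    "SwissProt": {"1.1.1.1", "5.3.1.9"},
    "TrEMBL":    {"4.2.1.20"},
}

# Any sequence in any source rules an EC number out of the orphan list.
has_sequence = set().union(*sequenced_by_source.values())
orphan_candidates = sorted(all_ec - has_sequence)
print(orphan_candidates)  # -> ['2.7.1.91']
```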
    • "Information linkage implementations, like SRS [2] or NCBI Entrez [3], enable users to interrogate several sources through a single Web site and provide results with links to the data sources; yet, they do not integrate the retrieved data. Fully materialized systems, like EnsMart [4] or BioWarehouse [5], integrate data within a warehouse according to a local schema. This approach allows performing easily complex computations on the integrated data, but requires updating often the data warehouse, which generally is a complex task. "
    ABSTRACT: The huge amount of biomedical-molecular data increasingly produced is providing scientists with potentially valuable information. Yet, this quantity of data makes it difficult to find and extract the data that are most reliable and most relevant to the biomedical questions to be answered, which are increasingly complex and often involve many different biomedical-molecular aspects. Such questions can be addressed only by comprehensively searching and exploring different types of data, which frequently are ordered and provided by different data sources. Search Computing has been proposed for the management and integration of ranked results from heterogeneous search services. Here, we present its novel application to the explorative search of distributed biomedical-molecular data and the integration of the search results to answer complex biomedical questions. A set of available bioinformatics search services has been modelled and registered in the Search Computing framework, and a Bioinformatics Search Computing application (Bio-SeCo) using such services has been created and made publicly available. It offers an integrated environment which eases search, exploration, and ranking-aware combination of heterogeneous data provided by the available registered services, and supplies global results that can support answering complex multi-topic biomedical questions. By using Bio-SeCo, scientists can explore the very large and very heterogeneous biomedical-molecular data available. They can easily make different explorative search attempts, inspect obtained results, select the most appropriate, expand or refine them, and move forward and backward in the construction of a global complex biomedical query over multiple distributed sources that could eventually find the most relevant results. Thus, it provides extremely useful automated support for exploratory integrated bio search, which is fundamental for Life Science data-driven knowledge discovery.
    BMC Bioinformatics 01/2014; 15 Suppl 1(Suppl 1):S3. DOI:10.1186/1471-2105-15-S1-S3 · 2.58 Impact Factor
    • "In ONDEX, the integrated data is accessed by providing a standard pipeline, in which individual filtering and graph layout operations may be combined to process the graph in application-specific ways. BioWarehouse [12] aims to provide generic tools for enabling users to build their own combinations of biological data sources. Their data management approach is rather similar to ONDEX and Biozon, but the data is stored in a relational database with a dedicated table for each data type instead of a generic graph structure. "
    ABSTRACT: Background Biological databases contain large amounts of data concerning the functions and associations of genes and proteins. Integration of data from several such databases into a single repository can aid the discovery of previously unknown connections spanning multiple types of relationships and databases. Results Biomine is a system that integrates cross-references from several biological databases into a graph model with multiple types of edges, such as protein interactions, gene-disease associations and gene ontology annotations. Edges are weighted based on their type, reliability, and informativeness. We present Biomine and evaluate its performance in link prediction, where the goal is to predict pairs of nodes that will be connected in the future, based on current data. In particular, we formulate protein interaction prediction and disease gene prioritization tasks as instances of link prediction. The predictions are based on a proximity measure computed on the integrated graph. We consider and experiment with several such measures, and perform a parameter optimization procedure where different edge types are weighted to optimize link prediction accuracy. We also propose a novel method for disease-gene prioritization, defined as finding a subset of candidate genes that cluster together in the graph. We experimentally evaluate Biomine by predicting future annotations in the source databases and prioritizing lists of putative disease genes. Conclusions The experimental results show that Biomine has strong potential for predicting links when a set of selected candidate links is available. The predictions obtained using the entire Biomine dataset are shown to clearly outperform ones obtained using any single source of data alone, when different types of links are suitably weighted. 
In the gene prioritization task, an established reference set of disease-associated genes is useful, but the results show that under favorable conditions, Biomine can also perform well when no such information is available. The Biomine system is a proof of concept. Its current version contains 1.1 million entities and 8.1 million relations between them, with a focus on human genetics. Some of its functionalities are available through a public query interface, allowing users to search for and visualize connections between given biological entities.
    BMC Bioinformatics 06/2012; 13(1):119. DOI:10.1186/1471-2105-13-119 · 2.58 Impact Factor