Biomine: predicting links between biological entities using network models of heterogeneous databases

BMC Bioinformatics (Impact Factor: 2.58). 06/2012; 13(1):119. DOI: 10.1186/1471-2105-13-119
Source: PubMed

ABSTRACT Background
Biological databases contain large amounts of data concerning the functions and associations of genes and proteins. Integration of data from several such databases into a single repository can aid the discovery of previously unknown connections spanning multiple types of relationships and databases.

Biomine is a system that integrates cross-references from several biological databases into a graph model with multiple types of edges, such as protein interactions, gene-disease associations and gene ontology annotations. Edges are weighted based on their type, reliability, and informativeness. We present Biomine and evaluate its performance in link prediction, where the goal is to predict pairs of nodes that will be connected in the future, based on current data. In particular, we formulate protein interaction prediction and disease gene prioritization tasks as instances of link prediction. The predictions are based on a proximity measure computed on the integrated graph. We consider and experiment with several such measures, and perform a parameter optimization procedure where different edge types are weighted to optimize link prediction accuracy. We also propose a novel method for disease-gene prioritization, defined as finding a subset of candidate genes that cluster together in the graph. We experimentally evaluate Biomine by predicting future annotations in the source databases and prioritizing lists of putative disease genes.
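The idea of scoring candidate links by proximity on a typed, weighted graph can be sketched roughly as follows. This is only an illustration, not the Biomine implementation: the abstract does not say which proximity measures are used, so the best-path measure here (the largest product of edge weights along any path) is an assumption, as are the node names, the edge types, the numeric values in `TYPE_WEIGHT`, and the choice to multiply the type weight by a per-edge reliability.

```python
import heapq
import math

# Hypothetical per-edge-type weights. In the abstract's terms, these are the
# parameters that would be tuned to optimize link prediction accuracy.
TYPE_WEIGHT = {"interacts_with": 0.9, "associated_with": 0.7, "annotated_with": 0.5}

# Toy undirected graph: (node_a, node_b, edge_type, reliability in (0, 1]).
EDGES = [
    ("geneA", "proteinA", "interacts_with", 0.9),
    ("proteinA", "proteinB", "interacts_with", 0.8),
    ("proteinB", "diseaseX", "associated_with", 0.6),
    ("geneA", "GO:term", "annotated_with", 0.9),
    ("GO:term", "diseaseX", "annotated_with", 0.4),
]


def build_graph(edges, type_weight):
    """Build an adjacency list; each edge weight is type weight * reliability."""
    graph = {}
    for a, b, etype, reliability in edges:
        w = type_weight[etype] * reliability
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    return graph


def best_path_proximity(graph, source, target):
    """Largest product of edge weights over any path; 0.0 if unreachable."""
    # Dijkstra on costs -log(w): minimizing the summed cost maximizes the
    # product of weights. Requires every weight to lie in (0, 1].
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return math.exp(-d)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for neighbour, w in graph.get(node, []):
            nd = d - math.log(w)
            if nd < dist.get(neighbour, math.inf):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return 0.0


def rank_candidates(graph, source, candidates):
    """Score each candidate link (source, c) and sort by decreasing proximity."""
    scored = [(best_path_proximity(graph, source, c), c) for c in candidates]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    g = build_graph(EDGES, TYPE_WEIGHT)
    for score, node in rank_candidates(g, "geneA", ["diseaseX", "proteinB"]):
        print(f"{node}: {score:.4f}")
```

Under these assumptions, re-weighting the edge types changes which paths dominate and therefore the ranking of candidate links, which is the kind of effect the parameter optimization described above would exploit.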

The experimental results show that Biomine has strong potential for predicting links when a set of selected candidate links is available. The predictions obtained using the entire Biomine dataset are shown to clearly outperform ones obtained using any single source of data alone, when different types of links are suitably weighted. In the gene prioritization task, an established reference set of disease-associated genes is useful, but the results show that under favorable conditions, Biomine can also perform well when no such information is available.
The Biomine system is a proof of concept. Its current version contains 1.1 million entities and 8.1 million relations between them, with a focus on human genetics. Some of its functionalities are available through a public query interface, which allows searching for and visualizing connections between given biological entities.

  • Source
    • "The edges may represent the possibility that the proteins may interact. The labels may correspond to properties of proteins [4]. • In a movie-actor network, the nodes correspond to the actors. "
    ABSTRACT: In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. If the information about link reliability is not used explicitly, the classification accuracy in the underlying network may be affected adversely. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model and automatic parameter selection, and show that the incorporation of uncertainty in the classification process as a first-class citizen is beneficial. We experimentally evaluate the proposed approach using different real data sets, and study the behavior of the algorithms under different conditions. The results demonstrate the effectiveness and efficiency of our approach.
  • Source
    • "Another important integration feature of GoMapMan database is accessibility to knowledge stored in different biological databases. These readily accessible data spanning gene functional annotations, protein structure and interactions, pathway information and connections with other ontologies can further improve analyses of high-throughput data and aid in the discovery of previously unknown connections (24) and enable easy knowledge access to researchers interested in a particular gene through browse/search features of GoMapMan. External data in MapMan can be accessed from the Gene Details View (described below in chapter ‘Visualization’), where a gene contains links to the external databases, thus enabling easier retrieval of all the relevant information. "
    ABSTRACT: GoMapMan is an open web-accessible resource for gene functional annotations in the plant sciences. It was developed to facilitate improvement, consolidation and visualization of gene annotations across several plant species. GoMapMan is based on the MapMan ontology, organized in the form of a hierarchical tree of biological concepts, which describe gene functions. Currently, genes of the model species Arabidopsis and three crop species (potato, tomato and rice) are included. The main features of GoMapMan are (i) dynamic and interactive gene product annotation through various curation options; (ii) consolidation of gene annotations for different plant species through the integration of orthologue group information; (iii) traceability of gene ontology changes and annotations; (iv) integration of external knowledge about genes from different public resources; and (v) provision of the gathered information to high-throughput analysis tools via dynamically generated export files. All GoMapMan functionalities are openly available, except for the curation functions, which require prior registration to ensure traceability of the implemented changes.
    Nucleic Acids Research 11/2013; 42(Database issue). DOI:10.1093/nar/gkt1056 · 9.11 Impact Factor
  • Source
    • "There are various representation formalisms that can be used to represent a network topology, including the directed graphs formalism as used in the Systems Biology Graphical Notation by Le Novère et al. [16] or the modified EPN (mEPN) scheme proposed by Raza et al. [17]. To construct the signalling network topology, different information sources can be used, including pathway databases such as the KEGG Pathway [18], Reactome [19] and BioCyc [20], integrated knowledge sources such as ONDEX [21] and Biomine [22]–[23], and the scientific literature itself. Given that most of human biological knowledge is still stored only in the silos of biological literature, retrieving information from the literature is required when building the signalling network topology. "
    ABSTRACT: Plant defence signalling in response to various pathogens, including viruses, is a complex phenomenon. In a resistant interaction, a plant cell perceives the pathogen signal, transduces it within the cell and reprograms the cell metabolism, which leads to the arrest of pathogen replication. This work focuses on the signalling pathways crucial for the plant defence response, i.e., the salicylic acid, jasmonic acid and ethylene signal transduction pathways, in the model plant Arabidopsis thaliana. The initial signalling network topology was constructed manually by defining the representation formalism, encoding the information from public databases and literature, and composing a pathway diagram. The manually constructed network structure consists of 175 components and 387 reactions. To complement the network topology with possibly missing relations, a new approach to automated information extraction from biological literature was developed. This approach, named Bio3graph, allows for automated extraction of biological relations from the literature, resulting in a set of (component1, reaction, component2) triplets that compose a graph structure which can be visualised, compared to the manually constructed topology and examined by experts. Using a plant defence response vocabulary of components and reaction types, Bio3graph was applied to a set of 9,586 relevant full text articles, resulting in 137 newly detected reactions between the components. Finally, the manually constructed topology and the new reactions were merged to form a network structure consisting of 175 components and 524 reactions. The resulting pathway diagram of plant defence signalling represents a valuable source for further computational modelling and interpretation of omics data. The developed Bio3graph approach, implemented as an executable language processing and graph visualisation workflow, is publicly available and can be utilised for modelling other biological systems, provided that an adequate vocabulary is supplied.
    PLoS ONE 12/2012; 7(12):e51822. DOI:10.1371/journal.pone.0051822 · 3.23 Impact Factor