Robust and accurate data enrichment statistics via distribution function of sum of weights

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Bioinformatics (Impact Factor: 4.98). 11/2010; 26(21):2752-9. DOI: 10.1093/bioinformatics/btq511
Source: PubMed

ABSTRACT Term-enrichment analysis facilitates biological interpretation by assigning, to experimentally or computationally obtained data, annotations associated with terms from controlled vocabularies. This process usually involves obtaining statistical significance for each vocabulary term and using the most significant terms to describe a given set of biological entities, which are often associated with weights. Many existing enrichment methods require selecting an arbitrary number of the most significant entities and/or do not account for the weights of entities. Others either mandate extensive simulations to obtain statistics or assume a normal weight distribution. In addition, most methods have difficulty assigning correct statistical significance to terms with few entities.
Implementing the well-known Lugannani-Rice formula, we have developed a novel approach, called SaddleSum, that is free from all the aforementioned constraints, and have evaluated it against several existing methods. With entity weights properly taken into account, SaddleSum is internally consistent and stable with respect to the number of most significant entities selected. Because it makes few assumptions about the input data, the proposed method is universal and can thus be applied to areas beyond the analysis of microarrays. Employing an asymptotic approximation, SaddleSum provides a term-size-dependent score distribution function that gives rise to accurate statistical significance even for terms with few entities. As a consequence, SaddleSum enables researchers to place confidence in its significance assignments to small terms, which are often the most biologically specific.
Our implementation, which uses Bonferroni correction to account for multiple hypotheses testing, is available at Source code for the standalone version can be downloaded from
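
To make the idea concrete, here is a minimal sketch of how a Lugannani-Rice saddlepoint approximation can turn a sum-of-weights score into a tail probability, followed by a Bonferroni correction. It is not the authors' implementation. It assumes a null model in which a term of size m has its weights drawn independently, with replacement, from the empirical distribution of all entity weights. The function names (`lugannani_rice_tail`, `bonferroni`) are hypothetical, and the continuous form of the formula is used without any lattice correction.

```python
import math

def _tilted_moments(weights, t):
    """Return K(t), K'(t), K''(t) for the empirical cumulant generating function."""
    shift = max(t * w for w in weights)            # subtract the max exponent to avoid overflow
    e = [math.exp(t * w - shift) for w in weights]
    z = sum(e)
    mean = sum(w * ei for w, ei in zip(weights, e)) / z
    var = sum((w - mean) ** 2 * ei for w, ei in zip(weights, e)) / z
    return shift + math.log(z / len(weights)), mean, var

def _solve_saddlepoint(weights, x, mean):
    """Solve K'(t) = x by bracketing and bisection (K' is increasing in t)."""
    if x > mean:
        lo, hi = 0.0, 1.0
        while _tilted_moments(weights, hi)[1] < x:
            lo, hi = hi, hi * 2.0
    else:
        lo, hi = -1.0, 0.0
        while _tilted_moments(weights, lo)[1] > x:
            lo, hi = lo * 2.0, lo
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if _tilted_moments(weights, mid)[1] < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lugannani_rice_tail(weights, m, score):
    """Approximate P(S >= score), where S is the sum of m i.i.d. draws from `weights`."""
    n = len(weights)
    w_min, w_max = min(weights), max(weights)
    mean = sum(weights) / n
    x = score / m                                   # per-entity average score
    if x >= w_max:                                  # at or beyond the largest attainable sum
        return (weights.count(w_max) / n) ** m if x == w_max else 0.0
    if x <= w_min:
        return 1.0
    if abs(x - mean) < 1e-9 * (w_max - w_min):
        return 0.5                                  # formula is singular at the mean; leading-order value
    t = _solve_saddlepoint(weights, x, mean)
    k, _, var = _tilted_moments(weights, t)
    w = math.copysign(math.sqrt(max(2.0 * m * (t * x - k), 0.0)), t)
    u = t * math.sqrt(m * var)
    phi = math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)
    p = 0.5 * math.erfc(w / math.sqrt(2.0)) + phi * (1.0 / u - 1.0 / w)
    return min(max(p, 0.0), 1.0)

def bonferroni(p_value, n_terms):
    """Bonferroni-adjusted P-value for n_terms simultaneously tested terms."""
    return min(1.0, p_value * n_terms)
```

For example, `bonferroni(lugannani_rice_tail(all_weights, m, score), n_terms)` would give an adjusted P-value for one term, where `all_weights`, `m`, `score` and `n_terms` are placeholders for the user's own data. Because the approximation depends on m, the score distribution is term-size dependent, which is the property the abstract highlights for small terms.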

  • Source
    ABSTRACT: CytoSaddleSum provides Cytoscape users with access to the functionality of SaddleSum, a functional enrichment tool based on sum-of-weight scores. It operates by querying SaddleSum locally (using the standalone version) or remotely (through an HTTP request to a web server). The functional enrichment results are shown as a term relationship network, where nodes represent terms and edges show term relationships. Furthermore, query results are written as Cytoscape attributes allowing easy saving, retrieval and integration into network-based data analysis workflows. Availability: The source code is placed in Public Domain. Contact: Supplementary information: Supplementary materials are available at Bioinformatics online.
    Bioinformatics 02/2012; 28(6):893-4. DOI:10.1093/bioinformatics/bts041 · 4.98 Impact Factor
  • Source
    ABSTRACT: In our previous publication, a framework for information flow in interaction networks based on random walks with damping was formulated with two fundamental modes: emitting and absorbing. While many other network analysis methods based on random walks or equivalent notions have been developed before and after our earlier work, one can show that they can all be mapped to one of the two modes. In addition to these two fundamental modes, a major strength of our earlier formalism was its accommodation of context-specific directed information flow that yielded plausible and meaningful biological interpretation of protein functions and pathways. However, the directed flow from origins to destinations was induced via a potential function that was heuristic. Here, with a theoretically sound approach called the channel mode, we extend our earlier work for directed information flow. This is achieved by constructing a potential function facilitating a purely probabilistic interpretation of the channel mode. For each network node, the channel mode combines the solutions of emitting and absorbing modes in the same context, producing what we call a channel tensor. The entries of the channel tensor at each node can be interpreted as the amount of flow passing through that node from an origin to a destination. Similarly to our earlier model, the channel mode encompasses damping as a free parameter that controls the locality of information flow. Through examples involving the yeast pheromone response pathway, we illustrate the versatility and stability of our new framework.
    Journal of computational biology: a journal of computational molecular cell biology 04/2012; 19(4):379-403. DOI:10.1089/cmb.2010.0228 · 1.74 Impact Factor
  • Source
    ABSTRACT: Inferring the underlying regulatory pathways within a gene interaction network is a fundamental problem in Systems Biology to help understand the complex interactions and the regulation and flow of information within a system-of-interest. Given a weighted gene network and a gene in this network, the goal of an inference algorithm is to identify the potential regulatory pathways passing through this gene. In a departure from previous approaches that largely rely on the random walk model, we propose a novel single-source k-shortest paths based algorithm to address this inference problem. An important element of our approach is to explicitly account for and enhance the diversity of paths discovered by our algorithm. The intuition here is that diversity in paths can help enrich different functions and thereby better position one to understand the underlying system-of-interest. Results on the yeast gene network demonstrate the utility of the proposed approach over extant state-of-the-art inference algorithms. Beyond utility, our algorithm achieves a significant speedup over these baselines. All data and codes are freely available upon request.
    Bioinformatics 06/2012; 28(12):i49-58. DOI:10.1093/bioinformatics/bts212 · 4.98 Impact Factor