Combining evidence using p-values: application to sequence homology searches

San Diego Supercomputer Center, CA 92186-9784, USA.
Bioinformatics (Impact Factor: 4.62). 02/1998; 14(1):48-54. DOI: 10.1093/bioinformatics/14.1.48
Source: PubMed

ABSTRACT

MOTIVATION: To illustrate an intuitive and statistically valid method for combining independent sources of evidence that yields a p-value for the complete evidence, and to apply it to the problem of detecting simultaneous matches to multiple patterns in sequence homology searches.

RESULTS: In sequence analysis, two or more (approximately) independent measures of the membership of a sequence (or sequence region) in some class are often available. We would like to estimate the likelihood of the sequence being a member of the class in view of all the available evidence. An example is estimating the significance of the observed match of a macromolecular sequence (DNA or protein) to a set of patterns (motifs) that characterize a biological sequence family. An intuitive way to do this is to express each piece of evidence as a p-value, and then use the product of these p-values as the measure of membership in the family. We derive a formula and algorithm (QFAST) for calculating the statistical distribution of the product of n independent p-values. We demonstrate that sorting sequences by this p-value effectively combines the information present in multiple motifs, leading to highly accurate and sensitive sequence homology searches.
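
The idea of turning a product of p-values into a single p-value can be illustrated with a short sketch. It assumes each p-value is independent and exactly uniform on (0, 1) under the null hypothesis, in which case the product of n such values has the closed-form distribution P(product <= p) = p * sum_{i=0}^{n-1} (-ln p)^i / i!. The function name `combined_p_value` is hypothetical, and this is not necessarily how the QFAST algorithm in the paper is organized; the abstract does not give those details.

```python
import math

def combined_p_value(p_values):
    """Probability that the product of n independent Uniform(0,1)
    variables is <= the observed product of the given p-values.

    Assumption: p-values are independent and continuous-uniform under
    the null hypothesis (an idealization; real motif scores are discrete).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 <= p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    if any(p == 0.0 for p in p_values):
        return 0.0  # product is zero, so the combined p-value is zero

    # Work with the log of the product to avoid underflow for many small p-values.
    log_product = sum(math.log(p) for p in p_values)
    x = -log_product  # x >= 0
    n = len(p_values)

    # Accumulate sum_{i=0}^{n-1} x^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, n):
        term *= x / i
        total += term

    return math.exp(log_product) * total

if __name__ == "__main__":
    # Two motif matches, each individually significant at the 0.05 level.
    print(combined_p_value([0.05, 0.05]))
```

Note that the combined value is larger than the raw product: multiplying p-values alone overstates significance, and the summation term corrects for that.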

  • ABSTRACT: Tools of molecular biology and the evolving tools of genomics can now be exploited to study the genetic regulatory mechanisms that control cellular responses to a wide variety of stimuli. These responses are highly complex, and involve many genes and gene products. The main objectives of this paper are to describe a novel research program centered on understanding these responses by: developing powerful graph algorithms that exploit the innovative principles of fixed parameter tractability in order to generate distilled gene sets; producing scalable, high performance parallel and distributed implementations of these algorithms utilizing cutting-edge computing platforms and auxiliary resources; employing these implementations to identify gene sets suggestive of co-regulation; and performing sequence analysis and genomic data mining to examine, winnow and highlight the most promising gene sets for more detailed investigation. As a case study, we describe our work aimed at elucidating genetic regulatory mechanisms that control cellular responses to low-dose ionizing radiation (IR). A low-dose exposure, as defined here, is an exposure of at most 10 cGy (rads). While the consequences of high doses of radiation are well known, the net outcome of low-dose exposures continues to be debated, with support in the literature for both detrimental and beneficial effects. We use genome-scale gene expression data collected in response to low-dose IR exposure in vivo to identify the pathways that are activated or repressed as a tissue responds to the radiation insult. The driving motivation is that knowledge of these pathways will help clarify and interpret physiological responses to IR, which will advance our understanding of the health consequences of low-dose radiation exposures.
  • ABSTRACT: Riboswitches present a ubiquitous genetic regulatory mechanism for prokaryotes and have been found in HIV1, fungi, plants, and even H. sapiens. We present an overview of approaches to predict riboswitch aptamers and, more generally, RNA conformational switches. © 2015 Elsevier Inc. All rights reserved.
  • ABSTRACT: Methylthioalkylmalate synthases (MAMs) encoded by MAM genes are central to the diversification of the glucosinolates, which are important secondary metabolites in Brassicaceae species. However, the evolutionary pathway of MAM genes is poorly understood. We analyzed the phylogenetic and synteny relationships of MAM genes from 13 sequenced Brassicaceae species. Based on these analyses, we propose that the syntenic loci of MAM genes, which underwent frequent tandem duplications, divided into two independent lineage-specific evolution routes and were driven by positive selection after the divergence from Aethionema arabicum. In the lineage I species Capsella rubella, Camelina sativa, Arabidopsis lyrata, and A. thaliana, the MAM loci evolved three tandem genes encoding enzymes responsible for the biosynthesis of aliphatic glucosinolates with different carbon chain-lengths. In lineage II species, the MAM loci encode enzymes responsible for the biosynthesis of short-chain aliphatic glucosinolates. Our proposed model of the evolutionary pathway of MAM genes will be useful for understanding the specific function of these genes in Brassicaceae species.
    Frontiers in Plant Science 02/2015; 6. DOI:10.3389/fpls.2015.00018 · 3.64 Impact Factor
