Combining evidence using P‐values: application to sequence homology searches

San Diego Supercomputer Center, CA 92186-9784, USA.
Bioinformatics (Impact Factor: 4.98). 02/1998; 14(1):48-54. DOI: 10.1093/bioinformatics/14.1.48
Source: PubMed


MOTIVATION: To illustrate an intuitive and statistically valid method for combining independent sources of evidence that yields
a p-value for the complete evidence, and to apply it to the problem of detecting simultaneous matches to multiple patterns
in sequence homology searches.

RESULTS: In sequence analysis, two or more (approximately) independent measures of the membership
of a sequence (or sequence region) in some class are often available. We would like to estimate the likelihood of the sequence
being a member of the class in view of all the available evidence. An example is estimating the significance of the observed
match of a macromolecular sequence (DNA or protein) to a set of patterns (motifs) that characterize a biological sequence
family. An intuitive way to do this is to express each piece of evidence as a p-value, and then use the product of these p-values
as the measure of membership in the family. We derive a formula and algorithm (QFAST) for calculating the statistical distribution
of the product of n independent p-values. We demonstrate that sorting sequences by this p-value effectively combines the information
present in multiple motifs, leading to highly accurate and sensitive sequence homology searches.
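
The idea of scoring a sequence by the product of its p-values can be sketched as follows. The abstract does not spell out the QFAST algorithm, so this is not a reproduction of it. It uses the standard closed form for the product of n independent p-values that are uniform on [0, 1] under the null hypothesis: P(product <= p) = p * sum_{i=0}^{n-1} (-ln p)^i / i!. The function names `product_pvalue` and `rank_by_combined_pvalue`, and the sequence identifiers in the usage note, are hypothetical and chosen only for illustration.

```python
import math


def product_pvalue(pvalues):
    """Combined p-value for the product of independent p-values.

    Assumes each p-value is Uniform(0, 1) under the null hypothesis and that
    the p-values are independent. Illustrative sketch, not the paper's QFAST.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 <= p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in [0, 1]")
    if any(p == 0.0 for p in pvalues):
        return 0.0

    # Work in log space so that very small products do not underflow.
    x = -sum(math.log(p) for p in pvalues)  # x = -ln(product) >= 0

    # Accumulate sum_{i=0}^{n-1} x^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, n):
        term *= x / i
        total += term

    # exp(-x) is the product itself.
    return math.exp(-x) * total


def rank_by_combined_pvalue(motif_pvalues_by_sequence):
    """Sort sequences from most to least significant combined p-value."""
    scored = [
        (seq_id, product_pvalue(pvals))
        for seq_id, pvals in motif_pvalues_by_sequence.items()
    ]
    return sorted(scored, key=lambda item: item[1])
```

For example, `rank_by_combined_pvalue({"seqA": [0.5, 0.5], "seqB": [0.01, 0.02]})` places `seqB` first, because its two motif matches are jointly far less likely under the null than those of `seqA`. Note that the raw product is not itself a p-value: with a single p-value the function returns it unchanged, while with more p-values the correction factor grows, which is why the distribution of the product is needed.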



Available from: Michael Gribskov
  • Source
    • "(7) Each time a motif is matched to a position in the protein sequence, a p-value is calculated that represents the probability of finding a match as good as the observed match within a random sequence. The p-values for all motifs in a single sequence are then combined using QFAST to obtain the final statistical significance score (final p-value) (Bailey and Gribskov, 1998a). (8) The protein information (including accession numbers, annotations, and species), final p-value, and sequence fragments matched to each queried motif are exported for all sequences with a final p-value more significant than a user-selected p-value. "
    ABSTRACT: Peroxiredoxins are cysteine-dependent peroxide reductases that group into 6 different, structurally discernable classes. In 2011, our research team reported the application of a bioinformatic approach called active site profiling to extract active site-proximal sequence segments from the 29 distinct, structurally-characterized peroxiredoxins available at the time. These extracted sequences were then used to create unique profiles for the six groups which were subsequently used to search GenBank(nr), allowing identification of ~3500 peroxiredoxin sequences and their respective subgroups. Summarized in this minireview are the features and phylogenetic distributions of each of these peroxiredoxin subgroups; an example is also provided illustrating the use of the web accessible, searchable database known as PREX to identify subfamily-specific peroxiredoxin sequences for the organism Vitis vinifera (grape).
    Preview · Article · Jan 2016 · Molecules and Cells
  • Source
    • ". Variable width motifs for non-regulated, destabilized and stabilized genes. 3'-UTR analysis was using the MEME software analysis package [1] [2] with parameters adjusted for variable-width motifs (7 – 15 nt) on non-regulated, destabilized, and stabilized transcripts. Significant motifs from MEME analysis are shown for destabilized, stabilized, and non-regulated datasets. "

    Full-text · Dataset · Jan 2016
  • Source
    • "The parameters for the analysis were as follows: number of repetitions, 0 or 1; maximum number of motifs, 14; and optimum motif width, 6–100. The MAST program (Bailey and Gribskov, 1998) was used to search for each of the motifs in the AOP sequences. "
    ABSTRACT: The glucosinolate biosynthetic gene AOP2 encodes an enzyme that plays a crucial role in catalysing the conversion of beneficial glucosinolates into anti-nutritional ones. In Brassica rapa, three copies of BrAOP2 have been identified, but their function in establishing the glucosinolate content of B. rapa is poorly understood. Here, we used phylogenetic and gene structure analyses to show that BrAOP2 proteins have evolved via a duplication process retaining two highly conserved domains at the N-terminal and C-terminal regions, while the middle part has experienced structural divergence. Heterologous expression and in vitro enzyme assays and Arabidopsis mutant complementation studies showed that all three BrAOP2 genes encode functional BrAOP2 proteins that convert the precursor methylsulfinyl alkyl glucosinolate to the alkenyl form. Site-directed mutagenesis showed that His356, Asp310, and Arg376 residues are required for the catalytic activity of one of the BrAOP2 proteins (BrAOP2.1). Promoter-β-glucuronidase lines revealed that the BrAOP2.3 gene displayed an overlapping but distinct tissue- and cell-specific expression profile compared with that of the BrAOP2.1 and BrAOP2.2 genes. Quantitative real-time reverse transcription-PCR assays demonstrated that BrAOP2.1 showed a slightly different pattern of expression in below-ground tissue at the seedling stage and in the silique at the reproductive stage compared with BrAOP2.2 and BrAOP2.3 genes in B. rapa. Taken together, our results revealed that all three BrAOP2 paralogues are active in B. rapa but have functionally diverged. © The Author 2015. Published by Oxford University Press on behalf of the Society for Experimental Biology.
    Full-text · Article · Jul 2015 · Journal of Experimental Botany