Combining evidence using p-values: application to sequence homology searches.

San Diego Supercomputer Center, CA 92186-9784, USA.
Bioinformatics (Impact Factor: 4.62). 02/1998; 14(1):48-54. DOI: 10.1093/bioinformatics/14.1.48
Source: PubMed

ABSTRACT

MOTIVATION: To illustrate an intuitive and statistically valid method for combining independent sources of evidence that yields a p-value for the complete evidence, and to apply it to the problem of detecting simultaneous matches to multiple patterns in sequence homology searches.

RESULTS: In sequence analysis, two or more (approximately) independent measures of the membership of a sequence (or sequence region) in some class are often available. We would like to estimate the likelihood of the sequence being a member of the class in view of all the available evidence. An example is estimating the significance of the observed match of a macromolecular sequence (DNA or protein) to a set of patterns (motifs) that characterize a biological sequence family. An intuitive way to do this is to express each piece of evidence as a p-value, and then use the product of these p-values as the measure of membership in the family. We derive a formula and algorithm (QFAST) for calculating the statistical distribution of the product of n independent p-values. We demonstrate that sorting sequences by this p-value effectively combines the information present in multiple motifs, leading to highly accurate and sensitive sequence homology searches.
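
The idea of turning a product of independent p-values into a single p-value can be illustrated with a minimal sketch. It relies on the standard closed form for the product of n independent uniform(0, 1) variables, P(product <= p) = p * sum_{i=0}^{n-1} (-ln p)^i / i!. The abstract does not spell out the QFAST algorithm, so this should not be read as the authors' implementation; the function name `product_pvalue` is hypothetical.

```python
import math


def product_pvalue(pvalues):
    """Combined p-value for the product of independent p-values.

    Assumes each input is an independent p-value that is uniform on
    (0, 1] under the null hypothesis. Uses the closed form
        P(product <= p) = p * sum_{i=0}^{n-1} (-ln p)^i / i!
    This is an illustrative sketch, not the paper's QFAST routine.
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    n = len(pvalues)
    # Work with the log of the product to avoid underflow for many motifs.
    log_prod = sum(math.log(p) for p in pvalues)
    x = -log_prod  # x >= 0

    # Accumulate sum_{i=0}^{n-1} x^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, n):
        term *= x / i
        total += term

    return math.exp(log_prod) * total


if __name__ == "__main__":
    # Two motif matches, each individually at p = 0.05.
    print(product_pvalue([0.05, 0.05]))
```

Note that the raw product (0.0025 in the usage example) is smaller than the combined p-value, which is why the product itself cannot be reported as a significance level without this correction.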

  • ABSTRACT: Salvia miltiorrhiza is one of the most economically important medicinal plants. Cytochrome P450 (CYP450) genes have been implicated in the biosynthesis of its active components. However, only a dozen full-length CYP450 genes have been described, and there is no systematic classification of CYP450 genes in S. miltiorrhiza. We obtained 77,549 unigenes from three tissue types of S. miltiorrhiza using RNA-Seq technology. Combining our data with previously identified CYP450 sequences and scanning with the CYP450 model from Pfam resulted in the identification of 116 full-length and 135 partial-length CYP450 genes. The 116 genes were classified into 9 clans and 38 families using standard criteria. The RNA-Seq results showed that 35 CYP450 genes were co-expressed with CYP76AH1, a marker gene for tanshinone biosynthesis, using r≥0.9 as a cutoff. The expression profiles for 16 of 19 randomly selected CYP450 genes obtained from RNA-Seq were validated by qRT-PCR. Comparing against the KEGG database, 10 CYP450 genes were found to be associated with diterpenoid biosynthesis. Considering all the evidence, 3 CYP450 genes were identified to be potentially involved in terpenoid biosynthesis. Moreover, we found that 15 CYP450 genes were possibly regulated by antisense transcripts (r≥0.9 or r≤-0.9). Lastly, a web resource (SMCYP450) was set up, which allows users to browse, search, retrieve and compare CYP450 genes and can serve as a centralized resource.
    PLoS ONE 12/2014; 9(12):e115149. · 3.53 Impact Factor
  • ABSTRACT: Nonallelic homologous recombination (NAHR), occurring between low-copy repeats (LCRs) >10 kb in size and sharing >97% DNA sequence identity, is responsible for the majority of recurrent genomic rearrangements in the human genome. Recent studies have shown that transposable elements (TEs) can also mediate recurrent deletions and translocations, indicating the features of substrates that mediate NAHR may be significantly less stringent than previously believed. Using >4 kb length and >95% sequence identity criteria, we analyzed the genome-wide distribution of long interspersed element (LINE) retrotransposons and their potential to mediate NAHR. We identified 17 005 directly oriented LINE pairs located <10 Mbp from each other as potential NAHR substrates, placing 82.8% of the human genome at risk of LINE-LINE-mediated instability. Cross-referencing these regions with CNVs in the Baylor College of Medicine clinical chromosomal microarray database of 36 285 patients, we identified 516 CNVs potentially mediated by LINEs. Using long-range PCR of five different genomic regions in a total of 44 patients, we confirmed that the CNV breakpoints in each patient map within the LINE elements. To additionally assess the scale of the LINE-LINE/NAHR phenomenon in the human genome, we tested DNA samples from six healthy individuals on a custom aCGH microarray targeting LINE elements predicted to mediate CNVs and identified 25 LINE-LINE rearrangements. Our data indicate that LINE-LINE-mediated NAHR is widespread and under-recognized, and is an important mechanism of structural rearrangement contributing to human genomic variability.
    Nucleic Acids Research 01/2015; · 8.81 Impact Factor
  • ABSTRACT: Methylthioalkylmalate synthases (MAMs) encoded by MAM genes are central to the diversification of the glucosinolates, which are important secondary metabolites in Brassicaceae species. However, the evolutionary pathway of MAM genes is poorly understood. We analyzed the phylogenetic and synteny relationships of MAM genes from 13 sequenced Brassicaceae species. Based on these analyses, we propose that the syntenic loci of MAM genes, which underwent frequent tandem duplications, divided into two independent lineage-specific evolution routes and were driven by positive selection after the divergence from Aethionema arabicum. In the lineage I species Capsella rubella, Camelina sativa, Arabidopsis lyrata, and A. thaliana, the MAM loci evolved three tandem genes encoding enzymes responsible for the biosynthesis of aliphatic glucosinolates with different carbon chain-lengths. In lineage II species, the MAM loci encode enzymes responsible for the biosynthesis of short-chain aliphatic glucosinolates. Our proposed model of the evolutionary pathway of MAM genes will be useful for understanding the specific function of these genes in Brassicaceae species.
    Frontiers in Plant Science 02/2015; 6. · 3.64 Impact Factor
