New finite-size correction for local alignment score distributions

BMC Research Notes 06/2012; 5(1):286. DOI: 10.1186/1756-0500-5-286
Source: PubMed


Local alignment programs often calculate the probability that a match occurred by chance. The calculation of this probability may require a “finite-size” correction to the lengths of the sequences, as an alignment that starts near the end of either sequence may run out of sequence before achieving a significant score.

We present an improved finite-size correction that considers the distribution of sequence lengths rather than simply the corresponding means. This approach improves sensitivity and avoids substituting an ad hoc length for short sequences, a substitution that can underestimate the significance of a match. We use a test set derived from ASTRAL to show improved ROC scores, especially for shorter sequences.
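To make the problem concrete, the following is a minimal Python sketch of the older, mean-based style of correction that the abstract contrasts itself with. It is not the new correction presented in the paper. It assumes the standard Karlin-Altschul form E = K·m·n·exp(−λS) and the commonly described length adjustment ℓ = ln(K·m·n)/H, where ℓ approximates the mean alignment length. The function names, the `floor` parameter, and the numeric parameter values are illustrative assumptions, not taken from the paper or from BLAST+ source code.

```python
import math


def length_adjustment(m, n, K, H):
    """Mean-based length adjustment (assumed form): ell = ln(K*m*n) / H."""
    return math.log(K * m * n) / H


def evalue(score, m, n, lam, K, H=None, floor=1.0):
    """Expected number of chance alignments scoring at least `score`.

    With H=None no finite-size correction is applied.
    With H given, both lengths are shortened by the mean-based adjustment.
    `floor` is a hypothetical ad hoc minimum length, needed because the
    adjusted length of a short sequence can drop below zero.
    """
    if H is not None:
        ell = length_adjustment(m, n, K, H)
        m = max(m - ell, floor)
        n = max(n - ell, floor)
    return K * m * n * math.exp(-lam * score)


# Illustrative statistical parameters (assumed values, for demonstration only).
LAM, K, H = 0.267, 0.041, 0.14

if __name__ == "__main__":
    db_len = 10**6
    for query_len in (50, 300):
        ell = length_adjustment(query_len, db_len, K, H)
        print(
            f"query={query_len} adjustment={ell:.1f} "
            f"adjusted_query={query_len - ell:.1f} "
            f"E_uncorrected={evalue(60, query_len, db_len, LAM, K):.3g} "
            f"E_corrected={evalue(60, query_len, db_len, LAM, K, H):.3g}"
        )
```

With these assumed parameters, the adjustment for a 50-residue query against a database of 10^6 residues is larger than the query itself, so the sketch has to fall back on the arbitrary `floor`. This is the kind of ad hoc substitution for short sequences that a correction based on the full distribution of alignment lengths is meant to avoid.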

The new finite-size correction improves the calculation of probabilities for a local alignment. It is now used in the BLAST+ package and at the NCBI BLAST web site (



Available from: John L Spouge, Oct 04, 2015
  • Source
    • "The search space size mn in Eq [1] is sharpened by edge effect (Altschul and Gish, 1996; Park et al., 2012). "
    ABSTRACT: Motivation: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. Results: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
    Bioinformatics 08/2014; 30(24). DOI:10.1093/bioinformatics/btu576 · 4.98 Impact Factor
  • Source
    • "Assembly of the cleaned 454 ESTs was performed with iAssembler v1.3.0 using default parameters: a maximum of 30bp long end clips, a minimum of 40bp overlap, and 97% identity (Zheng et al., 2011). Stand-alone BLAST (2.2.27+) from NCBI was used with a threshold e-value < 1 × 10^-10 (Park et al., 2012). Gene annotation was performed by BLASTX against Arabidopsis thaliana TAIR10 proteins (TAIR10_pep_20101214) (Lamesch et al., 2012)."
    ABSTRACT: Common bean (Phaseolus vulgaris) and black gram (Vigna mungo) accumulate γ-Glutamyl-S-methylcysteine and γ-Glutamyl-methionine in seed, respectively. Transcripts were profiled by 454 pyrosequencing data at a similar developmental stage coinciding with the beginning of the accumulation of these metabolites. Expressed sequence tags were assembled into Unigenes, which were assigned to specific genes in the early release chromosomal assembly of the P. vulgaris genome. Genes involved in multiple sulfur metabolic processes were expressed in both species. Expression of Sultr3 members was predominant in P. vulgaris, whereas expression of Sultr5 members predominated in V. mungo. Expression of the cytosolic SERAT1;1 and -1;2 was approximately fourfold higher in P. vulgaris while expression of the plastidic SERAT2;1 was twofold higher in V. mungo. Among BSAS family members, BSAS4;1, encoding a cytosolic cysteine desulfhydrase, and BSAS1;1, encoding a cytosolic O-acetylserine sulphydrylase were most highly expressed in both species. This was followed by BSAS3;1 encoding a plastidic β-cyanoalanine synthase which was more highly expressed by 10-fold in P. vulgaris. The data identify BSAS3;1 as a candidate enzyme for the biosynthesis of S-methylcysteine through the use of methanethiol as substrate instead of cyanide. Expression of GLC1 would provide a complete sequence leading to the biosynthesis of γ-Glutamyl-S-methylcysteine in plastids. The detection of S-methylhomoglutathione in P. vulgaris suggested that homoglutathione synthetase may accept, to some extent, γ-Glutamyl-S-methylcysteine as substrate, which might lead to the formation of S-methylated phytochelatins. In conclusion, 454 sequencing was effective at revealing differences in the expression of sulfur metabolic genes, providing information on candidate genes for the biosynthesis of distinct sulfur amino acid γ-Glutamyl dipeptides between P. vulgaris and V. mungo.
    Frontiers in Plant Science 03/2013; 4:60. DOI:10.3389/fpls.2013.00060 · 3.95 Impact Factor
  • Source
    ABSTRACT: Paired box (PAX) genes are transcription factors that play important roles in embryonic development. Although the PAX gene family occurs in animals only, it is widely distributed. Among the vertebrates, its 9 genes appear to be the product of complete duplication of an original set of 4 genes, followed by an additional partial duplication. Although some studies of PAX genes have been conducted, no comprehensive survey of these genes across the entire taxonomic unit has yet been attempted. In this study, we conducted a detailed comparison of PAX sequences from 188 chordates, which revealed restricted variation. The absence of PAX4 and PAX8 among some species of reptiles and birds was notable; however, all 9 genes were present in all 74 mammalian genomes investigated. A search for signatures of selection indicated that all genes are subject to purifying selection, with a possible constraint relaxation in PAX4, PAX7, and PAX8. This result indicates asymmetric evolution of PAX family genes, which can be associated with the emergence of adaptive novelties in the chordate evolutionary trajectory.
    PLoS ONE 09/2013; 8(9):e73560. DOI:10.1371/journal.pone.0073560 · 3.23 Impact Factor