Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment

Department of Computer Science, George Washington University, Washington, DC 20052, USA.
Journal of Computational Biology (Impact Factor: 1.74). 04/2007; 14(2):113-30. DOI: 10.1089/cmb.2006.0130
Source: PubMed


As the demand for accurately aligning gene sequences to the genome of a related species grows with the sequencing of new genomes, spaced seeds emerge as a promising vehicle for increasing alignment sensitivity. We extend the existing {0, 1} match-mismatch models for sensitivity evaluation to take into account the compositional structure of coding sequences and ultimately produce seeds better suited to this particular application. Designing seeds for alignment programs, however, requires balancing sensitivity and specificity. We assess the effects of seed variations on both sensitivity and specificity in an extended model that incorporates transitions and differentiates among the three codon positions, and show that spaced seeds with transitions offer a better sensitivity-specificity tradeoff. Furthermore, we propose a theoretical formulation for rigorously assessing seed specificity, starting from Bernoulli and Markov models of the mRNA and genomic sequences. Within this framework, we perform the first comprehensive analysis of seeds to serve as a blueprint for selecting sensitive and specific seeds for practical applications. Our analyses show that specificity is relatively constant for seeds of a given weight, while sensitivity varies widely, with the highest values attained by seeds allowing a small (2-6) number of transitions. A strategy for designing seeds, therefore, is to first select the weight of the seed by identifying the desired sensitivity-specificity tradeoff, then choose the most sensitive seed(s) within that weight group. We illustrate our methods with the alignment of chicken coding sequences against the human genome assembly version HG17.

  • Source
    • "For a spaced seed containing wildcard positions, it is the number of occurrences of all words w′ that are compatible with w via seed S. For instance, for the word AAGCT and the seed S = 1x111, the set of compatible words is {AAGCT, AGGCT}. In [24] we derived a closed formula and recurrences to efficiently calculate seed specificity for the case of Bernoulli and Markov models of gene sequences and for a Bernoulli or order 1 Markov model of the genome."
    ABSTRACT: We review recent developments in spaced seed design for cross-species sequence alignment. We start with a brief overview of original ideas and early techniques, and then focus on more recent work on finding accurate (sensitive and specific) seeds for cross-species cDNA-to-genome alignment. These recent developments include methods and models for estimating seed specificity and determining sensitive and specific seeds, finding seeds that can be applied to a wide range of comparisons, and applying seed models to other computational biology areas, such as gene finding.
    01/2010; 10:115-136. DOI:10.4310/CIS.2010.v10.n2.a4
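The compatible-word example quoted above (word AAGCT, seed 1x111) can be reproduced with a minimal sketch. It assumes that `x` marks a position where either a match or a transition (A<->G, C<->T) is accepted, which is consistent with the two-word set given in the excerpt; the `0` symbol for a full wildcard and the function names `compatible_words` and `count_occurrences` are hypothetical additions for illustration, not taken from the cited work.

```python
from itertools import product

# Transition partner of each base (purine<->purine, pyrimidine<->pyrimidine).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def compatible_words(word, seed):
    """List the words compatible with `word` under `seed`.

    Assumed seed alphabet (illustrative only):
      '1' = exact match required
      'x' = match or transition accepted
      '0' = any base accepted
    """
    if len(word) != len(seed):
        raise ValueError("word and seed must have the same length")
    choices = []
    for base, symbol in zip(word, seed):
        if symbol == "1":
            choices.append([base])
        elif symbol == "x":
            choices.append(sorted({base, TRANSITION[base]}))
        elif symbol == "0":
            choices.append(list("ACGT"))
        else:
            raise ValueError(f"unknown seed symbol: {symbol!r}")
    return ["".join(p) for p in product(*choices)]


def count_occurrences(word, seed, genome):
    """Count genome windows that equal any word compatible with `word`."""
    targets = set(compatible_words(word, seed))
    k = len(word)
    return sum(genome[i:i + k] in targets for i in range(len(genome) - k + 1))


print(compatible_words("AAGCT", "1x111"))  # ['AAGCT', 'AGGCT']
```

Counting such occurrences by scanning, as `count_occurrences` does on a toy string, is only a brute-force stand-in for intuition; the excerpt refers to closed formulas and recurrences under Bernoulli and Markov models, which this sketch does not implement.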
  • Source
    • "Unlike continuous seeds, which require an exact match of k contiguous bases and are represented as vectors of 1s, spaced seeds allow for some positions in the seed pattern to vary, for instance the seed 101100001011 has the wildcard positions 2, 5, 6, 7, 8 and 10. The number of 1s in the seed is called the seed weight and controls the specificity (22). The length of the seed is called span. "
    ABSTRACT: Advances in sequencing technologies have accelerated the sequencing of new genomes, far outpacing the generation of gene and protein resources needed to annotate them. Direct comparison and alignment of existing cDNA sequences from a related species is an effective and readily available means to determine genes in the new genomes. Current spliced alignment programs are inadequate for comparing sequences between different species, owing to their low sensitivity and splice junction accuracy. A new spliced alignment tool, sim4cc, overcomes problems in the earlier tools by incorporating three new features: universal spaced seeds, to increase sensitivity and allow comparisons between species at various evolutionary distances, and powerful splice signal models and evolutionarily-aware alignment techniques, to improve the accuracy of gene models. When tested on vertebrate comparisons at diverse evolutionary distances, sim4cc had significantly higher sensitivity compared to existing alignment programs, more than 10% higher than the closest competitor for some comparisons, while being comparable in speed to its predecessor, sim4. Sim4cc can be used in one-to-one or one-to-many comparisons of genomic and cDNA sequences, and can also be effectively incorporated into a high-throughput annotation engine, as demonstrated by the mapping of 64,000 Fagus grandifolia 454 ESTs and unigenes to the poplar genome.
    Nucleic Acids Research 06/2009; 37(11):e80. DOI:10.1093/nar/gkp319 · 9.11 Impact Factor
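The seed terminology in the excerpt above (weight, span, wildcard positions) can be checked with a short sketch on the quoted seed 101100001011. The helper names `describe_seed` and `seed_matches` are hypothetical and not part of sim4cc; positions are reported 1-based to match the excerpt.

```python
def describe_seed(seed):
    """Return (span, weight, wildcard_positions) for a 0/1 seed string.

    span     = total length of the seed pattern
    weight   = number of '1' (must-match) positions
    wildcard = 1-based indices of the '0' positions
    """
    span = len(seed)
    weight = seed.count("1")
    wildcards = [i for i, s in enumerate(seed, start=1) if s == "0"]
    return span, weight, wildcards


def seed_matches(seed, a, b):
    """True if equal-length sequences a and b agree at every '1' position."""
    if not (len(seed) == len(a) == len(b)):
        raise ValueError("seed and both sequences must have the same length")
    return all(x == y for s, x, y in zip(seed, a, b) if s == "1")


span, weight, wildcards = describe_seed("101100001011")
print(span, weight, wildcards)  # 12 6 [2, 5, 6, 7, 8, 10]

# A difference at wildcard position 2 still yields a hit ...
print(seed_matches("101100001011", "ACGTACGTACGT", "ATGTACGTACGT"))  # True
# ... while a difference at must-match position 1 does not.
print(seed_matches("101100001011", "ACGTACGTACGT", "CCGTACGTACGT"))  # False
```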
  • Source
    • "This leads to the study of the transition seeds that contain fixed match and transition positions. Transition seeds exhibit a good tradeoff between sensitivity and specificity for homology search in both coding and non-coding regions (Sun and Buhler, 2006; Zhou and Florea, 2007). However, identifying good transition seeds is a difficult task. "
    ABSTRACT: Transition seeds exhibit a good tradeoff between sensitivity and specificity for homology search in both coding and non-coding regions. However, identifying good transition seeds is intractable. This work studies the hit probability of high-order seed patterns. Based on our theoretical results, we propose an efficient method for ranking transition seeds for seed design and list good seeds in different Bernoulli sequence models.
    Journal of computational biology: a journal of computational molecular cell biology 01/2009; 15(10):1295-313. DOI:10.1089/cmb.2007.0209 · 1.74 Impact Factor
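To make the notion of hit probability under a Bernoulli model concrete, the sketch below computes the probability that a seed hits at least once in an alignment of length n. It is a brute-force dynamic program over the last span-1 alignment columns, not the ranking method of the cited work. The assumptions are that each column is independently a match (M), a transition (T), or a transversion (V), and that seed symbols mean `1` = match only, `x` = match or transition, `0` = anything. The function name `hit_probability` and its parameters are hypothetical.

```python
from collections import defaultdict


def hit_probability(seed, n, p_match, p_transition=0.0):
    """Probability that `seed` hits at least once in n independent columns.

    State space grows as 3**(len(seed) - 1), so this is for short seeds only.
    """
    span = len(seed)
    if n < span:
        return 0.0
    outcomes = {"M": p_match, "T": p_transition,
                "V": 1.0 - p_match - p_transition}
    accepts = {"1": {"M"}, "x": {"M", "T"}, "0": {"M", "T", "V"}}

    def is_hit(window):
        return all(o in accepts[s] for s, o in zip(seed, window))

    # no_hit[suffix] = probability that the columns seen so far end in
    # `suffix` (at most span-1 columns) and contain no seed hit.
    no_hit = {(): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for suffix, prob in no_hit.items():
            for outcome, q in outcomes.items():
                if q <= 0.0:
                    continue
                window = suffix + (outcome,)
                if len(window) == span and is_hit(window):
                    continue  # this path now contains a hit; drop it
                key = window[-(span - 1):] if span > 1 else ()
                nxt[key] += prob * q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


# Compare three seeds on the same toy model (values are illustrative inputs).
for s in ("1111", "11011", "11x11"):
    print(s, round(hit_probability(s, 64, p_match=0.7, p_transition=0.15), 4))
```

The three seeds in the usage loop differ in weight as well as shape, so the printed values show how hit probability responds to the pattern rather than a like-for-like ranking. Comparing seeds fairly would also require a specificity estimate for each.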