SOAPindel: Efficient identification of indels from short paired reads

BGI Shenzhen, Shenzhen 518000, China
Genome Research (Impact Factor: 14.63). 09/2012; 23(1). DOI: 10.1101/gr.132480.111


We present a new approach to indel calling that explicitly exploits the fact that indel differences between a reference and a sequenced sample make the mapping of reads less efficient. We assign all unmapped reads with a mapped partner to their expected genomic positions and then perform extensive de novo assembly on the regions with many unmapped reads to resolve homozygous, heterozygous, and complex indels by exhaustive traversal of the de Bruijn graph. The method is implemented in the software SOAPindel, which provides a list of candidate indels with quality scores. We compare SOAPindel to Dindel, Pindel, and GATK on simulated data and find similar or better performance for short indels (<10 bp) and higher sensitivity and specificity for long indels. A validation experiment suggests that SOAPindel has a false positive rate of around 10% for long indels (>5 bp) while still providing many more candidate indels than other approaches.
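The assembly step described above can be illustrated with a minimal sketch. It is not SOAPindel's implementation: the function names (`build_de_bruijn`, `enumerate_paths`, `call_indels`), the choice of the first and last k-1 bases of the reference window as anchors, and the length cap `max_extra` are all assumptions made for illustration. The sketch also ignores sequencing errors and quality scoring. It builds a de Bruijn graph from the reads placed in a window, enumerates every path between the two anchors, and reports each path whose length differs from the reference window as a candidate indel.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, start, end, max_len):
    """Exhaustively list the sequences spelled by paths from start to end.

    max_len caps the spelled sequence length so that cycles in the graph
    cannot cause unbounded traversal.
    """
    results = []
    stack = [(start, start)]
    while stack:
        node, seq = stack.pop()
        if node == end:
            results.append(seq)
            continue
        if len(seq) >= max_len:
            continue
        for nxt in sorted(graph.get(node, ())):
            stack.append((nxt, seq + nxt[-1]))
    return results


def call_indels(reference, reads, k=4, max_extra=10):
    """Report assembled paths whose length differs from the reference window.

    Assumption: the first and last (k-1) bases of the window are shared by
    the reference and the sample and serve as anchors.
    """
    start, end = reference[:k - 1], reference[-(k - 1):]
    graph = build_de_bruijn(reads, k)
    calls = []
    for seq in enumerate_paths(graph, start, end, len(reference) + max_extra):
        diff = len(seq) - len(reference)
        if diff > 0:
            calls.append(("insertion", diff, seq))
        elif diff < 0:
            calls.append(("deletion", -diff, seq))
    return calls


# Toy example: the sample lacks two bases present in the reference window.
reference = "ATGGCGTGCAATCC"
reads = ["ATGGCGCA", "GCGCAATCC"]
print(call_indels(reference, reads))
# prints [('deletion', 2, 'ATGGCGCAATCC')]
```

Adding reads that support the reference allele (for example, `reads + [reference]`) creates a branch in the graph, so the traversal returns both the reference path and the deletion path. Only the latter is reported, which is how a heterozygous site would appear in this simplified setting.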

  • ABSTRACT: Over the last several years, next-generation sequencing (NGS) has transformed genomic research through substantial advances in technology and reduction in the cost of sequencing, and also in the systems required for analysis of these large volumes of data. This technology is now being used as a standard molecular diagnostic test under particular circumstances in some clinical settings. The advances in sequencing have come so rapidly that the major bottleneck in identification of causal variants is no longer the sequencing but rather the analysis and interpretation. Interpretation of genetic findings in a clinical setting is scarcely a new challenge, but the task is increasingly complex in clinical genome-wide sequencing given the dramatic increase in dataset size and complexity. This increase requires the development of novel or repositioned analysis tools, methodologies, and processes. This unit provides an overview of these items. Specific challenges related to implementation in a clinical setting are discussed. Curr. Protoc. Hum. Genet. 79:9.24.1-9.24.24. © 2013 by John Wiley & Sons, Inc.
    Current protocols in human genetics / editorial board, Jonathan L. Haines ... [et al.] 01/2013; 79:9.24.1-9.24.24. DOI:10.1002/0471142905.hg0924s79
  • ABSTRACT: The scale of tumor genomic profiling is rapidly outpacing human cognitive capacity to make clinical decisions without the aid of tools. New frameworks are needed to help researchers and clinicians process the information emerging from the explosive growth in both the number of tumor genetic variants routinely tested and the respective knowledge to interpret their clinical significance. We review the current state, limitations, and future trends in methods to support the clinical analysis and interpretation of cancer genomes. This includes the processes of genome-scale variant identification, including tools for sequence alignment, tumor-germline comparison, and molecular annotation of variants. The process of clinical interpretation of tumor variants includes classification of the effect of the variant, reporting the results to clinicians, and enabling the clinician to make a clinical decision based on the genomic information integrated with other clinical features. We describe existing knowledge bases, databases, algorithms, and tools for identification and visualization of tumor variants and their actionable subsets. With the decreasing cost of tumor gene mutation testing and the increasing number of actionable therapeutics, we expect the methods for analysis and interpretation of cancer genomes to continue to evolve to meet the needs of patient-centered clinical decision making. The science of computational cancer medicine is still in its infancy; however, there is a clear need to continue the development of knowledge bases, best practices, tools, and validation experiments for successful clinical implementation in oncology.
    Journal of Clinical Oncology 04/2013; 31(15). DOI:10.1200/JCO.2013.48.7215 · 18.43 Impact Factor
  • ABSTRACT: MOTIVATION: Simple tandem repeats are highly variable genetic elements and widespread in genomes of many organisms. Next-generation sequencing technologies have enabled a robust comparison of large numbers of simple tandem repeat loci; however, analysis of their variation using traditional sequence analysis approaches still remains limiting and problematic due to variants occurring in repeat sequences confusing alignment programs into mapping sequence reads to incorrect loci when the sequence reads are significantly different from the reference sequence. RESULTS: We have developed a program, ReviSTER, which is an automated pipeline using a "local mapping reference reconstruction method" to revise mismapped or partially misaligned reads at simple tandem repeat loci. ReviSTER estimates alleles of repeat loci using a local alignment method and creates temporary local mapping reference sequences, and finally remaps reads to the local mapping references. Using this approach, ReviSTER was able to successfully revise reads misaligned to repeat loci from both simulated data and real data. AVAILABILITY: ReviSTER is open-source software available at CONTACT:; SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
    Bioinformatics 05/2013; 29(14). DOI:10.1093/bioinformatics/btt277 · 4.98 Impact Factor