SOAPindel: Efficient identification of indels from short paired reads

BGI Shenzhen, Shenzhen 518000, China
Genome Research (Impact Factor: 14.63). 09/2012; 23(1). DOI: 10.1101/gr.132480.111
Source: PubMed


We present a new approach to indel calling that explicitly exploits the fact that indel differences between a reference and a sequenced sample make the mapping of reads less efficient. We assign all unmapped reads with a mapped partner to their expected genomic positions and then perform extensive de novo assembly on the regions with many unmapped reads to resolve homozygous, heterozygous and complex indels by exhaustive traversal of the de Bruijn graph. The method is implemented in the software SOAPindel and provides a list of candidate indels with quality scores. We compare SOAPindel to Dindel, Pindel and GATK on simulated data and find similar or better performance for short indels (<10 bp) and higher sensitivity and specificity for long indels. A validation experiment suggests that SOAPindel has a false positive rate around 10% for long indels (>5 bp) while still providing many more candidate indels than other approaches.
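The two steps named in the abstract (placing an unmapped read at the position implied by its mapped partner, then exhaustively traversing a de Bruijn graph built from the local reads) can be illustrated with a minimal sketch. This is not SOAPindel's implementation: the function names, the fixed insert size, the k-mer length and the toy reads below are all assumptions chosen for illustration.

```python
from collections import defaultdict


def expected_mate_position(mate_pos, mate_is_forward, insert_size, read_len):
    """Estimate the leftmost coordinate of an unmapped read from its mapped
    partner (hypothetical helper; assumes a fixed insert size)."""
    if mate_is_forward:
        # Partner starts the fragment, so the unmapped read ends the fragment.
        return mate_pos + insert_size - read_len
    # Partner ends the fragment, so the unmapped read starts the fragment.
    return max(0, mate_pos + read_len - insert_size)


def build_de_bruijn(reads, k):
    """Nodes are k-mers; an edge joins k-mers that are adjacent in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def enumerate_paths(graph, start, end, max_len):
    """Exhaustively list the sequences spelled by simple paths from the
    start k-mer to the end k-mer, up to max_len bases."""
    results = []
    stack = [(start, start, {start})]
    while stack:
        node, seq, seen = stack.pop()
        if node == end:
            results.append(seq)
            continue
        if len(seq) >= max_len:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:  # simple paths only, so cycles terminate
                stack.append((nxt, seq + nxt[-1], seen | {nxt}))
    return sorted(results)


# Toy heterozygous site: one read carries the reference allele and the
# other carries a 3 bp deletion ("CCG") relative to it.
reads = ["ACGTACCGGATTC", "ACGTAGATTC"]
graph = build_de_bruijn(reads, k=4)
paths = enumerate_paths(graph, start="ACGT", end="ATTC", max_len=20)
for allele in paths:
    print(allele)
```

Each path between the two flanking k-mers spells one candidate allele, so a site with two distinct paths is consistent with a heterozygous indel. A real caller would additionally weight paths by k-mer coverage and align each path back to the reference to report indel coordinates and quality scores.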

  • "The lengths of the detected InDels were within the range 1-5 bp. Gaps supported by ≥3 PE reads were retained using SOAPindel (Li et al., 2013). According to the principle of PE sequencing, 1 PE read should be aligned to the forward sequence and another should be aligned to the reverse in normal situations."
    ABSTRACT: Genome-wide re-sequencing of the Zhenshan 97 (ZS97) and Milyang 46 (MY46) parents of an elite three-line hybrid rice developed in China resulted in the generation of 9.91 G bases of data with an effective sequencing depth of 11.66x and 11.51x, respectively. Detection of genome-wide DNA polymorphisms, single nucleotide polymorphisms (SNPs), short insertions/deletions (InDels; 1-5 bp), and structural variations (SVs), which is an invaluable variation resource for genetic research and molecular marker-assisted breeding, was conducted by comparing whole-genome re-sequencing data. A total of 364,488 SNPs, 61,181 InDels and 6298 SVs were detected in ZS97 and 364,179 SNPs, 61,984 InDels and 6408 SVs were detected in MY46 compared to the 9311 reference sequence. Synteny analysis of the variation revealed a total of 77,013 identical and 181,737 different SNPs and 15,021 identical and 1205 different InDels between ZS97 and MY46, respectively. A total of 180 InDels 3-8 bp in length between ZS97 and MY46 were selected for experimental validation; 160 polymerase chain reaction products were efficiently separated on 6% non-denaturing polyacrylamide gels. Identification of genome-wide variation among the parents of the elite hybrid as well as the set of 160 polymerase chain reaction-based InDel markers will facilitate future genetic studies and the molecular breeding of hybrid rice.
    Genetics and molecular research: GMR 04/2015; 14(2):3209-3222. DOI:10.4238/2015.April.10.33 · 0.78 Impact Factor
  • ABSTRACT: The scale of tumor genomic profiling is rapidly outpacing human cognitive capacity to make clinical decisions without the aid of tools. New frameworks are needed to help researchers and clinicians process the information emerging from the explosive growth in both the number of tumor genetic variants routinely tested and the respective knowledge to interpret their clinical significance. We review the current state, limitations, and future trends in methods to support the clinical analysis and interpretation of cancer genomes. This includes the processes of genome-scale variant identification, including tools for sequence alignment, tumor-germline comparison, and molecular annotation of variants. The process of clinical interpretation of tumor variants includes classification of the effect of the variant, reporting the results to clinicians, and enabling the clinician to make a clinical decision based on the genomic information integrated with other clinical features. We describe existing knowledge bases, databases, algorithms, and tools for identification and visualization of tumor variants and their actionable subsets. With the decreasing cost of tumor gene mutation testing and the increasing number of actionable therapeutics, we expect the methods for analysis and interpretation of cancer genomes to continue to evolve to meet the needs of patient-centered clinical decision making. The science of computational cancer medicine is still in its infancy; however, there is a clear need to continue the development of knowledge bases, best practices, tools, and validation experiments for successful clinical implementation in oncology.
    Journal of Clinical Oncology 04/2013; 31(15). DOI:10.1200/JCO.2013.48.7215 · 18.43 Impact Factor
  • ABSTRACT: MOTIVATION: Simple tandem repeats are highly variable genetic elements and widespread in genomes of many organisms. Next-generation sequencing technologies have enabled a robust comparison of large numbers of simple tandem repeat loci; however, analysis of their variation using traditional sequence analysis approaches still remains limiting and problematic due to variants occurring in repeat sequences confusing alignment programs into mapping sequence reads to incorrect loci when the sequence reads are significantly different from the reference sequence. RESULTS: We have developed a program, ReviSTER, which is an automated pipeline using a "local mapping reference reconstruction method" to revise mismapped or partially misaligned reads at simple tandem repeat loci. ReviSTER estimates alleles of repeat loci using a local alignment method and creates temporary local mapping reference sequences, and finally remaps reads to the local mapping references. Using this approach, ReviSTER was able to successfully revise reads misaligned to repeat loci from both simulated data and real data. AVAILABILITY: ReviSTER is open-source software available at CONTACT:; SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
    Bioinformatics 05/2013; 29(14). DOI:10.1093/bioinformatics/btt277 · 4.98 Impact Factor