Automating resequencing-based detection of
Tushar R Bhangale1,2, Matthew Stephens2,3 & Deborah A Nickerson1,2
Structural and insertion-deletion (indel) variants have
received considerable recent attention, partly because of their
phenotypic consequences. Among these variants, the most
common are small indels (~1–30 bp). Identifying and
genotyping indels using sequence traces obtained from diploid
samples requires extensive manual review, which makes
large-scale studies inconvenient. We report a new algorithm,
implemented in available software (PolyPhred version 6.0),
to help automate detection and genotyping of indels from
sequence traces. The algorithm identifies heterozygous
individuals, which permits the discovery of low-frequency
indels. It finds 80% of all indel polymorphisms with almost
no false positives and finds 97% with a false discovery rate of
10%. Additionally, genotyping accuracy exceeds 99%, and
it correctly infers indel length in 96% of the cases. Using this
approach, we identify indels in the HapMap ENCODE regions,
providing the first report of these polymorphisms in this
Recent studies have started to catalog the large number of structural
and indel variants present in human populations1–6. Of these, the
most common are small (~1- to 30-bp) indel polymorphisms7. Small
indels are important both because of their relative abundance (they are
the second most frequent type of polymorphism in humans after
nucleotide substitutions) and their functional significance: indels in
coding regions can cause severe disruptions in coding sequences8,9,
and indels in promoter regions can alter transcriptional activity10,11.
Indeed, small indels currently constitute ~24% of all disease-causing
mutations reported at the Human Gene Mutation Database12 (as of
August 2006). As the allele frequency spectrum and linkage disequilibrium
(LD) characteristics of indels are similar to those of substitutions5,7,
indels can improve the resolution of genetic maps to uncover a more
detailed picture of sequence variation and LD in any region and can
have a valuable role in the mapping of complex diseases and traits.
Despite the abundance and potential functional importance of these
small indel polymorphisms, no efficient high-throughput technologies
currently exist to automatically identify and genotype them in
population samples. In principle, as for substitutions13–15, these
tasks can be accomplished by fluorescence-based resequencing. In
particular, individuals heterozygous for an indel allele can be reliably
identified from the complex pattern of multiple heterozygous peaks
(that is, the presence of a peak with a ~50% drop in height compared
with a homozygote along with the presence of a second peak of similar
height corresponding to the alternate allele) that occur because of
mismatches in the two allelic sequences downstream of an indel7. This
detection of heterozygotes has a central role in comprehensively
detecting diallelic polymorphisms because for lower-frequency variants,
samples will often not include homozygotes for both alleles. In
addition, the pattern of peaks in heterozygotes can be used to identify
the inserted or deleted segment relative to a reference sequence. Thus,
with recent advances in high-throughput sequencing technology and
the rapid increase in resequencing-based polymorphism discovery16–18,
there exists an opportunity for large-scale identification of small
indel polymorphisms. However, although indels can be effectively
identified and genotyped manually using this pattern7, for large-scale
applications, it is impractical to manually examine every
trace. Although existing software tools such as novoSNP15, InSNP19 and
Mutation Surveyor (Softgenetics) help to automate this process,
these approaches still require extensive manual review of the identi-
In this report, we describe a new algorithm to help automate the
identification and genotyping of small (diallelic) indels from sequence
trace data. The method detects heterozygous indel patterns using a
statistical analysis of the base calls, quality and peak height data
obtained from raw sequence traces. In our tests, it is able to identify
80% of indels entirely automatically (without any false positives) and
97% of indels at a false discovery rate (that is, the proportion of false
positives among the positive discoveries) of 0.1. Its genotyping
accuracy exceeds 99%, and it can correctly infer the indel length in
96% of sites. The algorithm, implemented in a software package (PolyPhred
version 6.0), is available from http://droog.mbt.washington.edu/PolyPhred.html.
We applied the method to analyze sequence trace
data from the ten ENCODE regions, generated as part of the HapMap
project, and our method identified 1,244 potential new indel polymorphisms,
1,126 of which (91%) we confirmed to be indels upon
manual inspection of the traces. The manual confirmation process
for 5 Mb of reference sequence took one person roughly 30 h,
demonstrating the potential for large-scale application.
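As an illustration of the heterozygous indel pattern described above, the following Python sketch first flags the trace position at which a sustained run of double peaks begins and then estimates the indel length from the offset between the two superimposed allelic sequences. It is not the PolyPhred algorithm, which relies on a statistical analysis of base calls, quality and peak heights; the function names, the fixed thresholds (`min_ratio`, `window`, `min_fraction`) and the toy data are assumptions chosen only for illustration, and the sketch further assumes a forward read already aligned to the reference, with one allele carrying a deletion.

```python
from typing import Optional, Sequence, Tuple


def find_het_indel_onset(
    primary: Sequence[float],
    secondary: Sequence[float],
    min_ratio: float = 0.5,      # hypothetical: secondary/primary height ratio
    window: int = 10,            # hypothetical: positions examined downstream
    min_fraction: float = 0.6,   # hypothetical: share of double peaks required
) -> Optional[int]:
    """Return the first position that starts a sustained run of double peaks.

    Downstream of a heterozygous indel the two alleles are out of register,
    so most (not all) positions show two peaks of similar height. Positions
    where both alleles happen to share a base look homozygous, which is why
    only a fraction of the window must be flagged.
    """
    flags = [p > 0 and s / p >= min_ratio for p, s in zip(primary, secondary)]
    for start in range(len(flags) - window + 1):
        if flags[start] and sum(flags[start:start + window]) / window >= min_fraction:
            return start
    return None


def infer_indel_length(
    reference: str,
    onset: int,
    observed: Sequence[Tuple[str, str]],
    max_len: int = 30,
) -> Optional[int]:
    """Return the shift k that best explains the observed pairs of bases.

    Assumption: at trace position onset + j, one allele reads
    reference[onset + j] and the other reads reference[onset + j + k].
    """
    best_len, best_score = None, 0.0
    for k in range(1, max_len + 1):
        n = min(len(observed), len(reference) - onset - k)
        if n <= 0:
            break
        matches = sum(
            set(observed[j]) == {reference[onset + j], reference[onset + j + k]}
            for j in range(n)
        )
        score = matches / n
        if score > best_score:
            best_len, best_score = k, score
    return best_len


# Toy data (fabricated for illustration): 8 clean positions, then 12 mixed ones.
primary_heights = [1000.0] * 8 + [500.0] * 12
secondary_heights = [20.0] * 8 + [480.0] * 12
toy_reference = "ACGTTGCAAGCTTACGGATCCATGCAGT"
# Simulate a 3-bp deletion in one allele starting at position 8.
toy_pairs = [(toy_reference[8 + j], toy_reference[11 + j]) for j in range(10)]

onset = find_het_indel_onset(primary_heights, secondary_heights)
length = infer_indel_length(toy_reference, onset, toy_pairs, max_len=5)
print(onset, length)
```

On real traces, fixed thresholds of this kind would be sensitive to noise and dye effects, which is one reason a statistical treatment of peak heights and quality values is preferable to a simple cutoff.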
Received 5 June; accepted 17 October; published online 19 November 2006; doi:10.1038/ng1925
1Department of Bioengineering, 2Department of Genome Sciences and 3Department of Statistics, University of Washington, Seattle, Washington 98195, USA.
Correspondence should be addressed to T.B. (email@example.com) or D.A.N. (firstname.lastname@example.org).
NATURE GENETICS VOLUME 38 | NUMBER 12 | DECEMBER 2006 1457
© 2006 Nature Publishing Group http://www.nature.com/naturegenetics
16. Carlson, C.S. et al. Additional SNPs and linkage-disequilibrium analyses are
necessary for whole-genome association studies in humans. Nat. Genet. 33, 518–521
17. Livingston, R.J. et al. Pattern of sequence variation across 213 environmental
response genes. Genome Res. 14, 1821–1831 (2004).
18. International HapMap Consortium. A haplotype map of the human genome. Nature
437, 1299–1320 (2005).
19. Manaster, C. et al. InSNP: a tool for automated detection and visualization of SNPs and
InDels. Hum. Mutat. 26, 11–19 (2005).
20. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated sequencer
traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
21. Locke, D.P. et al. Linkage disequilibrium and heritability of copy-number polymorphisms
within duplicated regions of the human genome. Am. J. Hum. Genet. 79,
22. Newman, T.L. et al. High-throughput genotyping of intermediate-size structural variation.
Hum. Mol. Genet. 15, 1159–1167 (2006).
23. Klein, R.J. et al. Complement factor H polymorphism in age-related macular
degeneration. Science 308, 385–389 (2005).
24. Ahn, J. et al. Cloning of the putative tumour suppressor gene for hereditary multiple
exostoses (EXT1). Nat. Genet. 11, 137–143 (1995).
25. Rockman, M.V. et al. Positive selection on MMP3 regulation has shaped heart disease
risk. Curr. Biol. 14, 1531–1539 (2004).
26. Eichler, E.E. Widening the spectrum of human genetic variation. Nat. Genet. 38, 9–11
27. Weber, J.L. et al. Human diallelic insertion/deletion polymorphisms. Am. J. Hum.
Genet. 71, 854–862 (2002).
28. Mills, R.E. et al. An initial map of insertion and deletion (INDEL) variation in the
human genome. Genome Res. 16, 1182–1190 (2006).
29. Kruglyak, L. & Nickerson, D.A. Variation is the spice of life. Nat. Genet. 27, 234–236
30. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error
probabilities. Genome Res. 8, 186–194 (1998).
31. Gordon, D., Abajian, C. & Green, P. Consed: a graphical tool for sequence finishing.
Genome Res. 8, 195–202 (1998).
32. Needleman, S.B. & Wunsch, C.D. A general method applicable to the search for
similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970).