De novo assembly and genotyping of variants using colored de Bruijn graphs

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.
Nature Genetics (Impact Factor: 29.65). 02/2012; 44(2):226-32. DOI: 10.1038/ng.1028
Source: PubMed

ABSTRACT Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.
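
To make the central idea of the abstract concrete, the following is a minimal sketch of a colored de Bruijn graph. It is not Cortex's implementation: the function names (`build_colored_graph`, `successors`, `private_kmers`), the toy sequences, and the choice of k = 4 are all assumptions made for illustration. Each k-mer is a node, and its "colors" are the samples in which it was observed; k-mers carried by only one color hint at variation between samples.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-mer of seq in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_colored_graph(samples, k):
    """Map each k-mer (node) to the set of sample names (colors) containing it.

    `samples` is a dict: sample name -> list of sequences (hypothetical toy input).
    """
    colors = defaultdict(set)
    for name, seqs in samples.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                colors[kmer].add(name)
    return colors

def successors(colors, kmer):
    """Return k-mers present in the graph that overlap `kmer` by k-1 bases."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in colors]

def private_kmers(colors, name):
    """Return the sorted k-mers seen only in the sample called `name`."""
    return sorted(kmer for kmer, seen in colors.items() if seen == {name})

# Toy data (illustrative only): two sequences differing by a single base.
toy_samples = {"ref": ["ACGTACGA"], "sample": ["ACGTTCGA"]}
graph = build_colored_graph(toy_samples, k=4)

shared_branches = successors(graph, "ACGT")    # node where the two colors diverge
sample_only = private_kmers(graph, "sample")   # k-mers unique to "sample"
```

In this toy graph the node `ACGT` carries both colors and has two successors, one per color, which is the kind of branch point a variant caller would inspect. Real data would additionally need reverse-complement handling, coverage counts per color, and error cleaning, none of which is shown here.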

    • "Using whole-genome deep-sequencing, we explored whole bacterial genomes at high resolution, allowing more detailed analyses than pathotype or MLST schemes that study only small regions of the genome. We employed a multi-sample de novo assembly algorithm, Cortex, that simultaneously assembles genomes and calls variants (Iqbal et al., 2012). This method calls variants independently of a reference genome."
    ABSTRACT: Uropathogenic Escherichia coli (UPEC) are phenotypically and genotypically very diverse. This diversity makes it challenging to understand the evolution of UPEC adaptations responsible for causing urinary tract infections (UTI). To gain insight into the relationship between evolutionary divergence and adaptive paths to uropathogenicity, we sequenced at deep coverage (190x) the genomes of 19 E. coli strains from urinary tract infection patients from the same geographic area. Our sample consisted of 14 UPEC isolates and 5 non-UTI-causing (commensal) rectal E. coli isolates. After identifying strain variants using de novo assembly-based methods, we clustered the strains based on pairwise sequence differences using a neighbor-joining algorithm. We examined evolutionary signals on the whole-genome phylogeny and contrasted these signals with those found on gene trees constructed based on specific uropathogenic virulence factors. The whole-genome phylogeny showed that the divergence between UPEC and commensal E. coli strains without known UPEC virulence factors happened over 32 million generations ago. Pairwise diversity between any two strains was also high, suggesting multiple genetic origins of uropathogenic strains in a small geographic region. Contrasting the whole-genome phylogeny with three gene trees constructed from common uropathogenic virulence factors, we detected no selective advantage of these virulence genes over other genomic regions. These results suggest that UPEC acquired uropathogenicity a long time ago and used it opportunistically to cause extraintestinal infections. Copyright © 2015. Published by Elsevier B.V.
    Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases 06/2015; DOI:10.1016/j.meegid.2015.06.023 · 3.26 Impact Factor
    • "Parrish et al. (Parrish et al., 2013) suggested a reference-guided whole-genome assembly approach that makes the results of several samples more compatible. Iqbal et al. (Iqbal et al., 2012) developed Cortex, a program that rigorously assembles the whole genomes of several individuals at the same time based on colored de-Bruijn graphs. However, the tests in the Cortex paper were limited to relatively few individuals or to pooled data."
    ABSTRACT: The detection of genomic structural variation (SV) has advanced tremendously in recent years due to progress in high-throughput sequencing technologies. Novel sequence insertions, insertions without similarity to a human reference genome, have received less attention than other types of SVs due to the computational challenges in their detection from short read sequencing data, which inherently involves de novo assembly. De novo assembly is not only computationally challenging, but also requires high-quality data. While the reads from a single individual may not always meet this requirement, using reads from multiple individuals can increase power to detect novel insertions. We have developed the program PopIns, which can discover and characterize non-reference insertions of 100 bp or longer on a population scale. In this paper, we describe the approach we implemented in PopIns. It takes as input a reads-to-reference alignment, assembles unaligned reads using a standard assembly tool, merges the contigs of different individuals into high-confidence sequences, anchors the merged sequences into the reference genome, and finally genotypes all individuals for the discovered insertions. Our tests on simulated data indicate that the merging step greatly improves the quality and reliability of predicted insertions and that PopIns shows significantly better recall and precision than the recent tool MindTheGap. Preliminary results on a data set of 305 Icelanders demonstrate the practicality of the new approach. © The Author (2015). Published by Oxford University Press. All rights reserved.
    Bioinformatics 04/2015; DOI:10.1093/bioinformatics/btv273 · 4.62 Impact Factor
    • "We followed the workflow described in the Cortex paper (Iqbal et al. 2012) to find novel allelic genes in the KHV trio."
    ABSTRACT: Here we present the first whole genome analysis of an anonymous Kinh Vietnamese (KHV) trio whose genomes were deeply sequenced to 30-fold average coverage. The resulting short reads covered 99.91 percent of the human reference genome (GRCh37d5). We identified 4,719,412 SNPs and 827,385 short indels that satisfied the Mendelian inheritance law. Among them, 109,914 (2.3 percent) SNPs and 59,119 (7.1 percent) short indels were novel. We also detected 30,171 structural variants of which 27,604 (91.5 percent) were large indels. There were 6,681 large indels in the range 0.1-100 kbp occurring in the child genome that were also confirmed in either the father or mother genome. We compared these large indels against the DGV database and found that 1,499 (22.44 percent) were KHV specific. De novo assembly of high-quality unmapped reads yielded 789 contigs with length greater than or equal to 300 bp. There were 235 contigs from the child genome of which 199 (84.7 percent) significantly matched at least one contig from the father or mother genome. Blasting these 199 contigs against other alternative human genomes revealed 4 novel contigs. The novel variants identified in our study demonstrate the necessity of conducting more genome-wide studies not only for Kinh but also for other ethnic groups in Vietnam.
    Journal of Biosciences 03/2015; 40(1):113-124. DOI:10.1007/s12038-015-9501-0 · 1.94 Impact Factor
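
The trio abstract above keeps only variants that "satisfied the Mendelian inheritance law". As a rough illustration of what such a filter checks at a single diploid site, here is a minimal sketch. The function name `is_mendelian_consistent` and the genotype encoding (a pair of allele indices, 0 for reference and 1 for alternate) are assumptions for this example, not details taken from that study.

```python
def is_mendelian_consistent(child, father, mother):
    """Return True if the child's two alleles can be explained by
    inheriting one allele from the father and one from the mother.

    Each genotype is a pair of allele indices, e.g. (0, 1) for a
    heterozygous site (hypothetical encoding for illustration).
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

# Toy genotypes (illustrative only).
consistent_site = is_mendelian_consistent((0, 1), (0, 0), (1, 1))
violating_site = is_mendelian_consistent((1, 1), (0, 0), (0, 1))
```

The first call describes a heterozygous child of two opposite homozygous parents, which is consistent; the second describes a child homozygous for an allele the father does not carry, which is not. A real pipeline would also have to handle missing calls, sex chromosomes, and genotype uncertainty.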