De novo assembly and genotyping of variants using colored de Bruijn graphs

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.
Nature Genetics (Impact Factor: 29.35). 02/2012; 44(2):226-32. DOI: 10.1038/ng.1028
Source: PubMed


Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.
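The colored de Bruijn graph idea in the abstract can be illustrated with a minimal Python sketch. This is not Cortex's implementation or its API: the function names (`build_colored_graph`, `branch_points`, `private_kmers`), the toy sequences, and the choice k = 3 are all assumptions made for illustration, and the sketch ignores reverse complements, sequencing errors, and coverage. Each k-mer is a node, each sample is a "color", and a node with more than one successor marks a place where the samples may diverge (the start of a variant "bubble").

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_colored_graph(samples, k):
    """Map each k-mer to the set of sample names (colors) that contain it.

    `samples` is a dict: sample name -> list of sequences (hypothetical input format).
    """
    colors = defaultdict(set)
    for name, seqs in samples.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                colors[kmer].add(name)
    return colors

def successors(colors, kmer):
    """Return graph k-mers whose first k-1 bases equal the last k-1 bases of kmer."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in colors]

def branch_points(colors):
    """Return k-mers with more than one successor (candidate bubble starts)."""
    return sorted(k for k in list(colors) if len(successors(colors, k)) > 1)

def private_kmers(colors, name):
    """Return k-mers seen only in the sample called `name`."""
    return sorted(k for k, c in colors.items() if c == {name})

# Toy data: two sequences differing by a single base (T vs A at position 5).
samples = {"ref": ["AACGTTGCA"], "alt": ["AACGATGCA"]}
graph = build_colored_graph(samples, k=3)

print(branch_points(graph))         # k-mers where the paths split
print(private_kmers(graph, "ref"))  # k-mers on the ref-only branch
print(private_kmers(graph, "alt"))  # k-mers on the alt-only branch
```

In this toy case the shared k-mers carry both colors, while the two branches of the bubble each carry a single color, which is what lets a variant be assigned to a sample without aligning to a reference.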

Available from: PubMed Central
  • Source
    • "However, the reference often contains errors and gaps, creating a different set of challenges and problems. In fact, if the sample is highly divergent from the reference or if the reference is missing large regions, it may even be preferable to use de novo assembly [11]. While the development of methods for sequence assembly received significant attention, ultimate limits of their performance have been less explored. "
    ABSTRACT: At the core of high-throughput DNA sequencing platforms lies a biophysical surface process, bridge amplification, that results in a random geometry of clusters of homogeneous short DNA fragments, typically hundreds of base pairs long. The statistical properties of this random process and the length of the fragments are critical, as they affect the information that can subsequently be extracted, i.e., the density of successfully inferred DNA fragment reads. The ensemble of overlapping DNA fragment reads is then used to computationally reconstruct the much longer target genome sequence, e.g., ranging from hundreds of thousands to billions of base pairs. The success of the reconstruction in turn depends on having a sufficiently large ensemble of DNA fragments that are sufficiently long. In this paper, using stochastic geometry, we model and optimize the end-to-end process linking and partially controlling the statistics of the physical processes to the success of the computational step. This provides, for the first time, a framework capturing salient features of such sequencing platforms that can be used to study cost, performance or sensitivity of the sequencing process.
  • Source
    • "Using whole-genome deep-sequencing, we explored whole bacterial genomes at high resolution, allowing more detailed analyses than pathotype or MLST schemes that study only small regions of the genome. We employed a multi-sample de novo assembly algorithm, Cortex, that simultaneously assembles genomes and calls variants (Iqbal et al., 2012). This method calls variants independently of a reference genome. "
    ABSTRACT: Uropathogenic Escherichia coli (UPEC) are phenotypically and genotypically very diverse. This diversity makes it challenging to understand the evolution of UPEC adaptations responsible for causing urinary tract infections (UTI). To gain insight into the relationship between evolutionary divergence and adaptive paths to uropathogenicity, we sequenced at deep coverage (190x) the genomes of 19 E. coli strains from urinary tract infection patients from the same geographic area. Our sample consisted of 14 UPEC isolates and 5 non-UTI-causing (commensal) rectal E. coli isolates. After identifying strain variants using de novo assembly-based methods, we clustered the strains based on pairwise sequence differences using a neighbor-joining algorithm. We examined evolutionary signals on the whole-genome phylogeny and contrasted these signals with those found on gene trees constructed based on specific uropathogenic virulence factors. The whole-genome phylogeny showed that the divergence between UPEC and commensal E. coli strains without known UPEC virulence factors happened over 32 million generations ago. Pairwise diversity between any two strains was also high, suggesting multiple genetic origins of uropathogenic strains in a small geographic region. Contrasting the whole-genome phylogeny with three gene trees constructed from common uropathogenic virulence factors, we detected no selective advantage of these virulence genes over other genomic regions. These results suggest that UPEC acquired uropathogenicity a long time ago and used it opportunistically to cause extraintestinal infections. Copyright © 2015. Published by Elsevier B.V.
    Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases 06/2015; 34. DOI:10.1016/j.meegid.2015.06.023 · 3.02 Impact Factor
  • Source
    • "Parrish et al. (Parrish et al., 2013) suggested a reference-guided whole-genome assembly approach that makes the results of several samples more compatible. Iqbal et al. (Iqbal et al., 2012) developed Cortex, a program that rigorously assembles the whole genomes of several individuals at the same time based on colored de-Bruijn graphs. However, the tests in the Cortex paper were limited to relatively few individuals or to pooled data. "
    ABSTRACT: The detection of genomic structural variation (SV) has advanced tremendously in recent years due to progress in high-throughput sequencing technologies. Novel sequence insertions, that is, insertions without similarity to a human reference genome, have received less attention than other types of SVs due to the computational challenges in their detection from short-read sequencing data, which inherently involves de novo assembly. De novo assembly is not only computationally challenging, but also requires high-quality data. While the reads from a single individual may not always meet this requirement, using reads from multiple individuals can increase power to detect novel insertions. We have developed the program PopIns, which can discover and characterize non-reference insertions of 100 bp or longer on a population scale. In this paper, we describe the approach we implemented in PopIns. It takes as input a reads-to-reference alignment, assembles unaligned reads using a standard assembly tool, merges the contigs of different individuals into high-confidence sequences, anchors the merged sequences into the reference genome, and finally genotypes all individuals for the discovered insertions. Our tests on simulated data indicate that the merging step greatly improves the quality and reliability of predicted insertions and that PopIns shows significantly better recall and precision than the recent tool MindTheGap. Preliminary results on a data set of 305 Icelanders demonstrate the practicality of the new approach. The source code of PopIns is available from
    Bioinformatics 04/2015; DOI:10.1093/bioinformatics/btv273 · 4.98 Impact Factor