De novo assembly and genotyping of variants using colored De Bruijn graphs

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.
Nature Genetics (Impact Factor: 29.35). 02/2012; 44(2):226-32. DOI: 10.1038/ng.1028
Source: PubMed


Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.
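To make the central data structure concrete, here is a minimal sketch of a colored de Bruijn graph, written for illustration only and not taken from Cortex. It assumes that each sample is one "color", that nodes are k-mers, and that a node with more than one successor marks a candidate variant site (the start of a "bubble"). The function names, the toy sequences and the choice of k = 4 are all hypothetical. The sketch also ignores reverse complements, sequencing errors and coverage, which a real assembler must handle.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_colored_graph(samples, k):
    """Build a toy colored de Bruijn graph.

    samples maps a color (sample name) to a list of sequences.
    Returns (colors, edges):
      colors[kmer]      -> set of colors in which the k-mer occurs
      edges[kmer][next] -> set of colors supporting the edge kmer -> next
    """
    colors = defaultdict(set)
    edges = defaultdict(lambda: defaultdict(set))
    for color, seqs in samples.items():
        for seq in seqs:
            prev = None
            for km in kmers(seq, k):
                colors[km].add(color)
                if prev is not None:
                    edges[prev][km].add(color)
                prev = km
    return colors, edges


def branch_points(edges):
    """Return k-mers with more than one successor (candidate bubble starts)."""
    return {km: dict(succ) for km, succ in edges.items() if len(succ) > 1}


# Toy input: two colors that differ by a single base (A vs T at position 5).
samples = {
    "ref": ["ACGTACGGA"],
    "sample": ["ACGTTCGGA"],
}
colors, edges = build_colored_graph(samples, k=4)

# k-mers seen only in "sample": these carry the variant allele.
private = sorted(km for km, c in colors.items() if c == {"sample"})
print(private)
print(branch_points(edges))
```

In this toy case the two colors share the flanking k-mers `ACGT` and `CGGA` and diverge in between, so the single-base difference appears as one branch point whose two outgoing edges carry different colors. Keeping the colors on nodes and edges is what lets the two alleles be told apart within a single graph.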

  • Source
    • "Nonetheless, the mapping-based approach does not consider the correlation between sequence reads when performing alignment and typically focuses only on a single variant type, which may result in inconsistent variant calling when different types of variants cluster. Several of these flaws can potentially be avoided through de novo assembly since it is agnostic with regard to variant type and divergence of sample sequence from the reference genome (Carnevali et al. 2012; Iqbal et al. 2012; Li 2012). Thus, assembly based variant calling is often treated as a complement to mapping based calling. "
    Dataset: chapter4

    Full-text · Dataset · Jan 2016
  • Source
    • "However, the reference often contains errors and gaps, creating a different set of challenges and problems. In fact, if the sample is highly divergent from the reference or if the reference is missing large regions, it may even be preferable to use de novo assembly [11]. While the development of methods for sequence assembly received significant attention, ultimate limits of their performance have been less explored. "
    ABSTRACT: At the core of high-throughput DNA sequencing platforms lies a biophysical surface process, bridge amplification, that results in a random geometry of clusters of homogeneous short DNA fragments, typically hundreds of base pairs long. The statistical properties of this random process and the length of the fragments are critical, as they affect the information that can be subsequently extracted, i.e., the density of successfully inferred DNA fragment reads. The ensemble of overlapping DNA fragment reads is then used to computationally reconstruct the much longer target genome sequence, e.g., ranging from hundreds of thousands to billions of base pairs. The success of the reconstruction in turn depends on having a sufficiently large ensemble of DNA fragments that are sufficiently long. In this paper, using stochastic geometry, we model and optimize the end-to-end process, linking and partially controlling the statistics of the physical processes to the success of the computational step. This provides, for the first time, a framework capturing salient features of such sequencing platforms that can be used to study cost, performance or sensitivity of the sequencing process.
    Preview · Article · Aug 2015
  • Source
    • "Using whole-genome deep-sequencing, we explored whole bacterial genomes at high resolution, allowing more detailed analyses than pathotype or MLST schemes that study only small regions of the genome. We employed a multi-sample de novo assembly algorithm, Cortex, that simultaneously assembles genomes and calls variants (Iqbal et al., 2012). This method calls variants independently of a reference genome. "
    ABSTRACT: Uropathogenic Escherichia coli (UPEC) are phenotypically and genotypically very diverse. This diversity makes it challenging to understand the evolution of UPEC adaptations responsible for causing urinary tract infections (UTI). To gain insight into the relationship between evolutionary divergence and adaptive paths to uropathogenicity, we sequenced at deep coverage (190x) the genomes of 19 E. coli strains from urinary tract infection patients from the same geographic area. Our sample consisted of 14 UPEC isolates and 5 non-UTI-causing (commensal) rectal E. coli isolates. After identifying strain variants using de novo assembly-based methods, we clustered the strains based on pairwise sequence differences using a neighbor-joining algorithm. We examined evolutionary signals on the whole-genome phylogeny and contrasted these signals with those found on gene trees constructed based on specific uropathogenic virulence factors. The whole-genome phylogeny showed that the divergence between UPEC and commensal E. coli strains without known UPEC virulence factors happened over 32 million generations ago. Pairwise diversity between any two strains was also high, suggesting multiple genetic origins of uropathogenic strains in a small geographic region. Contrasting the whole-genome phylogeny with three gene trees constructed from common uropathogenic virulence factors, we detected no selective advantage of these virulence genes over other genomic regions. These results suggest that UPEC acquired uropathogenicity a long time ago and used it opportunistically to cause extraintestinal infections. Copyright © 2015. Published by Elsevier B.V.
    Full-text · Article · Jun 2015 · Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases