De novo assembly of highly diverse viral populations

The Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA.
BMC Genomics (Impact Factor: 4.04). 09/2012; 13(1):475. DOI: 10.1186/1471-2164-13-475
Source: PubMed

ABSTRACT Extensive genetic diversity in viral populations within infected hosts and the divergence of variants from existing reference genomes impede the analysis of deep viral sequencing data. A de novo population consensus assembly is valuable both as a single linear representation of the population and as a backbone on which intra-host variants can be accurately mapped. The availability of consensus assemblies and robustly mapped variants is crucial to the genetic study of viral disease progression, transmission dynamics, and viral evolution. Existing de novo assembly techniques fail to robustly assemble ultra-deep sequence data from genetically heterogeneous populations such as viruses into full-length genomes due to the presence of extensive genetic variability, contaminants, and variable sequence coverage.
We present VICUNA, a de novo assembly algorithm suitable for generating consensus assemblies from genetically heterogeneous populations. We demonstrate its effectiveness on Dengue, Human Immunodeficiency, and West Nile viral populations, representing a range of intra-host diversity. Compared to state-of-the-art assemblers designed for haploid or diploid systems, VICUNA recovers full-length consensus and captures insertion/deletion polymorphisms in diverse samples. Final assemblies maintain a high base calling accuracy. The VICUNA program is publicly available at: viral-genomics-analysis-software.
We developed VICUNA, a publicly available software tool that enables consensus assembly of ultra-deep sequence data derived from diverse viral populations. While VICUNA was developed for the analysis of viral populations, its application to other heterogeneous sequence data sets, such as metagenomic or tumor cell population samples, may prove beneficial in these fields of research.


Available from: Xiao Yang, Jul 13, 2015
  • Source
    • "These discordances were however enriched in highly divergent regions, as was shown from the analysis of the Shannon entropy, and the use of a sample-specific reference sequence for read mapping may thus be important for drug resistance studies. As dedicated software for read data pre-processing and analysis is becoming more widely available (Archer et al., 2012; Henn et al., 2012; Yang et al., 2012), it can be expected that the use of NGS platforms in a clinical setting will increase in the near future for HIV (e.g. Dudley et al., 2012; Bellecave et al., 2013), but for HCV this still needs further investigation (Akuta et al., 2013; Nasu et al., 2011). "
    ABSTRACT: A near-full genome genotypic assay for HCV1b was developed, which may prove useful to investigate antiviral drug resistance, given new combination therapies for HCV1 infection. The assay consists of 3 partially overlapping PCRs followed by Sanger population or Illumina next-generation sequencing. Seventy-seven therapy-naïve samples, spanning the entire diversity range of currently known HCV1b, were used for optimization of PCRs, of which ten were sequenced using Sanger and of these ten, four using Illumina. The median detection limits for the 3 regions, 5'UTR-NS2, E2-NS5A and NS4B-NS5B, were 570, 5670 and 56670 IU/ml respectively. The number of Illumina reads mapped varied according to the software used, with Segminator II performing best (81%). Consensus Illumina and Sanger sequencing results largely agree (0.013% major discordances). Differences were due almost exclusively to a larger number of ambiguities (presumably minority variants) scored by Illumina (1.50% minor discordances). The assay is easy to perform in an equipped laboratory; nevertheless, it was difficult to reach high sensitivity and reproducibility because of the high genetic variability of the virus. This assay proved to be suitable for detecting drug resistance mutations and can also be used for epidemiological research, even though only a limited set of samples was used for validation.
    Journal of Virological Methods 09/2014; 209. DOI:10.1016/j.jviromet.2014.09.009 · 1.88 Impact Factor
  • Source
    • "We build a consensus from paired-end reads using Vicuna (Yang et al., 2012). Our sequencing method should not contain any particularly low coverage region allowing reconstruction of population consensus for viral sample. "
    ABSTRACT: Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation–maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads. Availability: Our tool VGA is freely available at Contact:;
    Bioinformatics 06/2014; 30(12):i329-i337. DOI:10.1093/bioinformatics/btu295 · 4.62 Impact Factor
  • Source
    ABSTRACT: To date, very large scale sequencing of many clinically important RNA viruses has been complicated by their high population molecular variation, which creates challenges for polymerase chain reaction and sequencing primer design. Many RNA viruses are also difficult or currently not possible to culture, severely limiting the amount and purity of available starting material. Here, we describe a simple, novel, high-throughput approach to Norovirus and Hepatitis C virus whole genome sequence determination based on RNA shotgun sequencing (also known as RNA-Seq). We demonstrate the effectiveness of this method by sequencing three Norovirus samples from faeces and two Hepatitis C virus samples from blood, on an Illumina MiSeq benchtop sequencer. More than 97% of reference genomes were recovered. Compared with Sanger sequencing, our method had no nucleotide differences in 14,019 nucleotides (nt) for Noroviruses (from a total of 2 Norovirus genomes obtained with Sanger sequencing), and 8 variants in 9,542 nt for Hepatitis C virus (1 variant per 1,193 nt). The three Norovirus samples had 2, 3, and 2 distinct positions called as heterozygous, while the two Hepatitis C virus samples had 117 and 131 positions called as heterozygous. To confirm that our sample and library preparation could be scaled to true high-throughput, we prepared and sequenced an additional 77 Norovirus samples in a single batch on an Illumina HiSeq 2000 sequencer, recovering >90% of the reference genome in all but one sample. No discrepancies were observed across 118,757 nt compared between Sanger and our custom RNA-Seq method in 16 samples. By generating viral genomic sequences that are not biased by primer-specific amplification or enrichment, this method offers the prospect of large-scale, affordable studies of RNA viruses which could be adapted to routine diagnostic laboratory workflows in the near future, with the potential to directly characterize within-host viral diversity.
    PLoS ONE 06/2013; 8(6):e66129. DOI:10.1371/journal.pone.0066129 · 3.53 Impact Factor