De novo assembly of highly diverse viral populations

The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
BMC Genomics (Impact Factor: 3.99). 09/2012; 13(1):475. DOI: 10.1186/1471-2164-13-475
Source: PubMed


Extensive genetic diversity in viral populations within infected hosts and the divergence of variants from existing reference genomes impede the analysis of deep viral sequencing data. A de novo population consensus assembly is valuable both as a single linear representation of the population and as a backbone on which intra-host variants can be accurately mapped. The availability of consensus assemblies and robustly mapped variants is crucial to the genetic study of viral disease progression, transmission dynamics, and viral evolution. Existing de novo assembly techniques fail to robustly assemble ultra-deep sequence data from genetically heterogeneous populations such as viruses into full-length genomes because of extensive genetic variability, contaminants, and variable sequence coverage.
We present VICUNA, a de novo assembly algorithm suitable for generating consensus assemblies from genetically heterogeneous populations. We demonstrate its effectiveness on Dengue, Human Immunodeficiency, and West Nile viral populations, which represent a range of intra-host diversity. Compared to state-of-the-art assemblers designed for haploid or diploid systems, VICUNA recovers full-length consensus sequences and captures insertion/deletion polymorphisms in diverse samples. Final assemblies maintain high base-calling accuracy. The VICUNA program is publicly available at: viral-genomics-analysis-software.
We developed VICUNA, a publicly available software tool that enables consensus assembly of ultra-deep sequence data derived from diverse viral populations. Although VICUNA was developed for the analysis of viral populations, it may also prove useful for other heterogeneous sequence data sets, such as metagenomic or tumor cell population samples.
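
To make the idea of a population consensus with mapped intra-host variants concrete, here is a minimal sketch. It is not VICUNA's algorithm: it assumes the reads have already been aligned to a common coordinate system and padded to equal length, and it simply takes the most frequent base per column. The function name `column_consensus` and the `min_minor_freq` threshold are hypothetical, chosen for illustration only.

```python
from collections import Counter


def column_consensus(aligned_reads, min_minor_freq=0.05):
    """Majority-vote consensus over pre-aligned, equal-length reads.

    Illustrative only (an assumption, not VICUNA's method). Returns the
    consensus string and a dict mapping 0-based position to the minor
    alleles whose frequency is at least min_minor_freq.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to equal length")

    consensus = []
    variants = {}
    for pos in range(length):
        counts = Counter(read[pos] for read in aligned_reads)
        total = sum(counts.values())
        # Sort first so ties resolve to the alphabetically smallest base.
        major, _ = max(sorted(counts.items()), key=lambda item: item[1])
        consensus.append(major)
        minors = {
            base: count / total
            for base, count in counts.items()
            if base != major and count / total >= min_minor_freq
        }
        if minors:
            variants[pos] = minors
    return "".join(consensus), variants


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "ATGT", "ACGA"]
    sequence, minor_alleles = column_consensus(reads)
    print(sequence, minor_alleles)
```

The consensus string plays the role of the single linear representation, while the per-position minor-allele frequencies stand in for the intra-host variants mapped onto that backbone. A real pipeline would also have to handle alignment, indels, base qualities, and uneven coverage, all of which this sketch ignores.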


Available from: Xiao Yang
  • Source
    • "Reads of low quality (quality score < 30) or missing Illumina barcodes were removed. De novo assembly was performed with VICUNA, which was designed to assemble RNA viruses (Yang et al., 2012; Langmead and Salzberg, 2012). The de novo PRRSV sequence was remapped using Bowtie 2 and compared to the output of a de novo assembly to confirm sequence authenticity (Langmead and Salzberg, 2012). "
    ABSTRACT: In early 2014, a Minnesota sow farm with a solid vaccination history suffered a severe porcine reproductive and respiratory syndrome (PRRS) outbreak with unusually high morbidity and mortality in piglets and sows, as well as anorexia and secondary bacterial infections in nursery pigs. Due to the unusual clinical severity in a PRRS-immune herd, genetic characteristics of the virus were examined to determine if a new PRRSV genotype had emerged. Phylogenetic analysis indicated that the virulent strain (PRRSV2/USA/Minnesota414/2014) was related to virulent strains circulating in the mid-western United States in recent years, and that the nonstructural protein 2 (nsp2) gene of MN414 contained an insertion-deletion pattern typical of existing type 2 virulent strains. We conclude that the MN414 isolate is a recently evolved member of the virulent lineage 1 family of type 2 PRRSV. Copyright © 2015. Published by Elsevier B.V.
    Virus Research 07/2015; 210. DOI:10.1016/j.virusres.2015.07.004 · 2.32 Impact Factor
  • Source
    • "These discordances were however enriched in highly divergent regions, as was shown from the analysis of the Shannon entropy, and the use of a sample-specific reference sequence for read mapping may thus be important for drug resistance studies. As dedicated software for read data pre-processing and analysis is becoming more widely available (Archer et al., 2012; Henn et al., 2012; Yang et al., 2012), it can be expected that the use of NGS platforms in a clinical setting will increase in the near future for HIV (e.g. Dudley et al., 2012; Bellecave et al., 2013), but for HCV this still needs further investigation (Akuta et al., 2013; Nasu et al., 2011). "
    ABSTRACT: A near-full genome genotypic assay for HCV1b was developed, which may prove useful to investigate antiviral drug resistance, given new combination therapies for HCV1 infection. The assay consists of 3 partially overlapping PCRs followed by Sanger population or Illumina next-generation sequencing. Seventy-seven therapy-naïve samples, spanning the entire diversity range of currently known HCV1b, were used for optimization of PCRs, of which ten were sequenced using Sanger and of these ten, four using Illumina. The median detection limits for the 3 regions, 5'UTR-NS2, E2-NS5A and NS4B-NS5B, were 570, 5670 and 56670 IU/ml respectively. The number of Illumina reads mapped varied according to the software used, Segminator II being the best performing (81%). Consensus Illumina and Sanger sequencing results accord largely (0.013% major discordances). Differences were due almost exclusively to a larger number of ambiguities (presumably minority variants) scored by Illumina (1.50% minor discordances). The assay is easy to perform in an equipped laboratory, nevertheless it was difficult to reach high sensitivity and reproducibility, due to the high genetic viral variability. This assay proved to be suitable for detecting drug resistance mutations and can also be used for epidemiological research, even though only a limited set of samples was used for validation.
    Journal of Virological Methods 09/2014; 209. DOI:10.1016/j.jviromet.2014.09.009 · 1.78 Impact Factor
  • Source
    • "We build a consensus from paired-end reads using Vicuna (Yang et al., 2012). Our sequencing method should not contain any particularly low coverage region allowing reconstruction of population consensus for viral sample. "
    ABSTRACT: Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation–maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads. Availability: Our tool VGA is freely available.
    Bioinformatics 06/2014; 30(12):i329-i337. DOI:10.1093/bioinformatics/btu295 · 4.62 Impact Factor