De novo assembly of highly diverse viral populations

The Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA.
BMC Genomics (Impact Factor: 4.04). 09/2012; 13:475. DOI: 10.1186/1471-2164-13-475
Source: PubMed

ABSTRACT Extensive genetic diversity in viral populations within infected hosts and the divergence of variants from existing reference genomes impede the analysis of deep viral sequencing data. A de novo population consensus assembly is valuable both as a single linear representation of the population and as a backbone on which intra-host variants can be accurately mapped. The availability of consensus assemblies and robustly mapped variants is crucial to the genetic study of viral disease progression, transmission dynamics, and viral evolution. Existing de novo assembly techniques fail to robustly assemble ultra-deep sequence data from genetically heterogeneous populations such as viruses into full-length genomes due to the presence of extensive genetic variability, contaminants, and variable sequence coverage.
We present VICUNA, a de novo assembly algorithm suitable for generating consensus assemblies from genetically heterogeneous populations. We demonstrate its effectiveness on Dengue, Human Immunodeficiency and West Nile viral populations, representing a range of intra-host diversity. Compared to state-of-the-art assemblers designed for haploid or diploid systems, VICUNA recovers full-length consensus sequences and captures insertion/deletion polymorphisms in diverse samples. Final assemblies maintain a high base calling accuracy. The VICUNA program is publicly available at: viral-genomics-analysis-software.
We developed VICUNA, a publicly available software tool that enables consensus assembly of ultra-deep sequence data derived from diverse viral populations. While VICUNA was developed for the analysis of viral populations, its application to other heterogeneous sequence data sets, such as metagenomic or tumor cell population samples, may prove beneficial in these fields of research.


Available from: Xiao Yang, Jun 15, 2015
  • Source
    ABSTRACT: Advances in next generation sequencing make it possible to obtain high-coverage sequence data for large numbers of viral strains in a short time. However, since most bioinformatics tools are developed for command line use, the selection and accessibility of computational tools for genome assembly and variation analysis limit the ability of individual labs to perform further bioinformatics analysis. We have developed a multi-step viral genome assembly pipeline named VirAmp, which combines existing tools and techniques and presents them to end users via a web-enabled Galaxy interface. Our pipeline allows users to assemble, analyze, and interpret high coverage viral sequencing data with an ease and efficiency that was not possible previously. Our software makes a large number of genome assembly and related tools available to life scientists and automates the currently recommended best practices into a single, easy to use interface. We tested our pipeline with three different datasets from human herpes simplex virus (HSV). VirAmp provides a user-friendly interface and a complete pipeline for viral genome analysis. We make our software available via an Amazon Elastic Cloud disk image that can be easily launched by anyone with an Amazon web service account. A fully functional demonstration instance of our system can be found at We also maintain detailed documentation on each tool and methodology at
    04/2015; 4(1). DOI:10.1186/s13742-015-0060-y
  • Source
    ABSTRACT: To date, inter-genotypic recombinant hepatitis C viruses (HCV) and their treatment outcomes have not been well characterized. This study characterized 12 novel HCV recombinant strains and their response to sofosbuvir in combination with ribavirin (SOF/RBV) treatment. Across the phase II/III studies of sofosbuvir, HCV samples were genotyped using both the Siemens VERSANT® HCV Genotype INNO-LiPA 2.0 assay and NS5B sequencing. Among these patient samples, genotype assignment discordance between the two methods was found in 0.5% of all cases (12/2363), all of which were identified as genotype 2 by INNO-LiPA (12/487=2.5%). HCV full genome sequences were obtained for these 12 samples by a sequence-independent amplification method coupled with next-generation sequencing. HCV full genome sequencing revealed that these viruses were recombinant HCV strains with the 5’ part corresponding to genotype 2 and the 3’ part corresponding to genotype 1. The recombination breakpoint between genotype 2 and genotype 1 was consistently located within 80 amino acids of the NS2/NS3 junction. Interestingly, one of the recombinant viruses had a 34 amino acid duplication at the location of the recombination breakpoint. Eleven of these 12 patients were treated with a regimen for genotype 2 HCV infection, but responded as if they had genotype 1 infection; one patient had received placebo. Conclusion: Twelve new HCV inter-genotypic recombinant genotype 2/1 viruses have been characterized. The antiviral response to a 12-16 week course of SOF/RBV treatment in these patients was more similar to responses among genotype 1 patients than genotype 2 patients, consistent with their genotype 1 NS5B gene.
    Hepatology 08/2014; 61(2). DOI:10.1002/hep.27361 · 11.19 Impact Factor
  • Source
    ABSTRACT: A near-full genome genotypic assay for HCV1b was developed, which may prove useful to investigate antiviral drug resistance, given new combination therapies for HCV1 infection. The assay consists of 3 partially overlapping PCRs followed by Sanger population or Illumina next-generation sequencing. Seventy-seven therapy-naïve samples, spanning the entire diversity range of currently known HCV1b, were used for optimization of PCRs, of which ten were sequenced using Sanger and of these ten, four using Illumina. The median detection limits for the 3 regions, 5'UTR-NS2, E2-NS5A and NS4B-NS5B, were 570, 5670 and 56670 IU/ml respectively. The number of Illumina reads mapped varied according to the software used, Segminator II being the best performing (81%). Consensus Illumina and Sanger sequencing results largely agree (0.013% major discordances). Differences were due almost exclusively to a larger number of ambiguities (presumably minority variants) scored by Illumina (1.50% minor discordances). The assay is easy to perform in an equipped laboratory; nevertheless, it was difficult to reach high sensitivity and reproducibility, due to the high genetic viral variability. This assay proved to be suitable for detecting drug resistance mutations and can also be used for epidemiological research, even though only a limited set of samples was used for validation.
    Journal of Virological Methods 09/2014; 209. DOI:10.1016/j.jviromet.2014.09.009 · 1.88 Impact Factor