Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools

Department of Epidemiology, University of Texas, MD Anderson Cancer Center, Houston, TX, USA.
Bioinformatics (Impact Factor: 4.98). 12/2011; 28(3):421-2. DOI: 10.1093/bioinformatics/btr667
Source: PubMed


Storing, annotating and analyzing variants from next-generation sequencing projects can be difficult due to the availability of a wide array of data formats, tools and annotation sources, as well as the sheer size of the data files. Useful tools, including the GATK, ANNOVAR and BEDTools, can be integrated into custom pipelines for annotating and analyzing sequence variants. However, building flexible pipelines that support the tracking of variants alongside their samples, while enabling updated annotation and reanalyses, is not a simple task.
We have developed variant tools, a flexible annotation and analysis toolset that greatly simplifies the storage, annotation and filtering of variants and the analysis of the underlying samples. variant tools can be used to manage and analyze genetic variants obtained from sequence alignments, and the command-line driven toolset could be used as a foundation for building more sophisticated analytical methods.
variant tools consists of two command-line driven programs, vtools and vtools_report. It is freely available and distributed under a GPL license.
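
The idea of tracking variants alongside their samples while keeping annotations replaceable can be illustrated with a small relational sketch. The code below is only an illustration under stated assumptions: the table layout, the function names (`build_demo_db`, `add_call`, `samples_with_effect`) and the example data are hypothetical and are not the schema or interface of variant tools.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical three-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- One row per distinct variant, shared by all samples.
        CREATE TABLE variant (
            variant_id INTEGER PRIMARY KEY,
            chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
            UNIQUE (chrom, pos, ref, alt)
        );
        -- Which sample carries which variant.
        CREATE TABLE genotype (
            sample TEXT,
            variant_id INTEGER REFERENCES variant (variant_id)
        );
        -- Annotation is kept separate so it can be replaced independently.
        CREATE TABLE annotation (
            chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, effect TEXT
        );
        """
    )
    return conn


def add_call(conn, sample, chrom, pos, ref, alt):
    """Record that `sample` carries a variant, reusing an existing variant row."""
    key = (chrom, pos, ref, alt)
    conn.execute(
        "INSERT OR IGNORE INTO variant (chrom, pos, ref, alt) VALUES (?, ?, ?, ?)",
        key,
    )
    variant_id = conn.execute(
        "SELECT variant_id FROM variant "
        "WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
        key,
    ).fetchone()[0]
    conn.execute("INSERT INTO genotype VALUES (?, ?)", (sample, variant_id))


def samples_with_effect(conn, effect):
    """Return sorted sample names carrying any variant annotated with `effect`."""
    rows = conn.execute(
        """
        SELECT DISTINCT g.sample
        FROM genotype AS g
        JOIN variant AS v ON v.variant_id = g.variant_id
        JOIN annotation AS a
          ON a.chrom = v.chrom AND a.pos = v.pos
         AND a.ref = v.ref AND a.alt = v.alt
        WHERE a.effect = ?
        ORDER BY g.sample
        """,
        (effect,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    # Made-up calls and annotation, for illustration only.
    add_call(db, "S1", "1", 100, "A", "G")
    add_call(db, "S2", "1", 100, "A", "G")
    db.execute("INSERT INTO annotation VALUES ('1', 100, 'A', 'G', 'missense')")
    print(samples_with_effect(db, "missense"))
```

Because the annotation table is joined at query time, it can be emptied and reloaded from a newer source without touching the stored variants or sample genotypes, which is the kind of reannotation and reanalysis the abstract describes.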

    • "The most frequent methods used to annotate variants reported were Annovar [44] (52%), in-house developed software (17%), and Ingenuity (Redwood City, CA, USA) (12%). Other tools reported were Variant Tools [45], KggSeq [46], SG-ADVISER (Scripps Genome Annotation and Distributed Variant Interpretation Server, La Jolla, CA, USA), Genome Trax (Wolfenbüttel, Germany), VAAST (Variant Annotation and Search Tool) [47], Omicia Opal [48], MapSNPs [49], in-house pipelines, and combinations thereof. There were a large variety of annotation sources (see Table 4), including but not limited to: OMIM [50], Uniprot [51], SeattleSeq [52], SNPedia [53], NCBI ClinVar, PharmGKB [54], Human Gene Mutation Database [55], dbNSFP [56], and in-house annotations. "
    ABSTRACT: There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data was donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance. A total of 30 international groups were engaged. The entries reveal a general convergence of practices on most elements of the analysis and interpretation process. However, even given this commonality of approach, only two groups identified the consensus candidate variants in all disease cases, demonstrating a need for consistent fine-tuning of the generally accepted methods. There was greater diversity of the final clinical report content and in the patient consenting process, demonstrating that these areas require additional exploration and standardization. The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups.
    Article · Mar 2014 · Genome Biology
    • "Although most commercial tests have analysis packages alongside the sequencer, information about variants is limited to identifiers in dbSNP [Sherry et al., 2001] and COSMIC [Shepherd et al., 2011]; thus, sufficient information for functional interpretation of each identified variant is lacking. Several software packages were developed to tackle this issue [Wang et al., 2010; Asmann et al., 2012; San Lucas et al., 2012], but informatics expertise is required to operate command-line-driven software to obtain the highest yield of information. Furthermore, most existing packages are designed to interpret variants at the level of individual samples [Wang et al., 2010; Asmann et al., 2012; Douville et al., 2013] and leave cross-sample analysis to the end user, thus making it difficult to assess variants in a disease cohort study. "
    ABSTRACT: Targeted sequencing using next-generation sequencing technologies is currently being rapidly adopted for clinical sequencing and cancer marker tests. However, no existing bioinformatics tool is available for the analysis and visualization of multiple targeted sequencing datasets. In the present study, we use cancer panel targeted sequencing datasets generated by the Life Technologies Ion Personal Genome Machine (PGM) Sequencer as an example to illustrate how to develop an automated pipeline for the comparative analyses of multiple datasets. Cancer Panel Analysis Pipeline (CPAP) uses standard output files from variant calling software to generate a distribution map of SNPs among all of the samples in a circular diagram generated by Circos. The diagram is hyper-linked to a dynamic HTML table that allows the users to identify target SNPs by using different filters. CPAP also integrates additional information about the identified SNPs by linking to an integrated SQL database compiled from SNP-related databases, including dbSNP, 1000 Genomes Project, COSMIC and dbNSFP. CPAP takes only 17 minutes to complete a comparative analysis of 500 datasets. CPAP not only provides an automated platform for the analysis of multiple cancer panel datasets but can also serve as a model for any customized targeted sequencing project.
    Article · Oct 2013 · Human Mutation
    • "It has been recommended by the Faculty of 1000 and adopted by various software (e.g. Lindenbaum et al. 2011; Li et al. 2012; San Lucas et al. 2012; Chang and Wang 2012; Sifrim et al. 2012; Zhang et al. 2013) and databases (e.g. Li et al. 2011). "
    ABSTRACT: dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. This database significantly facilitates the process of querying predictions and annotations from different databases/web-servers for large amounts of nsSNVs discovered in exome-sequencing studies. Here we report a recent major update of the database to version 2.0. We have rebuilt the SNV collection based on GENCODE 9 and currently the database includes 87,347,043 nsSNVs and 2,270,742 essential splice site SNVs (an 18% increase compared to dbNSFP v1.0). For each nsSNV dbNSFP v2.0 has added two prediction scores (MutationAssessor and FATHMM) and two conservation scores (GERP++ and SiPhy). The original five prediction and conservation scores in v1.0 (SIFT, Polyphen2, LRT, MutationTaster and PhyloP) have been updated. Rich functional annotations for SNVs and genes have also been added into the new version, including allele frequencies observed in the 1000 Genomes Project phase 1 data and the NHLBI Exome Sequencing Project, various gene IDs from different databases, functional descriptions of genes, gene expression and gene interaction information, among others. dbNSFP v2.0 is freely available for download. ©2013 Wiley-Liss, Inc.
    Article · Sep 2013 · Human Mutation