Vol. 28 no. 3 2012, pages 421–422
Integrated annotation and analysis of genetic variants from
next-generation sequencing studies with variant tools
F. Anthony San Lucas1,2, Gao Wang3, Paul Scheet1,2 and Bo Peng4,∗
1Department of Epidemiology, University of Texas, MD Anderson Cancer Center, 2Program in Biomathematics and
Biostatistics, University of Texas Graduate School of Biomedical Sciences and 3Department of Molecular and Human
Genetics, Baylor College of Medicine, Houston, TX, USA, 4Department of Genetics, University of Texas,
MD Anderson Cancer Center
Associate Editor: Alex Bateman
Advance Access publication December 2, 2011
Motivation: Storing, annotating and analyzing variants from
next-generation sequencing projects can be difficult due to the
availability of a wide array of data formats, tools and annotation
sources, as well as the sheer size of the data files. Useful tools,
including the GATK, ANNOVAR and BEDTools, can be integrated into
custom pipelines for annotating and analyzing sequence variants.
However, building flexible pipelines that support the tracking of
variants alongside their samples, while enabling updated annotation
and reanalyses, is not a simple task.
Results: We have developed variant tools, a flexible annotation
and analysis toolset that greatly simplifies the storage, annotation
and filtering of variants and the analysis of the underlying samples.
variant tools can be used to manage and analyze genetic variants
obtained from sequence alignments, and the command-line driven
toolset could be used as a foundation for building more sophisticated
analysis pipelines.
Availability and implementation: variant tools consists of two
command-line driven programs, vtools and vtools_report. It
is freely available at http://varianttools.sourceforge.net, distributed
under a GPL license.
Received on August 18, 2011; revised on November 23, 2011;
accepted on November 29, 2011
Tracking samples and predicted variants from next-generation
sequencing projects often requires building custom analysis
pipelines. Data standards such as the Browser Extensible Data
(BED) (Hinrichs et al., 2006), General Feature Format and Variant
Call Format (VCF) (Danecek et al., 2011) file specifications can be
used to represent these variants in a common format, simplifying
integration of tools and the construction of these analysis pipelines.
Difficulties include the integration of diverse annotation sources and
of predicted variants and millions more associated annotations
for each sample. These annotation sources and intermediate files
often have fundamental inconsistencies, such as the use of either 0- or
1-based coordinates and potentially different genomic builds, which can
complicate their management and integration.
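The coordinate inconsistency mentioned above can be illustrated with a minimal Python sketch. It assumes the usual conventions that BED intervals are 0-based and half-open while VCF positions are 1-based; the `Interval` type and the two helper functions are hypothetical names introduced here for illustration and are not part of variant tools.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval; the coordinate convention depends on context."""
    chrom: str
    start: int
    end: int


def bed_to_one_based(iv: Interval) -> Interval:
    """Convert a 0-based, half-open interval (BED) to 1-based, closed coordinates."""
    return Interval(iv.chrom, iv.start + 1, iv.end)


def one_based_to_bed(iv: Interval) -> Interval:
    """Convert a 1-based, closed interval to 0-based, half-open coordinates (BED)."""
    return Interval(iv.chrom, iv.start - 1, iv.end)


# Toy example: a single-base variant at 1-based position 1000 (as in a VCF record).
snv_one_based = Interval("chr1", 1000, 1000)
snv_bed = one_based_to_bed(snv_one_based)
print(snv_bed)
```

Only the start coordinate shifts between the two conventions, so a file read under the wrong assumption places every variant one base away from its true position, which is why such conversions need to be applied consistently at import time.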
∗To whom correspondence should be addressed.
For biologists or analysts who are familiar with programming
and running tools from the command line, there are many
useful tools that can be integrated into custom pipelines to
annotate and filter variants. These tools include ANNOVAR
(Wang et al., 2010) and BEDTools (Quinlan and Hall, 2010).
However, building effective pipelines that relate variants to their
samples and sample attributes (such as cases and controls), while
applying multiple annotation sources, requires a large customization
effort. A framework for building pipelines that facilitate simple,
reproducible and recurrent analyses is currently lacking. Therefore,
we have developed variant tools, a flexible, open-source toolset
upon which custom pipelines can be easily constructed. This toolset
facilitates the storage of variants (alongside their sample details) as
well as the annotation, filtering and reporting of these variants at
multiple levels—starting with variant reports based on individual
samples to project-wide variant reports.
variant tools is a command-line driven toolset written in the Python scripting
language that incorporates either SQLite or MySQL as a backend database
management system. The toolset is used to create a variant project, which
is conceptually designed around a master variant table that often consists of
millions of variants for all of the samples in a sequencing project along
with variant attributes (called fields in variant tools). Variant fields can
include sample statistics, which variant tools can generate, or information
provided by annotation data sources. Regardless of the source of these
fields, they can be used to select, output and analyze genetic variation
from the project. As illustrated in Figure 1, analyzing genetic variants from
next-generation sequencing projects typically involves four steps, namely
importing, annotating, filtering and reporting.
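The idea of a master variant table with added fields can be sketched with Python's built-in sqlite3 module. This is a toy illustration under stated assumptions, not the actual variant tools schema: the table names (`variant`, `genotype`), the column names, the `num_samples` field and the `import_sample` helper are all hypothetical, and the variant calls are invented for the example.

```python
import sqlite3

# In-memory database standing in for a project file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variant (
        variant_id INTEGER PRIMARY KEY,
        chr TEXT NOT NULL, pos INTEGER NOT NULL,
        ref TEXT NOT NULL, alt TEXT NOT NULL,
        UNIQUE (chr, pos, ref, alt)
    );
    CREATE TABLE genotype (
        sample_name TEXT NOT NULL,
        variant_id INTEGER NOT NULL REFERENCES variant (variant_id)
    );
""")


def import_sample(sample_name, calls):
    """Add a sample's calls; each distinct variant is stored once in the master table."""
    for call in calls:
        conn.execute(
            "INSERT OR IGNORE INTO variant (chr, pos, ref, alt) VALUES (?, ?, ?, ?)",
            call,
        )
        (variant_id,) = conn.execute(
            "SELECT variant_id FROM variant"
            " WHERE chr = ? AND pos = ? AND ref = ? AND alt = ?",
            call,
        ).fetchone()
        conn.execute("INSERT INTO genotype VALUES (?, ?)", (sample_name, variant_id))


# Step 1, import: two toy samples that share one variant.
import_sample("case1", [("1", 1000, "A", "G"), ("2", 500, "C", "T")])
import_sample("ctrl1", [("1", 1000, "A", "G")])

# Step 2, annotate: add a field holding a simple sample statistic.
conn.execute("ALTER TABLE variant ADD COLUMN num_samples INTEGER")
conn.execute("""
    UPDATE variant SET num_samples = (
        SELECT COUNT(*) FROM genotype
        WHERE genotype.variant_id = variant.variant_id
    )
""")

# Step 3, filter: select variants using the new field.
shared = conn.execute(
    "SELECT chr, pos, ref, alt, num_samples FROM variant"
    " WHERE num_samples >= 2 ORDER BY chr, pos"
).fetchall()

# Step 4, report: write the selected variants as tab-delimited text.
for row in shared:
    print("\t".join(str(value) for value in row))
```

Storing each distinct variant once and linking samples to it keeps fields such as `num_samples` attached to the variant itself, so the same selection logic applies whether a field comes from sample statistics or from an external annotation source.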
(1) Sample and variant import: variant tools accommodates a variety
of variant file formats. It supports import of VCF files or other tab-
delimited formats such as intermediate output from ANNOVAR or
BEDTools. It is capable of annotating and reporting on all types of
variants, including indels, as long as annotation sources are available.
The toolset also supports annotation and reporting of project variants
using multiple genomic builds, by automatically downloading and
integrating the UCSC liftOver tool (Hinrichs et al., 2006). As an
example, if variants are imported to a project using build hg18, they
can be annotated using annotation sources designed for build hg19,
and exported based on either hg18 or hg19 coordinates.
(2) Annotation: variant tools can incorporate databases that annotate
individual variants or genomic regions, such as genes or
pathways. A growing number of annotation sources such as dbNSFP
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: email@example.com