© Oxford University Press 2005
KAVIAR: an accessible system for testing SNV novelty
Gustavo Glusman*, Juan Caballero, Denise Mauldin, Leroy Hood and Jared C. Roach
Institute for Systems Biology, 401 Terry Ave N, Seattle, WA 98109, USA.
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
Summary: With the rapidly expanding availability of data from per-
sonal genomes, exomes and transcriptomes, medical researchers
will frequently need to test whether observed genomic variants are
novel or known. This task requires downloading and handling large
and diverse data sets from a variety of sources, and processing
them with bioinformatics tools and pipelines. Alternatively, research-
ers can upload data to online tools, which may conflict with privacy
requirements. We present here Kaviar, a tool that greatly simplifies
the assessment of novel variants. Kaviar includes a) an integrated
and growing database of genomic variation from diverse sources,
including over 53 million variants from personal genomes, family
genomes, transcriptomes, SNP databases and population surveys,
and b) software for querying the database efficiently.
Availability: Kaviar is programmed in Perl and offered free of
charge as Open Source Software. Kaviar may be used online, as a
programmatic web service, or downloaded for local use, from
http://db.systemsbiology.net/kaviar. The database is also provided.
The advent of personalized systems medicine (Auffray et al.,
2009) will be predicated on the availability of precise genomic
information for each patient. Each individual’s unique pattern of
genomic variations will be gleaned by genotyping of known vari-
ants, by exome and transcriptome sequencing, and through whole-
genome sequencing. A fraction of the personal variants observed
via sequencing are false positive artifacts of random sequencing
error, which may confound the search for disease-causing muta-
tions. When family genomes are available, over half of the errors
may be identified by inheritance state analysis (Roach et al., 2010).
Statistically, if a sequence variant had already been observed in a
population survey or in another personal genome, it is currently
reasonable to conclude it is a real observed variant and not an arti-
fact. Given the list of variations observed in a patient’s genome,
one of the first analytical tasks is thus to determine, for each vari-
ant, whether it has already been observed in the human population.
Depending on the type and extent of sequencing done, a physician
or medical researcher may need to test anywhere from a few to a
very large number of observed genomic variants, most of
which are Single Nucleotide Variants (SNVs).
*To whom correspondence should be addressed.
dbSNP is a freely available, periodically updated general catalog
of genome variation (Sherry et al., 2001). The recent explosion of
genomic sequencing projects has identified vast numbers of variants
that have not yet been incorporated into dbSNP, a trend expected to
increase as genomic sequencing becomes more economical.
Researchers therefore need to compare observed SNVs with
data from several different sources to ascertain whether a variant is
novel or known, and to determine the population frequencies of the
alleles. Access to these data sets is offered via various web inter-
faces, or as voluminous downloadable files. These files come in a
variety of formats and may specify coordinates relative to different
genome versions, thus requiring significant processing.
Importantly, the personal nature of the data may impose signifi-
cant restrictions on the methods used for studying them: Institu-
tional Review Board (IRB) and other human-subjects protocols
may have specified that the data should be kept within the institu-
tion’s intranet, or even limited to a specific machine, with tight
controls on data access. In this case, the use of web applications to
study personal data may be restricted or entirely disallowed, leaving
researchers with the single option of downloading huge files,
developing bioinformatics pipelines to use them, and keeping
the entire system up to date as more data are produced. This effort is
being replicated in laboratories worldwide.
We have created Kaviar (“Known VARiants”), a compilation of
human SNVs collected from many and diverse sources (Table 1),
stressing accessibility and ease of use. Kaviar answers a very spe-
cific question: What variants have been reported already for a giv-
en specific genomic location? For each SNV in a query, Kaviar
reports the known variants and their source (e.g. in which popula-
tion or individual genomes it was observed), or the fact that no
variants are known. Where available, dbSNP identifiers are dis-
played. Output formats include tabular html annotated with rele-
(JSON). The database encodes SNV positions, identities, and
sources. SNVs are identified by their genomic location using
standard hg18 (NCBI Build 36) or hg19 (GRCh37) coordinates.
The Kaviar database is compact, at 3.44% of the size of the equivalent
representation in GVF (Genome Variation Format, Reese et
al., 2010), and 23.8% of the size of the gzipped GVF.
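
The position-based lookup described above can be illustrated with a minimal sketch. It is not Kaviar's actual implementation or storage format: the class name, method names, source labels and coordinates below are all hypothetical, and a plain in-memory dictionary stands in for the compact database.

```python
from collections import defaultdict


class VariantIndex:
    """Toy in-memory index: (chromosome, position) -> {allele: set of sources}.

    Hypothetical illustration only; not Kaviar's database format.
    """

    def __init__(self):
        self._known = defaultdict(lambda: defaultdict(set))

    def add(self, chrom, pos, allele, source):
        """Record that `allele` was observed at chrom:pos in `source`."""
        self._known[(chrom, pos)][allele].add(source)

    def query(self, chrom, pos):
        """Return {allele: sorted sources}, or {} if no variant is known."""
        hit = self._known.get((chrom, pos))
        if hit is None:
            return {}
        return {allele: sorted(sources) for allele, sources in hit.items()}


# Made-up example records (coordinates and source names are assumptions).
index = VariantIndex()
index.add("chr1", 1000, "A", "dbSNP")
index.add("chr1", 1000, "A", "personal_genome_1")
index.add("chr1", 1000, "T", "population_survey")

known = index.query("chr1", 1000)   # variants reported at this position
novel = index.query("chr1", 2000)   # empty: nothing reported here
```

An empty result corresponds to the case in which no variants are known for the queried location, so an SNV observed there would be a candidate novel variant.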
Table 1 summarizes the various data sources represented in
Kaviar’s database as of June 21st, 2011. Of the current 53,675,039
SNVs, 27,899,114 (52%) lack dbSNP ids. The largest contributors
of novel SNVs are the 1000 Genomes Project (1000 Genomes Project
Consortium, 2010) and the 69 Genome Data Set from Complete
Genomics, Inc., particularly in genomes of African descent.
A concern is that vast numbers of sequencing errors will be incor-
porated into variation databases. We observe that 5,663,976 SNVs
listed in dbSNP are not yet confirmed by any other source. Track-
ing of the provenance of each variant to its source individual ge-
nome(s) will mitigate the accumulation of errors, by facilitating the
selection of SNVs confirmed by independent observation.
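
As a rough sketch of how provenance tracking supports this kind of selection, the snippet below first reproduces the reported fraction of SNVs lacking dbSNP ids from the two counts given above, then keeps only variants seen in at least two distinct sources. The record layout, function name, source labels and the threshold of two are assumptions made for illustration, not Kaviar's interface.

```python
# Counts taken from the text (Kaviar database as of June 21st, 2011).
TOTAL_SNVS = 53_675_039
SNVS_WITHOUT_DBSNP_ID = 27_899_114
novel_percent = round(100 * SNVS_WITHOUT_DBSNP_ID / TOTAL_SNVS)


def confirmed_by_independent_sources(variant_sources, min_sources=2):
    """Keep variants reported by at least `min_sources` distinct sources.

    `variant_sources` maps a (chrom, pos, allele) tuple to a list of source
    names. Both the layout and the default threshold are hypothetical.
    """
    return {
        variant: sources
        for variant, sources in variant_sources.items()
        if len(set(sources)) >= min_sources
    }


# Made-up example records.
records = {
    ("chr1", 1000, "A"): ["dbSNP", "personal_genome_1"],
    ("chr2", 5000, "G"): ["dbSNP"],
    ("chr3", 700, "C"): ["population_survey", "personal_genome_2", "dbSNP"],
}

confirmed = confirmed_by_independent_sources(records)
single_source = sorted(set(records) - set(confirmed))
```

In this toy data set the variant seen only in dbSNP ends up in `single_source`, mirroring the dbSNP entries not yet confirmed by any other source.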
A variety of tools and services exist for collecting and annotat-
ing genome variations, including Varietas (Paananen et al., 2010),
SeqAnt (Shetty et al., 2010), ANNOVAR (Wang et al., 2010),
SVA (Pelak et al., 2010) and ENGINES (Amigo et al., 2011). Most
of these tools focus on functional annotation of variants and
offer only web-based interfaces or downloadable packages with
demanding technical requirements. Some report only dbSNP data. To our
knowledge, Kaviar offers the widest range of different SNV
sources integrated into one package. Kaviar is easily incorporated
into automated workflows for genome analysis, and results are
easily incorporated into downstream tools such as MAGMA (Hub-
ley et al., 2003). Kaviar output can be used to judge the population
frequency of alleles (e.g., MAF). Allele frequencies can be used by
many algorithms, including those that integrate data across highly
linked SNPs (Roach et al., 2006).
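
For illustration, a minor allele frequency (MAF) can be derived from per-allele observation counts as sketched below. The counts are invented and the function is hypothetical; how such counts would be extracted from Kaviar output is not specified here. The sketch follows the common convention that the MAF is the frequency of the second most common allele.

```python
def minor_allele_frequency(allele_counts):
    """Return the frequency of the second most common allele.

    `allele_counts` maps each allele (e.g. "A") to the number of chromosomes
    on which it was observed. A site with a single allele has MAF 0.0.
    """
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no allele observations")
    if len(allele_counts) < 2:
        return 0.0
    frequencies = sorted(
        (count / total for count in allele_counts.values()), reverse=True
    )
    return frequencies[1]


# Made-up counts: 150 chromosomes carry A, 50 carry G.
maf = minor_allele_frequency({"A": 150, "G": 50})
monomorphic = minor_allele_frequency({"A": 10})
```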
While most of the data sources in Kaviar are public, local instal-
lations can be configured to include data particular to an institu-
tion’s IRB protections and not made visible to outside researchers.
Alternatively, aggregate anonymized data can also be reported.
Indeed, the current distribution includes such aggregate data from
44 unrelated individuals of diverse origins, which were sequenced
by the Institute for Systems Biology.
We expect Kaviar to be of use to the casual researcher interested
in determining the novelty of sets of observed SNVs, to the grow-
ing number of people with their own personal genome information,
and to researchers studying genome-wide personal variation re-
quiring strict confidentiality.
We appreciate technical support, comments and discussion by
Robert Hubley, Lee Rowen and Chris Witwer. All human subjects
protocols at the Institute for Systems Biology were reviewed by the
Western Institutional Review Board.
Funding: This work was supported by National Institutes of Health
[RO1GM081083 to G.G.] and by the University of Luxembourg –
Institute for Systems Biology Program.
Conflict of Interest: none declared.
1000 Genomes Project Consortium (2010) A map of human genome variation from
population-scale sequencing. Nature 467(7319): 1061-1073.
Amigo, J., Salas, A. and Phillips, C. (2011) ENGINES: exploring single nucleotide
variation in entire human genomes. BMC Bioinformatics 12: 105.
Table 1. Summary of Kaviar sources as of June 21st, 2011 (hg19 version)
Data set SNVs Uniquea Novelb References
Personal genomes 9,676,945
69 Genome Sete
ISB 44 genomesf
0% Sherry 2001
1.6% Schuster 2010
28.9% Li 2010
63.2% Wang 2008, Blekhman
10.2% Rasmussen 2010
68.4% Green 2010
35.2% Reich 2010
5 humans 2,599,325
aUnique SNVs are those observed solely in that source, but possibly in more than one
individual represented by that source. bNovel SNVs are those lacking dbSNP ids.
cNine independently published individual genomes including J.C. Venter, J. Watson,
S. Quake, S.J. Kim, G. Lucier and anonymous Chinese, Yoruban and Irish individuals.
dPersonal Genome Project. ePanel of 69 genomes released by Complete Genomics,
Inc. fPanel of 44 unrelated genomes sequenced in Institute for Systems Biology pro-
jects. Only SNVs observed in three or more individuals are reported. gSNVs identified
from RNA-seq data, requiring at least 8x coverage and at least 4 reads supporting the
variant. hNeanderthal Genome Project, including both the Neanderthal genome and
five modern humans. iDenisovan Genome Project, including both the Denisovan
genome and seven modern humans. The full list of sources, read mapping methods
and variant calling methods are given in Supplemental Material and online.
Auffray, C., Chen, Z. and Hood, L. (2009) Systems medicine: the future of medical
genomics and healthcare. Genome Med 1: 2.
Blekhman, R. et al. (2010) Sex-specific and lineage-specific alternative splicing in
primates. Genome Res. 20: 180-189.
Green, R.E. et al. (2010) A draft sequence of the Neanderthal genome. Science 328:
Hubley, R. M., Zitzler, E., and Roach, J. C. (2003) Evolutionary algorithms for the
selection of single nucleotide polymorphisms. BMC Bioinformatics 4(1): 30.
Li, Y. et al. (2010) Resequencing of 200 human exomes identifies an excess of low-
frequency non-synonymous coding variants. Nature Genetics 42: 969-972.
Paananen, J., Ciszek, R. and Wong, G. (2010) Varietas: a functional variation data-
base portal. Database 2010: baq016.
Pelak, K. et al. (2010) The characterization of twenty sequenced human genomes.
PLoS Genetics 6(9): e1001111.
Rasmussen, M. et al. (2010) Ancient human genome sequence of an extinct palaeo-
eskimo. Nature 463: 757-762.
Reese, M.G. et al. (2010) A standard variation file format for human genome se-
quences. Genome Biology 11: R88.
Roach, J.C. et al. (2006) Genetic mapping at 3-kilobase resolution reveals inositol
1,4,5-triphosphate receptor 3 as a risk factor for type 1 diabetes in Sweden. Amer-
ican Journal of Human Genetics 79(4): 614-627.
Roach, J.C. et al. (2010) Analysis of genetic inheritance in a family quartet by whole-
genome sequencing. Science 328(5978): 636-639.
Schuster, S.C. et al. (2010) Complete Khoisan and Bantu genomes from southern
Africa. Nature 463(7283): 943-947.
Sherry, S.T. et al. (2001) dbSNP: the NCBI database of genetic variation. Nucl Acids
Res 29(1): 308-311.
Shetty, A.C. et al. (2010) SeqAnt: A web service to rapidly identify and annotate
DNA sequence variations. BMC Bioinformatics 11(1): 471.
Wang, E.T. et al. (2008) Alternative isoform regulation in human tissue transcrip-
tomes. Nature 456(7221): 470-476.
Wang, K., Li, M. and Hakonarson, H. (2010) ANNOVAR: functional annotation of
genetic variants from high-throughput sequencing data. Nucl Acids Res 38(16):