Kaviar: an accessible system for testing SNV novelty.
ABSTRACT With the rapidly expanding availability of data from personal genomes, exomes and transcriptomes, medical researchers will frequently need to test whether observed genomic variants are novel or known. This task requires downloading and handling large and diverse datasets from a variety of sources, and processing them with bioinformatics tools and pipelines. Alternatively, researchers can upload data to online tools, which may conflict with privacy requirements. We present here Kaviar, a tool that greatly simplifies the assessment of novel variants. Kaviar includes: (i) an integrated and growing database of genomic variation from diverse sources, including over 55 million variants from personal genomes, family genomes, transcriptomes, SNV databases and population surveys; and (ii) software for querying the database efficiently.
© Oxford University Press 2005
KAVIAR: an accessible system for testing SNV novelty
Gustavo Glusman*, Juan Caballero, Denise Mauldin, Leroy Hood and Jared C. Roach
Institute for Systems Biology, 401 Terry Ave N, Seattle, WA 98109, USA.
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
Summary: With the rapidly expanding availability of data from per-
sonal genomes, exomes and transcriptomes, medical researchers
will frequently need to test whether observed genomic variants are
novel or known. This task requires downloading and handling large
and diverse data sets from a variety of sources, and processing
them with bioinformatics tools and pipelines. Alternatively, research-
ers can upload data to online tools, which may conflict with privacy
requirements. We present here Kaviar, a tool that greatly simplifies
the assessment of novel variants. Kaviar includes a) an integrated
and growing database of genomic variation from diverse sources,
including over 53 million variants from personal genomes, family
genomes, transcriptomes, SNP databases and population surveys,
and b) software for querying the database efficiently.
Availability: Kaviar is programmed in Perl and offered free of
charge as Open Source Software. Kaviar may be used online, as a
programmatic web service, or downloaded for local use, from
http://db.systemsbiology.net/kaviar. The database is also provided.
The advent of personalized systems medicine (Auffray et al.,
2009) will be predicated on the availability of precise genomic
information for each patient. Each individual’s unique pattern of
genomic variations will be gleaned by genotyping of known vari-
ants, by exome and transcriptome sequencing, and through whole-
genome sequencing. A fraction of the personal variants observed
via sequencing are false positive artifacts of random sequencing
error, which may confound the search for disease-causing muta-
tions. When family genomes are available, over half of the errors
may be identified by inheritance state analysis (Roach et al., 2010).
Statistically, if a sequence variant has already been observed in a
population survey or in another personal genome, it is currently
reasonable to conclude it is a real observed variant and not an artifact.
Given the list of variations observed in a patient’s genome,
one of the first analytical tasks is thus to determine, for each vari-
ant, whether it has already been observed in the human population.
Depending on the type and extent of sequencing done, a physician
or medical researcher may need to test from just a few, to poten-
tially a very large number of observed genome variants, most of
which are Single Nucleotide Variants (SNVs).
*To whom correspondence should be addressed.
dbSNP is a freely available, periodically updated general catalog
of genome variation (Sherry et al., 2001). The recent explosion of
genomic sequencing projects has identified vast numbers of variants
that have not yet been incorporated into dbSNP, a trend expected to
grow as genomic sequencing becomes more economical.
Researchers therefore need to compare observed SNVs with
data from several different sources to ascertain whether a variant is
novel or known, and to determine the population frequencies of the
alleles. Access to these data sets is offered via various web inter-
faces, or as voluminous downloadable files. These files come in a
variety of formats and may specify coordinates relative to different
genome versions, thus requiring significant processing.
Importantly, the personal nature of the data may impose signifi-
cant restrictions on the methods used for studying them: Institu-
tional Review Board (IRB) and other human-subjects protocols
may have specified that the data should be kept within the institu-
tion’s intranet, or even limited to a specific machine, with tight
controls on data access. In such cases, the use of web applications to
study personal data may be restricted or entirely disallowed, leaving
researchers with the single option of downloading huge files,
developing bioinformatics pipelines to use them, and keeping
the entire system up to date as more data are produced. This effort is
being duplicated in laboratories worldwide.
We have created Kaviar (“Known VARiants”), a compilation of
human SNVs collected from many and diverse sources (Table 1),
stressing accessibility and ease of use. Kaviar answers a very spe-
cific question: What variants have been reported already for a giv-
en specific genomic location? For each SNV in a query, Kaviar
reports the known variants and their source (e.g. in which popula-
tion or individual genomes it was observed), or the fact that no
variants are known. Where available, dbSNP identifiers are displayed.
Output formats include tabular HTML annotated with relevant [...]
(JSON). The database encodes SNV positions, identities, and
sources. SNVs are identified by their genomic location using
standard hg18 (NCBI Build 36) or hg19 (GRCh37) coordinates.
The Kaviar database is compact, at 3.44% of the size of the equivalent
representation in GVF (Genome Variation Format; Reese et al., 2010),
and 23.8% of the size of the gzipped GVF.
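The position-keyed query described above can be illustrated with a minimal Python sketch. This is not Kaviar's implementation (Kaviar is written in Perl), and it does not reflect Kaviar's storage format. The in-memory dictionary, the function names `lookup` and `classify`, the coordinates and the source labels are all hypothetical, chosen only to show a query that returns the known variants and their sources for a location, or reports that none are known.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory index (an assumption, not Kaviar's format):
# (genome build, chromosome, 1-based position) -> {allele: [sources]}.
# All coordinates and source labels below are invented for illustration.
KnownVariants = Dict[Tuple[str, str, int], Dict[str, List[str]]]

INDEX: KnownVariants = {
    ("hg19", "chr1", 1000): {"A": ["dbSNP", "1000 Genomes"]},
    ("hg19", "chr2", 2000): {"T": ["personal genome X"], "G": ["dbSNP"]},
}


def lookup(index: KnownVariants, build: str, chrom: str, pos: int) -> Dict[str, List[str]]:
    """Return {allele: sources} known at this location, or {} if none."""
    return index.get((build, chrom, pos), {})


def classify(index: KnownVariants, build: str, chrom: str, pos: int, allele: str) -> str:
    """Label an observed allele as 'known' or 'novel' at this location."""
    return "known" if allele in lookup(index, build, chrom, pos) else "novel"


if __name__ == "__main__":
    queries = [
        ("hg19", "chr1", 1000, "A"),  # allele already reported here
        ("hg19", "chr1", 1000, "C"),  # same location, unreported allele
        ("hg18", "chr1", 1000, "A"),  # other build: coordinates differ
    ]
    for build, chrom, pos, allele in queries:
        print(build, chrom, pos, allele,
              classify(INDEX, build, chrom, pos, allele),
              lookup(INDEX, build, chrom, pos))
```

Including the genome build in the key is a design choice of this sketch: it keeps hg18 and hg19 coordinates from being mixed, since the same position number can refer to different bases in the two builds.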
Table 1 summarizes the various data sources represented in
Kaviar’s database as of June 21st, 2011. Of the current 53,675,039
SNVs, 27,899,114 (52%) lack dbSNP ids. The largest contributors
of novel SNVs are the 1000 Genomes Project (1000 Genomes Project
Consortium, 2010) and the 69 Genome Data Set from Complete
Genomics, Inc., particularly in genomes of African descent.
A concern is that vast numbers of sequencing errors will be incor-
porated into variation databases. We observe that 5,663,976 SNVs
listed in dbSNP are not yet confirmed by any other source. Track-
ing of the provenance of each variant to its source individual ge-
nome(s) will mitigate the accumulation of errors, by facilitating the
selection of SNVs confirmed by independent observation.
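As a hedged illustration of how provenance tracking supports this kind of selection, the following Python sketch keeps only SNVs reported by at least two distinct sources, and recomputes the fraction of SNVs lacking dbSNP ids from the two totals quoted above. The data structure, the function names and the example records are assumptions made for this sketch; only the two totals come from the text.

```python
from typing import Dict, Set


def independently_confirmed(sources_by_snv: Dict[str, Set[str]],
                            min_sources: int = 2) -> Set[str]:
    """Return the SNVs reported by at least `min_sources` distinct sources."""
    return {snv for snv, sources in sources_by_snv.items()
            if len(sources) >= min_sources}


def novel_fraction(total_snvs: int, lacking_dbsnp_id: int) -> float:
    """Fraction of SNVs that lack a dbSNP id."""
    return lacking_dbsnp_id / total_snvs


# Invented example records: SNV label -> set of sources reporting it.
EXAMPLE = {
    "chr1:1000:A": {"dbSNP", "1000 Genomes"},
    "chr2:2000:T": {"dbSNP"},                 # not confirmed elsewhere
    "chr3:3000:G": {"personal genome X", "69 Genome Set"},
}

if __name__ == "__main__":
    print(sorted(independently_confirmed(EXAMPLE)))
    # Totals quoted in the text: 53,675,039 SNVs, 27,899,114 without dbSNP ids.
    print(f"{novel_fraction(53_675_039, 27_899_114):.0%}")
```

The threshold of two sources is an arbitrary default for the sketch; a stricter analysis could raise `min_sources` or require sources of different types.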
A variety of tools and services exist for collecting and annotat-
ing genome variations, including Varietas (Paananen et al., 2010),
SeqAnt (Shetty et al., 2010), ANNOVAR (Wang et al., 2010),
SVA (Pelak et al., 2010) and ENGINES (Amigo et al., 2011). Most
of these tools focus on functional annotation of variants and offer
only web-based interfaces, or downloadable packages with demanding
technical requirements. Some report only dbSNP data. To our
knowledge, Kaviar offers the widest range of different SNV
sources integrated into one package. Kaviar is readily incorporated
into automated workflows for genome analysis, and its results can be
passed to downstream tools such as MAGMA (Hubley et al., 2003).
Kaviar output can be used to estimate the population
frequency of alleles (e.g., MAF). Allele frequencies can be used by
many algorithms, including those that integrate data across highly
linked SNPs (Roach et al., 2006).
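For readers unfamiliar with the term, a minor allele frequency (MAF) can be computed from per-allele observation counts as in the Python sketch below. The text does not specify how such counts are derived from Kaviar output, so the input mapping and the function name are assumptions. The sketch takes the MAF to be the frequency of the second most common allele at a site, which is one common convention.

```python
from typing import Mapping


def minor_allele_frequency(allele_counts: Mapping[str, int]) -> float:
    """Frequency of the second most common allele at a site.

    `allele_counts` maps each allele to the number of chromosomes
    carrying it. A site with a single observed allele has MAF 0.0.
    """
    total = sum(allele_counts.values())
    if total <= 0:
        raise ValueError("no allele observations at this site")
    ranked = sorted(allele_counts.values(), reverse=True)
    if len(ranked) < 2:
        return 0.0
    return ranked[1] / total


if __name__ == "__main__":
    # Invented counts for a biallelic site: 90 chromosomes carry A, 10 carry G.
    print(minor_allele_frequency({"A": 90, "G": 10}))
```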
While most of the data sources in Kaviar are public, local instal-
lations can be configured to include data particular to an institu-
tion’s IRB protections and not made visible to outside researchers.
Alternatively, aggregate anonymized data can also be reported.
Indeed, the current distribution includes such aggregate data from
44 unrelated individuals of diverse origins, which were sequenced
by the Institute for Systems Biology.
We expect Kaviar to be of use to the casual researcher interested
in determining the novelty of sets of observed SNVs, to the grow-
ing number of people with their own personal genome information,
and to researchers studying genome-wide personal variation re-
quiring strict confidentiality.
We appreciate technical support, comments and discussion by
Robert Hubley, Lee Rowen and Chris Witwer. All human subjects
protocols at Institute for Systems Biology were reviewed by the
Western Institutional Review Board.
Funding: This work was supported by National Institutes of Health
[RO1GM081083 to G.G.] and by the University of Luxembourg –
Institute for Systems Biology Program.
Conflict of Interest: none declared.
Table 1. Summary of Kaviar sources as of June 21st, 2011 (hg19 version)
Data set   SNVs   Unique (a)   Novel (b)   References
Personal genomes 9,676,945
69 Genome Set (e)
ISB 44 genomes (f)
0% Sherry 2001
1.6% Schuster 2010
28.9% Li 2010
63.2% Wang 2008, Blekhman
10.2% Rasmussen 2010
68.4% Green 2010
35.2% Reich 2010
5 humans 2,599,325
(a) Unique SNVs are those observed solely in that source, but possibly in more than one individual represented by that source.
(b) Novel SNVs are those lacking dbSNP ids.
(c) Nine independently published individual genomes including J.C. Venter, J. Watson, S. Quake, S.J. Kim, G. Lucier and anonymous Chinese, Yoruban and Irish individuals.
(d) Personal Genome Project.
(e) Panel of 69 genomes released by Complete Genomics, Inc.
(f) Panel of 44 unrelated genomes sequenced in Institute for Systems Biology projects. Only SNVs observed in three or more individuals are reported.
(g) SNVs identified from RNA-seq data, requiring at least 8x coverage and at least 4 reads supporting the variant.
(h) Neanderthal Genome Project, including both the Neanderthal genome and five modern humans.
(i) Denisovan Genome Project, including both the Denisovan genome and seven modern humans.
The full list of sources, read mapping methods and variant calling methods are given in Supplemental Material and online.
1000 Genomes Project Consortium (2010) A map of human genome variation from
population-scale sequencing. Nature 467(7319): 1061-1073.
Amigo, J., Salas, A. and Phillips, C. (2011) ENGINES: exploring single nucleotide
variation in entire human genomes. BMC Bioinformatics 12: 105.
Auffray, C., Chen, Z. and Hood, L. (2009) Systems medicine: the future of medical
genomics and healthcare. Genome Med 1: 2.
Blekhman, R. et al. (2010) Sex-specific and lineage-specific alternative splicing in
primates. Genome Res. 20: 180-189.
Green, R.E. et al. (2010) A draft sequence of the Neanderthal genome. Science 328:
Hubley, R.M., Zitzler, E. and Roach, J.C. (2003) Evolutionary algorithms for the
selection of single nucleotide polymorphisms. BMC Bioinformatics 4(1): 30.
Li, Y. et al. (2010) Resequencing of 200 human exomes identifies an excess of low-
frequency non-synonymous coding variants. Nature Genetics 42: 969-972.
Paananen, J., Ciszek, R. and Wong, G. (2010) Varietas: a functional variation data-
base portal. Database 2010: baq016.
Pelak, K. et al. (2010) The characterization of twenty sequenced human genomes.
PLoS Genetics 6(9): e1001111.
Rasmussen, M. et al. (2010) Ancient human genome sequence of an extinct palaeo-
eskimo. Nature 463: 757-762.
Reese, M.G. et al. (2010) A standard variation file format for human genome se-
quences. Genome Biology 11: R88.
Roach, J.C. et al. (2006) Genetic mapping at 3-kilobase resolution reveals inositol
1,4,5-triphosphate receptor 3 as a risk factor for type 1 diabetes in Sweden.
American Journal of Human Genetics 79(4): 614-627.
Roach, J.C. et al. (2010) Analysis of genetic inheritance in a family quartet by
whole-genome sequencing. Science 328(5978): 636-639.
Schuster, S.C. et al. (2010) Complete Khoisan and Bantu genomes from southern
Africa. Nature 463(7283): 943-947.
Sherry, S.T. et al. (2001) dbSNP: the NCBI database of genetic variation. Nucl Acids
Res 29(1): 308-311.
Shetty, A.C. et al. (2010) SeqAnt: A web service to rapidly identify and annotate
DNA sequence variations. BMC Bioinformatics 11(1): 471.
Wang, E.T. et al. (2008) Alternative isoform regulation in human tissue transcrip-
tomes. Nature 456(7221): 470-476.
Wang, K., Li, M. and Hakonarson, H. (2010) ANNOVAR: functional annotation of
genetic variants from high-throughput sequencing data. Nucl Acids Res 38(16):