Large Numbers of Genetic Variants Considered to be
Pathogenic are Common in Asymptomatic Individuals
Christopher A. Cassa,1,2,3,4∗ Mark Y. Tong,5 and Daniel M. Jordan1,2,6
1 Brigham and Women’s Hospital, Division of Genetics, Boston, Massachusetts; 2 Division of Genetics, Harvard Medical School, Boston, Massachusetts; 3 Massachusetts Institute of Technology, Cambridge, Massachusetts; 4 Broad Institute of Harvard and MIT, Cambridge, Massachusetts; 5 Center for Biomedical Informatics, Harvard Medical School, Boston, Massachusetts; 6 Program in Biophysics, Harvard University, Cambridge, Massachusetts
Communicated by Sean Tavtigian
Received 24 April 2013; accepted revised manuscript 20 June 2013.
Published online DD MM YYYY in Wiley Online Library (www.wiley.com/humanmutation). DOI: 10.1002/humu.22375
ABSTRACT: It is now affordable to order clinically interpreted whole-genome sequence reports from clinical laboratories. One major component of these reports is derived from the knowledge base of previously identified pathogenic variants, including research articles, locus-specific databases, and other databases. While over 150,000 such pathogenic variants have been identified, many of these were originally discovered in small cohort studies of affected individuals, so their applicability to asymptomatic populations is unclear. We analyzed the prevalence of a large set of pathogenic variants from the medical and scientific literature in a large set of asymptomatic individuals (N = 1,092) and found 8.5% of these pathogenic variants in at least one individual. In the average individual in the 1000 Genomes Project, previously identified pathogenic variants occur 294 times (σ = 25.5) in homozygous form and 942 times (σ = 68.2) in heterozygous form. We also find that many of these pathogenic variants are frequently occurring: there are 3,744 variants with minor allele frequency (MAF) ≥ 0.01 (4.6%) and 2,837 variants with MAF ≥ 0.05 (3.5%). This indicates that many of these variants may be erroneous findings or have lower penetrance than previously expected.
Hum Mutat 00:1–8, 2013. © 2013 Wiley Periodicals, Inc.
KEY WORDS: whole genome sequencing; WGS; personal-
ized medicine; incidental findings; incidentalome
It is now possible to order clinically interpreted whole-genome
sequences (WGS) from clinical laboratories [GenomeWeb, 2011a;
These data have the potential to improve medical care, but the
methods to translate genomic variation into accurate clinical in-
terpretations remain to be defined, particularly for asymptomatic
∗Correspondence to: Christopher A. Cassa, 77 Avenue Louis Pasteur, Boston, MA
02115. E-mail: email@example.com
Contract grant sponsors: NHGRI (HG007229); NIGMS (GM078598).
address both novel variation that is likely to be pathogenic as well
as over 150,000 variants with implied—but unconfirmed—disease
associations that have been reported in the medical and scientific
literature [Stenson et al., 2012], locus-specific databases [Vihinen
et al., 2012], researcher submissions [Yu et al., 2008], clinical ge-
netics practice [McKusick-Nathans Institute of Genetic Medicine;
NCBI, 2012a], and genome-wide association studies (NHGRI).
Many of these variants were identified in symptomatic popula-
tions and may be erroneously associated with disease due to small
ascertainment bias is that the probability of observing a particular
the probability of developing the disease given the presence of each
variant. Some variants may also be incompletely penetrant or po-
these associations, especially those reported before the completion
of the Human Genome Project, are limited in applicability because
of potential inconsistencies with our current standards for genomic
coordinates, nomenclature, and gene structure [Tong et al., 2011].
of disease risk for asymptomatic individuals, creating a major bot-
tleneck in clinical application of WGS.
In a recent study, we estimated that 10.6% of variants, genome
wide, have sufficient clinical relevance and scientific validity for in-
vestigators to share them with research participants [Cassa et al.,
2012]. This estimate specifies that there are over 12,000 variants
that are appropriate to review and report, linked to both common
and rare disease, and adverse drug response [Kohane et al., 2012].
ing clinical suspicion, it is difficult to determine whether these are
tients and cause needless diagnostic workups and costly screenings
[Fabsitz et al., 2010; Tong et al., 2011].
be filtered or annotated before they reach a clinical WGS report.
But, just how many of these variants are sufficiently prevalent that
we would be ill-advised to counsel an asymptomatic carrier about
such an association?
To answer this question, we combined the data from the largest
pathogenic variant database, the Human Gene Mutation Database
Table 1. List of HGMD Classifications
Classification: Description
Disease-associated polymorphism: A polymorphism reported to be in significant association with a disease/phenotype (P < 0.05) that is assumed to be functional (e.g., as a consequence of location, evolutionary conservation, replication studies, etc.), although there may as yet be no direct evidence (e.g., from an expression study) of function.
Disease-associated polymorphism with additional supporting functional evidence: A polymorphism reported to be in significant association with disease (P < 0.05) that has evidence of being of direct functional importance (e.g., as a consequence of altered expression, mRNA studies, etc.).
In vitro/laboratory or in vivo functional polymorphism: A polymorphism reported to affect the structure, function, or expression of the gene (or gene product), but with no disease association reported as yet.
Frameshift or truncating variant: A polymorphic or rare variant reported in the literature (e.g., detected in the process of whole-genome or whole-exome screening) that is predicted to truncate or otherwise alter the gene product (i.e., a nonsense or frameshift variant) but with no disease association reported as yet. Please note that any variant affecting the obligate donor/acceptor splice site of a gene will not be included in this category unless there is evidence for an effect on the splicing phenotype. Variants occurring in pseudogenes will also be excluded unless evidence for a functional effect is present for both the pseudogene [Balakirev and Ayala, 2003] and the variant in question.
Disease-causing mutation: Pathological mutation reported to be disease causing in the corresponding report (i.e., all other HGMD data).
HGMD classifies variant reports in one of the five major categories. Source: HGMD.
(HGMD), with data from the largest publicly available source
of WGS data from asymptomatic individuals, the 1000 Genomes
Project (TGP). We analyzed the prevalence of HGMD variants by
Materials and Methods
Set of Variants and WGS Data
We included all single nucleotide substitution variants with ge-
nomic coordinate and reference/alternate allele information avail-
able in HGMD version 2012.2 [Stenson et al., 2009] (N = 81,432
variants). We used publicly available WGS call data from the TGP
(N = 1,092 individuals) (1000 Genomes Phase 1, Version 3, release
date April 30, 2012) [Consortium, 2012], as the set of genomic
Whole-Genome Analysis of Asymptomatic Individuals for
Variant Allele Frequency and Count Data
In each asymptomatic WGS, we checked for the presence of
each HGMD pathogenic variant. Among the 81,432 variants, we
found 6,917 variants present in at least one individual. We recorded
mozygous minor form, and calculated the maximum, minimum,
variant, we calculated the minor allele frequency (MAF) using the
maximum likelihood estimate from the observed variants in the
TGP, which is the number of alternate alleles divided by the to-
tal number of alleles. For 717 variants, the allele frequency of the
nonreference allele was above 50%; for these alleles, we treated the
reference allele as the minor allele, so all minor allele frequencies
were below 50%.
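The frequency calculation described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline used in the study: the function name is hypothetical, and it assumes a biallelic autosomal site at which every one of the 1,092 individuals has a called diploid genotype.

```python
def minor_allele_frequency(alt_allele_count: int, total_alleles: int) -> float:
    """Return the folded (minor) allele frequency for one biallelic site."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    if not 0 <= alt_allele_count <= total_alleles:
        raise ValueError("alt_allele_count must lie between 0 and total_alleles")
    # Maximum likelihood estimate: alternate alleles divided by all alleles.
    alt_freq = alt_allele_count / total_alleles
    # When the nonreference allele is the common one, the reference allele
    # is treated as the minor allele, so the result never exceeds 0.5.
    return min(alt_freq, 1.0 - alt_freq)


# Assumption: 1,092 diploid individuals contribute 2 * 1,092 alleles per site.
TOTAL_ALLELES = 2 * 1092
```

Under these assumptions, 22 alternate alleles among 2,184 give an MAF of about 0.010, and a site with 2,000 alternate alleles is folded to 184 / 2,184.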
Analysis of Variant Pathogenicity Classifications and
Amino Acid Changes
We analyzed the pathogenicity classifications for each variant we
observed in TGP. We reviewed the HGMD reported classification
in Table 1, and those calculated by PolyPhen 2 [Adzhubei et al.,
2013]. For each HGMD variant classification, we generated the
using TGP data. For PolyPhen 2 scores, we plotted each variant’s
score by variant MAF bin. We also grouped all variants in HGMD
into four major categories of amino acid changes: synonymous,
missense, nonsense, or none/other, where the associated variant is
noncoding or no information is provided. For each amino acid
category, we generated the population frequency distribution of
observed variants in that class, using TGP data.
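A population frequency distribution per category amounts to counting variants by category and MAF bin. The sketch below shows one way to do that; the bin edges are hypothetical (the text reports thresholds of 0.01, 0.03, 0.05, and 0.3 but does not list the bins used for the figures), and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical bin edges; 0.1 is an added illustrative boundary.
MAF_BIN_EDGES = [0.0, 0.01, 0.03, 0.05, 0.1, 0.3, 0.5]


def maf_bin(maf):
    """Return the (low, high) bin containing maf; the top bin includes 0.5."""
    top = MAF_BIN_EDGES[-1]
    for low, high in zip(MAF_BIN_EDGES, MAF_BIN_EDGES[1:]):
        if low <= maf < high or (high == top and maf == top):
            return (low, high)
    raise ValueError(f"MAF {maf} is outside the range 0 to 0.5")


def frequency_distribution(variants):
    """Count variants per (category, MAF bin).

    variants: iterable of (category, maf) pairs, where category is a label
    such as "missense" or an HGMD classification.
    """
    counts = Counter()
    for category, maf in variants:
        counts[(category, maf_bin(maf))] += 1
    return counts
```

The same counting works for either grouping described above, since only the category label changes.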
We identified a total of 6,917 of these variants in at least one
individual in the TGP (8.5% of HGMD variants in this study).
The number of HGMD variants identified in each individual is
graphed for both homozygous minor and heterozygous form in
Figure 1. We found that individuals in the TGP had an average of
294 (σ = 25.5) variants in homozygous form and 942 (σ = 68.2)
variants in heterozygous form (Table 2).
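The per-individual averages can be reproduced in outline as follows. This is a sketch under stated assumptions: genotypes are encoded as the number of copies of the variant allele (0, 1, or 2) at each HGMD site, the function names are hypothetical, and the population standard deviation is used because the text does not say which estimator was applied.

```python
from statistics import mean, pstdev


def genotype_counts(genotypes):
    """Return (homozygous, heterozygous) variant counts for one individual.

    genotypes: iterable of variant-allele copy numbers per site (0, 1, or 2).
    """
    genotypes = list(genotypes)
    homozygous = sum(1 for g in genotypes if g == 2)
    heterozygous = sum(1 for g in genotypes if g == 1)
    return homozygous, heterozygous


def summarize(cohort):
    """Return ((mean, SD) of homozygous counts, (mean, SD) of heterozygous counts).

    cohort: mapping from individual ID to that individual's genotype list.
    """
    per_person = [genotype_counts(g) for g in cohort.values()]
    hom = [h for h, _ in per_person]
    het = [h for _, h in per_person]
    # Assumption: population SD (pstdev); a sample SD would use statistics.stdev.
    return (mean(hom), pstdev(hom)), (mean(het), pstdev(het))
```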
We found that many of these disease-associated variants are ob-
that there are opportunities to filter and prioritize variants that are
very common and therefore unlikely to be strongly associated with
Figure 1. Frequency distribution of HGMD variants identified in 1000 Genomes individuals. We identified a total of 6,917 HGMD variants in
genotypes as well as heterozygous genotypes.
HUMAN MUTATION, Vol. 00, No. 0, 1–8, 2013
Table 2. Aggregate Results from the Whole-Genome Interpretation of Asymptomatic Individuals from the 1000 Genomes Project
 | Homozygous variant genotypes | Heterozygous genotypes | Homozygous reference genotypes
Number of HGMD 2012.1 variants in each 1000 Genomes Project individual (N = 1,092)
For all HGMD variants identified in this study | 294.4 (σ = 25.5) | 942.3 (σ = 68.2) | 5,680.2 (σ = 53.5)
 | 43.9 (σ = 76.5) | 148.8 (σ = 181.5) | 899.3 (σ = 250.6)
WGS data of asymptomatic individuals in the 1000 Genomes Project were analyzed, using the substitution variants from HGMD. The total number of variants identified in each sequence is reported, along with the subset of those that are homozygous and heterozygous.
Figure 2. Distribution of HGMD variants with MAF > 0.01 in individuals from the 1000 Genomes Project. There are 3,744 HGMD variants with MAF > 0.01, 3,111 variants with MAF > 0.03, and 2,837 variants with MAF > 0.05.
disease. By population frequency, 3,744 variants have MAF≥0.01
(54.1% of the 6,917 observed, 4.6% of all study variants) and 2,837
variants have MAF≥0.05 (41.0% of the 6,917 observed, 3.5% of all
findings or be incompletely penetrant.
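The percentages quoted above follow directly from the reported counts. The short check below recomputes them; the constant and function names are illustrative, while the counts themselves are taken from the text.

```python
STUDY_VARIANTS = 81_432       # HGMD substitution variants included in the study
OBSERVED_VARIANTS = 6_917     # variants found in at least one TGP individual
COMMON_AT_1_PERCENT = 3_744   # observed variants with MAF >= 0.01
COMMON_AT_5_PERCENT = 2_837   # observed variants with MAF >= 0.05


def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


for label, part, whole in [
    ("observed / study", OBSERVED_VARIANTS, STUDY_VARIANTS),
    ("MAF >= 0.01 / observed", COMMON_AT_1_PERCENT, OBSERVED_VARIANTS),
    ("MAF >= 0.01 / study", COMMON_AT_1_PERCENT, STUDY_VARIANTS),
    ("MAF >= 0.05 / observed", COMMON_AT_5_PERCENT, OBSERVED_VARIANTS),
    ("MAF >= 0.05 / study", COMMON_AT_5_PERCENT, STUDY_VARIANTS),
]:
    print(f"{label}: {percent(part, whole)}%")
```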
All observed variants were grouped by their HGMD pathogenic-
ity classification and predicted amino acid change. For each variant
classification type, we graph the frequency distribution of observed
that are classified as polymorphic with MAF ≥ 0.01 in the asymptomatic population (disease-associated polymorphism with additional supporting functional evidence [DFP] = 903, disease-associated polymorphism [DP] = 1,495, functional polymorphism [FP] = 741). However, there are also variants classified as polymorphic that are below a population frequency of 1% (DFP = 62, DP = 151, FP = 342), so these variants previously classified as polymorphic may be worthy of review
variants. Unexpectedly, there are also many variants that are clas-
sified as disease-causing mutations or disease-associated nonsense
mutations (disease mutations [DM] = 583, frameshift or truncat-
ing variant = 32) that are present in asymptomatic individuals with
Figure 3. Distribution of variants with MAF > 0.01 from the 1000 Genomes Project by HGMD classification.
Figure 4. Distribution of variants with MAF > 0.01 from the 1000 Genomes Project by amino acid change.
We graph the distribution of variants by four major categories of predicted amino acid changes in Figure 4. As expected, most HGMD variants identified in this set of asymptomatic individuals were synonymous (275), missense (4,440), or none/other (1,905).
present with MAF≥0.01. Of the missense variants, 42.8% (1,900)
were present with MAF≥0.01.
We computed PolyPhen 2 scores for variants observed in TGP and plotted these by MAF in Figure 5. We observed that the fraction of variants predicted as damaging by PolyPhen 2 decreases as the variant MAF increases, so that the vast majority of high-frequency variants are predicted to be neutral. It is also worth noting that
individuals. We observe that 4.6% of these disease-associated variants are present with sufficient frequency (MAF > 0.01) that they are unlikely to be highly penetrant Mendelian disease variants. Fur-
by PolyPhen 2, suggesting that even many of these rare missense
variants are not necessarily pathogenic.
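The argument that a common variant cannot be highly penetrant for a rare disease can be made concrete with Bayes' rule: P(disease | carrier) = P(carrier | disease) × P(disease) / P(carrier), and since P(carrier | disease) is at most 1, penetrance is bounded by prevalence divided by carrier frequency. The sketch below illustrates this bound. It is not part of the study's analysis; it assumes Hardy-Weinberg proportions and a model in which a single copy confers risk, and any prevalence passed in is a hypothetical input.

```python
def carrier_frequency(maf: float) -> float:
    """Fraction of individuals carrying at least one copy of the variant.

    Assumption: genotypes follow Hardy-Weinberg proportions.
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must lie between 0 and 1")
    return 1.0 - (1.0 - maf) ** 2


def max_penetrance(disease_prevalence: float, maf: float) -> float:
    """Upper bound on P(disease | carrier) implied by Bayes' rule.

    Uses P(carrier | disease) <= 1, so the bound is prevalence divided by
    carrier frequency, capped at 1.
    """
    carriers = carrier_frequency(maf)
    if carriers <= 0.0:
        raise ValueError("variant must have a nonzero frequency")
    return min(1.0, disease_prevalence / carriers)
```

Under these assumptions, a variant with MAF 0.01 and a hypothetical disease prevalence of 1 in 10,000 could have a penetrance of at most about 0.5%.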
We do not intend for this report to be a criticism of HGMD,
but a commentary on the current state of the application of this
knowledge base in WGS interpretation and personalized medicine.
HGMD acknowledges that a substantial number of variants in the
database are polymorphic and are included because of their asso-
ciation with disease [Stenson et al., 2012]. We also identified many
variants that are classified as polymorphic, but that do not ap-
pear frequently in the asymptomatic population, and many others
that are classified as DMs that are quite common. HGMD recently introduced a new category of mutation, “DM?”, which marks a variant previously reported as a DM where the author of the report has indicated that there may be some degree of doubt, or where subsequent evidence in the literature calls the deleterious nature of the variant into question [Biobase/HGMD, 2013].
Current approaches to variant filtering focus on the exclusion of
common variation [Biesecker, 2010] or inclusion of variants with
siderations [Adzhubei et al., 2013; Kumar et al., 2009; Thompson
et al., 2013]. But the variant filtering process comes with complex-
ities; simple frequency-based filters do not exclude all common,
benign variation, and they do not maintain all pathogenic varia-
clinical importance include hereditary hemochromatosis, Factor V
Leiden deficiency, and numerous pharmacogenomics associations
[Klein et al., 2001]. This filtering may be additionally informed by
the population prevalence of disease [Biesecker, 2010; Park et al.,
2009], as well as validated case-control population data [EBI, 2012;
NCBI, 2012a; NCBI, 2012b; NHLBI, 2012], and in silico predic-
tive algorithms that assess variant pathogenicity using functional
and evolutionary significance [Adzhubei et al., 2013; Kumar et al.,
There are several limitations to this study. The variants stud-
ied only represent a subset of the total knowledge base, although
this sample represents an easily accessible subset of variants with
chromosome coordinate data that will likely be evaluated in most
WGS pipelines. While these variants have been previously associ-
ated with disease in the scientific literature, they largely have been
derived from small disease cohorts with limited control popula-
tions such that a reassessment of the evidence for pathogenicity is
We have likely underestimated the number of HGMD variants that are present in asymptomatic individuals generally. In this study, we have estimated the MAF values for disease-associated variants using the maximum likelihood estimate from low coverage data (2010). We expect that there is a great deal of rare variation that was not observed given the number of individuals and low coverage [Keinan and Clark, 2012; Nelson et al., 2012]. In this preliminary study, we did not make important exclusions for bias, such as for the sampling frequency of variants that cause early onset
Interpreted WGS are now available as clinically certified labora-
of patients. Central to the clinical interpretation of these variants
is the development of a standardized methodology to predict the
Figure 5. Distribution of variants PolyPhen 2 versus 1000 Genomes Project MAF. The PolyPhen score is bimodal, with most of the scores
predicted as pathogenic, whereas polymorphisms are mostly, but not
tions decreases with increasing MAF, so that a variant with MAF > 0.3 is far more likely to be predicted benign than pathogenic.
even at low frequencies, PolyPhen 2 predicts a substantial number
of observed variants as benign (52% of the 4,431 total variants for
which PolyPhen 2 predictions could be made, and 40% of 2,580
variants with MAF<0.01).
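Percentages of this kind are fractions of benign predictions within a frequency range. The sketch below shows one way to compute them; the function name and the input layout are hypothetical, and it assumes that any PolyPhen 2 call other than "benign" (that is, "possibly damaging" or "probably damaging") counts as a pathogenic prediction, which may not match the grouping used in the study.

```python
def fraction_benign(variants, keep):
    """Return the fraction of selected variants predicted benign, or None.

    variants: iterable of (maf, prediction) pairs, where prediction is a
        PolyPhen 2 call such as "benign" or "probably damaging".
    keep: predicate on MAF that selects the frequency range of interest.
    """
    selected = [prediction for maf, prediction in variants if keep(maf)]
    if not selected:
        return None  # no variants with predictions in this range
    benign = sum(1 for prediction in selected if prediction == "benign")
    return benign / len(selected)
```

For instance, `fraction_benign(variants, lambda maf: maf < 0.01)` gives the benign fraction among rare variants, and `lambda maf: maf > 0.3` selects the high-frequency range discussed in the text.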
Variants classified as DM in HGMD are mostly, but not over-
whelmingly, predicted as pathogenic (58% of 2,555 DM variants
for which predictions could be made). Polymorphisms are mostly,
but not overwhelmingly, predicted as benign (66% of 360 disease-
disease-associated polymorphisms, and 603 in vitro/in vivo func-
could be made). The number of pathogenic predictions decreases
with increasing MAF, so that a variant with MAF > 0.3 is far more
likely to be predicted benign than pathogenic (84% of 357 variants
observed with MAF > 0.3 for which PolyPhen 2 predictions could
In this study, we demonstrate that a large number of published
disease-associated variants from HGMD are so common that they
are likely to have limited predictive value for asymptomatic indi-
viduals. When one of these common variants is encountered in an
asymptomatic individual, the significance is unclear when the prior
probability of disease is low, and there is no other confirmatory evi-
population data [Kohane et al., 2006]. We believe these findings are
variants in the clinical interpretation of whole-exome and WGS
While there is a substantial percentage (approximately 10.6%) of
previously identified variants that are of sufficient clinical relevance
2012], this study demonstrates that a similar percentage (8.5%) of
these disease-associated variants are present in completely healthy
amounts of DNA to highly complex mixtures using high-density SNP genotyping
microarrays. PLoS Genet 4:e1000167.
Keinan A, Clark AG. 2012. Recent explosive human population growth has resulted in
an excess of rare genetic variants. Science 336:740–743.
Klein TE, Chang JT, Cho MK, Easton KL, Fergerson R, Hewett M, Lin Z, Liu Y,
Liu S, Oliver DE, Rubin DL, Shafa F, Stuart JM, Altman RB. 2001. Integrating
genotype and phenotype information: an overview of the PharmGKB project.
Pharmacogenetics Research Network and Knowledge Base. Pharmacogenomics J
Knome, Inc. Know thyself. Personal Human Genome Sequencing. 2010. A map of
human genome variation from population-scale sequencing. Nature 467:1061–
Kohane IS, Hsing M, Kong SW. 2012. Taxonomizing, sizing, and overcoming the
incidentalome. Genet Med 14:399–404.
Kohane IS, Masys DR, Altman RB. 2006. The incidentalome: a threat to genomic
medicine. JAMA 296:212–215.
Kohane IS, Shendure J. 2012. What’s a genome worth? Sci Transl Med 4:133fs13.
Kumar P, Henikoff S, Ng PC. 2009. Predicting the effects of coding non-synonymous
variants on protein function using the SIFT algorithm. Nat Protoc 4:1073–
McKusick-Nathans Institute of Genetic Medicine, JHUB, MD. Online Mendelian In-
heritance in Man, OMIM
NCBI. 2012a. ClinVar.
NCBI. 2012b. Database of genotypes and phenotypes (dbGaP).
Nelson MR, Wegmann D, Ehm MG, Kessner D, St Jean P, Verzilli C, Shen J, Tang
Z, Bacanu SA, Fraser D, Warren L, Aponte J, et al. 2012. An abundance of rare
functional variants in 202 drug target genes sequenced in 14,002 people. Science
NHGRI. Genome-wide association studies. In: genome.gov, editor.
NHLBI. 2012. NHLBI GO Exome Sequencing Project.
Park J, Lee DS, Christakis NA, Barabasi AL. 2009. The impact of cellular networks on
disease comorbidity. Mol Syst Biol 5:262.
Review T. 2011. Making genome sequencing part of clinical care.
Roberts NJ, Vogelstein JT, Parmigiani G, Kinzler KW, Vogelstein B, Velculescu VE.
2012. The predictive capacity of personal genome sequencing. Sci Transl Med
Stenson PD, Ball EV, Howells K, Phillips AD, Mort M, Cooper DN. 2009. The Human
Gene Mutation Database: providing a comprehensive central mutation database
for molecular diagnostics and personalized genomics. Hum Genom 4:69–72.
Stenson PD, Ball EV, Mort M, Phillips AD, Shaw K, Cooper DN. 2012. The Human
genomics and molecular evolution. Curr Protoc Bioinformatics Chapter 1:Unit1
IA, Li B, Bell R, Feng B, Mooney SD, Radivojac P. 2013. Calibration of multiple
in silico tools for predicting pathogenicity of mismatch repair gene missense
substitutions. Hum Mutat 34:255–265.
Tong MY, Cassa CA, Kohane IS. 2011. Automated validation of genetic variants from
large databases: ensuring that variant references refer to the same genomic loca-
tions. Bioinformatics 27:891–893.
Vihinen M, den Dunnen JT, Dalgleish R, Cotton RG. 2012. Guidelines for establishing
locus specific databases. Hum Mutat 33:298–305.
Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ. 2008. A navigator for human
genome epidemiology. Nat Genet 40:124–125.
pathogenicity of these variants, and to prioritize those that require
disease, the results will undoubtedly uncover risk variants for other
diseases for which a patient has no strong prior probability, and is
essentially “asymptomatic”. Furthermore, even variants with clear
evidence for pathogenicity in a disease cohort may be incompletely
penetrant, making an initial assessment of risk to a healthy individ-
ual difficult. This makes it essential to understand the limitations
of the current knowledge base of genomic variants [Roberts et al.,
rare Mendelian disorders [Kohane and Shendure, 2012].
Our findings demonstrate the limitations of using pathogenic
variant databases in the WGS interpretation of asymptomatic in-
dividuals. These findings have substantial implications for those
that leverage these genomic knowledge bases for use in clinical
sequence interpretation. Microarrays and targeted sequencing are
sequencing will be integrated into clinical care. Issues to address
in future research include decision support systems for prioritiz-
ing large numbers of identified variants, in conjunction with family
history and/or clinical presentation.
Adzhubei I, Jordan DM, Sunyaev SR. 2013. Predicting functional effect of human
missense mutations using PolyPhen-2. Curr Protoc Hum Genet Chapter 7:Unit7
Biesecker LG. 2010. Exome sequencing makes medical genomics a reality. Nat Genet
Biobase/HGMD. 2013. What’s new at HGMD.
Brunham LR, Hayden MR. 2012. Medicine. Whole-genome sequencing: the new stan-
dard of care? Science 336:1112–1113.
Cassa CA, Savage SK, Taylor PL, Green RC, McGuire AL, Mandl KD. 2012. Disclosing
pathogenic genetic variants to research participants: quantifying an emerging
ethical responsibility. Genome Res 22:421–428.
Consortium GP. 2012. 1000 Genomes Project FTP Server. 1000 Genomes phase 1,
version 3, release date April 30, 2012.
EBI. 2012. The European Genome–Phenome Archive.
Fabsitz RR, McGuire A, Sharp RR, Puggal M, Beskow LM, Biesecker LG, Bookman E,
Burke W, Burchard EG, Church G, Clayton EW, Eckfeldt JH, et al. 2010. Ethical
updated guidelines from a National Heart, Lung, and Blood Institute working
group. Circ Cardiovasc Genet 3:574–580.
GenomeWeb. 2011a. Baylor Whole Genome Laboratory Launches Clinical Exome Se-
Genome Sequencing Interpretation Service in 2012.
Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, Pearson JV,