
Quantitative Analysis of Single Nucleotide

Polymorphisms within Copy Number Variation

Soohyun Lee1, Simon Kasif1,2,3, Zhiping Weng1,2,5*, Charles R. Cantor1,2,4*

1Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America, 2Department of Biomedical Engineering, Boston University, Boston,

Massachusetts, United States of America, 3Children’s Hospital Informatics Program at Harvard-MIT Health Sciences and Technology, Boston, Massachusetts, United States

of America, 4Sequenom Inc., San Diego, California, United States of America, 5Biochemistry and Molecular Pharmacology, University of Massachusetts, Worcester,

Massachusetts, United States of America

Abstract

Background: Single nucleotide polymorphisms (SNPs) have been used extensively in genetics and epidemiology studies.

Traditionally, SNPs that did not pass the Hardy-Weinberg equilibrium (HWE) test were excluded from these analyses. Many

investigators have addressed possible causes for departure from HWE, including genotyping errors, population admixture

and segmental duplication. Recent large-scale surveys have revealed abundant structural variations in the human genome,

including copy number variations (CNVs). This suggests that a significant number of SNPs must be within these regions,

which may cause deviation from HWE.

Results: We performed a Bayesian analysis on the potential effect of copy number variation, segmental duplication and

genotyping errors on the behavior of SNPs. Our results suggest that copy number variation is a major factor of HWE

violation for SNPs with a small minor allele frequency, when the sample size is large and the genotyping error rate is 0–1%.

Conclusions: Our study provides the posterior probability that a SNP falls in a CNV or a segmental duplication, given the

observed allele frequency of the SNP, sample size and the significance level of HWE testing.

Citation: Lee S, Kasif S, Weng Z, Cantor CR (2008) Quantitative Analysis of Single Nucleotide Polymorphisms within Copy Number Variation. PLoS ONE 3(12):

e3906. doi:10.1371/journal.pone.0003906

Editor: Richard Mayeux, Columbia University, United States of America

Received May 9, 2008; Accepted November 11, 2008; Published December 18, 2008

Copyright: © 2008 Lee et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted

use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: R01 GM080625 & Sequenom Support

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: zhiping@umassmed.edu (ZW); ccantor@sequenom.com (CRC)

Introduction

1. Single nucleotide polymorphism (SNP) and Hardy-Weinberg equilibrium (HWE).

Single nucleotide polymorphisms (SNPs) are common biallelic variations that are widely used as genetic markers in linkage analyses and association studies[1]. Most human SNPs satisfy the Hardy-Weinberg equilibrium (HWE), the condition of allelic independence, in which allele frequencies and genotype frequencies do not change over generations[2,3]. Hunter et al.[4] reported that 5.0% and 1.3% of SNPs in their analysis deviated from HWE at significance levels α=0.05 and α=0.01, respectively, which indicates that most human SNPs are under the null hypothesis of HWE. A departure from HWE can be explained by natural selection, population admixture, inbreeding, experimental errors and duplication[5]. Conventionally, SNPs that are significantly deviated from HWE are discarded before further analysis.

2. Copy number variation (CNV) and segmental duplication (SD).

A copy number variation (CNV) is a genomic segment larger than 1 kb that occurs in variable numbers in the genome. When the variant frequency is larger than 1% in a population, it is called a copy number polymorphism (CNP). In some contexts, CNV stands for copy number variants[6], which refers to individuals whose copy number is different from the majority in a population. Here, by CNV we refer to a specific locus, or a genetic marker in a population that shows variations among individuals.

A segmental duplication (SD) refers to a large duplicated

sequence in the genome, conventionally longer than 1 kb with at

least 90% sequence identity between duplicate copies (reviewed by

Bailey and Eichler[7]). SDs occupy about 5% of the human

genome[8]. SDs are closely related to CNVs, except that an SD

does not have a varying copy number within a population. Based

on a single Caucasian individual’s diploid genome sequence that

came out recently, about 55% of CNVs seem to overlap with an

annotated SD[9]. A similar rate of overlap had been reported in

another study based on comparison between the human genome

reference sequence and a fosmid-paired-end library[10]. Redon et

al.[11] suggested that the significant overlap between SD and

CNV is partly because of incorrect annotation of CNVs as SDs;

i.e. the number of individuals sequenced was not large enough to

detect rare variants. Moreover, CNVs and SDs can be viewed as a

special case of one another. Sebat et al.[12] viewed copy number

gains as recent segmental duplications. We adopt a view that SD is

an extreme case of CNV in which duplication frequency is 100%.

3. SNPs in a CNV.

Recent studies show that at least 12%–

15% of the human genome is covered by copy number

variations[11,12]. Moreover, 56% of the CNVs identified were

in known genes, according to Iafrate et al.[13] and Zogopoulos et

al.[14]. The large proportion of CNVs in the genome indicates

PLoS ONE | www.plosone.org1 December 2008 | Volume 3 | Issue 12 | e3906


that a significant number of SNPs may fall in these regions.

Nguyen et al. showed that SNPs are significantly enriched in

known human CNVs[15].

We are interested to know how a SNP would behave when it is

in a copy number variation. We begin with an ‘observed SNP’ site,

that shows two different bases in sequencing or genotyping

experiments. The measured genotype and allele frequencies of an

observed SNP may not reflect the true frequencies when

additional copies exist. An observed SNP may not even be a true

SNP, but instead a variation between two duplicate copies.

It is difficult to separate duplicate copies experimentally. The

sequences flanking the two loci are nearly identical and PCR

(polymerase chain reaction) and extension reactions cannot

differentiate them. Finding out the exact genotypes for CNVs is

also a challenging problem and only relative quantification is

available to date[16]. Thus, computational inference can be useful

at this point for understanding the Hardy-Weinberg disequilibrium (HWD) of SNPs in a CNV.

Our study focused on relatively small scale SNP studies with

limited information. Detection and validation of CNVs through

experimental and computational methods have been an ongoing

problem. However, such information is often limited due to

differences in population (e.g. ethnicity), lack of confirmed

boundaries, and quantification relative to the population average

rather than the absolute number of copies.

Methods have been developed specifically for detecting CNVs

using a large number of SNPs. SNP arrays (BeadArray™ by

Illumina and GeneChip® by Affymetrix) became available recently

that allow simultaneous genotyping of CNVs and SNPs. Software

that detects CNVs from the SNP arrays has been developed (e.g.

BeadStudio LOH+ by Illumina and QuantiSNP by Colella et

al.[17]). QuantiSNP uses the information that many consecutive

SNPs within a CNV region must share the effect of a CNV and

has an estimated false positive rate of 1 CNV in 100,000 SNPs.

McCaroll et al.[18] identified 541 deletion variants by using the

neighboring-marker effect as well as HWD and non-Mendelian

inheritance. Most of these approaches use the logic that closely

located neighboring SNPs share the same CNV.

However, not every investigator genotypes such a dense set of

SNPs, depending on the goal of the genetics or epidemiology

study. Closely positioned SNPs are often in linkage disequilibrium

and many investigators prefer typing distant SNPs for cost

effectiveness. Our goal is to compute the theoretical degree of

contribution of CNVs and SDs to HWD of individual SNPs

provided limited knowledge of CNVs in the particular population

under study, rather than developing a method of detecting CNVs

using a dense set of genotyped SNPs.

The power to detect deviation from HWE in SNPs in a

segmentally duplicated region was recently examined by theoretical

analysis and simulation[19]. Here we provide a more general

model that considers CNVs and their relative contribution to

HWD. We construct a quantitative SNP-CNV mixture model and

present Bayesian estimates of probability of a SNP being in a

CNV, given that it is significantly deviated from HWE. To our

knowledge this is the first study to provide the posterior

probabilities P(CNV|HWD).

Results

I. Model and assumptions

According to Redon et al., only about 1–2% of CNVs are multiallelic and 5–10% are complex[11]. Thus, the majority of the CNVs detected may be biallelic, which involves either a single duplication or a single deletion. Deletion polymorphisms are relatively easy to identify through null-allele individuals. Assuming that there is no null-allele individual, we propose that a biallelic CNV assumption is a good start for quantitative modeling. An extension may be applied to multiallelic or more complex cases. In order to deal with multiallelic CNVs, more parametric assumptions are required, such as how sequence variations are distributed across different copies. We believe that a multiallelic extension may be more informative after we gain more knowledge about these parameters.

Under a biallelic CNV assumption, we can imagine a situation as depicted in Figure 1. Suppose that we have two sites L1 and L2, where L1 is always diploid and L2 is a variable ectopic site. In some individuals, L2 may not exist or may exist in only one of the two homologous chromosomes. Suppose, as an example, that the observed SNP has alleles A and C, with A as the minor allele. Each of the two sites can be either polymorphic or monomorphic. We denote by p1 the true frequency of allele A at L1, and by p2 the true frequency of allele A at L2. Though we assume that A is the observed minor allele, it does not have to be the minor allele at each site, and p1 and p2 may range from 0 to 1. Additionally, we introduce a new parameter r, the frequency of having both sites L1 and L2, as opposed to having only L1. Thus, r refers to the true allele frequency of the underlying CNV. For a CNV, r can vary between 0 and 1. When there is no duplication (i.e. regular genomic regions), r=0. When duplication is fixed in all individuals in the population (segmental duplication), r=1. For convenience, here r∈(0,1) (i.e. 0<r<1) is treated as equivalent to a CNV, r=0 to a regular genomic region, and r=1 to an SD.

If both sites are polymorphic with different pairs of bases, the observed SNP will be triallelic (or even quadrallelic); such cases are not considered in the current study. Here, we assume the observed SNP is biallelic, as well as the true sites and the CNV itself.

Theoretical derivation of observed genotype frequencies.

Given true SNP allele frequencies p1 and p2 and CNV allele frequency r, observed SNP genotype frequencies $\hat{p}_{AA}$, $\hat{p}_{CC}$ and $\hat{p}_{AC}$ were derived, under the assumption that each of the three markers (two SNP sites and a CNV) is independent and under Hardy-Weinberg equilibrium (details in Method S1):

$$\hat{p}_{AA} = p_1^2\,(1 - r + r p_2)^2 \tag{1}$$

$$\hat{p}_{CC} = (1 - p_1)^2\,(1 - r p_2)^2 \tag{2}$$

$$\hat{p}_{AC} = 1 - \hat{p}_{AA} - \hat{p}_{CC} \tag{3}$$

Observed allele frequencies can be directly calculated from observed genotype frequencies.

$$\hat{p}_{A} = \hat{p}_{AA} + \hat{p}_{AC}/2, \qquad \hat{p}_{C} = 1 - \hat{p}_{A} \tag{4}$$
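Equations (1)–(4) translate directly into a few lines of code. The following is a minimal sketch, not the authors' implementation; the function names and the example parameter values are chosen here purely for illustration.

```python
def observed_genotype_freqs(p1, p2, r):
    """Observed genotype frequencies (AA, CC, AC) following equations (1)-(3).

    p1, p2: true frequency of allele A at the original (L1) and ectopic (L2) site.
    r:      frequency of chromosomes carrying the ectopic site (CNV allele frequency).
    """
    p_aa = (p1 * (1.0 - r + r * p2)) ** 2      # every observed copy is A
    p_cc = ((1.0 - p1) * (1.0 - r * p2)) ** 2  # every observed copy is C
    p_ac = 1.0 - p_aa - p_cc                   # both bases are observed
    return p_aa, p_cc, p_ac


def observed_allele_freq_a(p1, p2, r):
    """Observed frequency of allele A following equation (4)."""
    p_aa, _, p_ac = observed_genotype_freqs(p1, p2, r)
    return p_aa + p_ac / 2.0


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    for r in (0.0, 0.5, 1.0):
        p_aa, p_cc, p_ac = observed_genotype_freqs(p1=0.3, p2=0.0, r=r)
        p_a = observed_allele_freq_a(p1=0.3, p2=0.0, r=r)
        print(f"r={r:.1f}  AA={p_aa:.4f}  CC={p_cc:.4f}  AC={p_ac:.4f}  A={p_a:.4f}")
```

With r=0 the functions reduce to the ordinary Hardy-Weinberg proportions at L1. At the other extreme, `observed_genotype_freqs(0.0, 1.0, 1.0)` returns `(0.0, 0.0, 1.0)`: when a fixed duplication carries a different base at each copy, every individual appears heterozygous although neither site is a true SNP.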

SNP genotyping errors.

In theory, SNP genotyping errors can occur in both directions, and their rate depends on which nucleotides are involved. However, it is more common to misread a heterozygote as a homozygote. In our mixture model, we take a conservative approach and assume that all genotyping errors mistake a heterozygote for a homozygote, and not the other way around. If we consider both directions, the two effects counterbalance each other and contribute less to HWD. Thus, our assumption of one-way genotyping error means that the genotyping error fully contributes to HWD and does not cancel out within itself.
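The one-way error assumption can be illustrated with a small sketch. The exact error model is given in Method S1 and is not reproduced here; the code below assumes, purely for illustration, that a fraction of heterozygotes equal to the error rate is misread and that the misread calls are split evenly between the two homozygotes. The function names are hypothetical.

```python
def apply_one_way_error(p_aa, p_cc, p_ac, error_rate):
    """Misread a fraction `error_rate` of heterozygotes as homozygotes.

    Assumption (not taken from the source): misread heterozygotes are
    split evenly between AA and CC.
    """
    moved = error_rate * p_ac
    return p_aa + moved / 2.0, p_cc + moved / 2.0, p_ac - moved


def heterozygote_excess(p_aa, p_cc, p_ac):
    """Observed heterozygote frequency minus its HWE expectation 2*pA*(1-pA)."""
    p_a = p_aa + p_ac / 2.0
    return p_ac - 2.0 * p_a * (1.0 - p_a)


if __name__ == "__main__":
    # Start from exact HWE proportions for an allele frequency of 0.3.
    hwe = (0.09, 0.49, 0.42)
    for e in (0.0, 0.01, 0.05, 0.25):
        excess = heterozygote_excess(*apply_one_way_error(*hwe, e))
        print(f"error rate {e:.2f}: heterozygote excess {excess:+.4f}")
```

Under this assumption the allele frequency is unchanged while heterozygotes are lost, so the error pushes genotype frequencies toward a heterozygote deficit, the opposite direction to the heterozygote excess produced by duplication.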

Analysis of SNPs within CNVs


II. Effect of allele frequency parameters on HWD

1. Measure of HWD.

Our first goal is to understand the relationship between HWD, r, p1, p2 and $\hat{p}_A$. For this purpose, we used a quantitative measure of HWD. A measure of Hardy-Weinberg disequilibrium, h, has been suggested by Olson and Foley[20].

$$h = \frac{p_{AC}^2}{4\,p_{AA}\,p_{CC}}$$

where pAA, pCC and pAC are frequencies of genotypes AA, CC and AC.

Under HWE, h=1. When there are excessive heterozygotes, h>1. When there are more homozygotes than expected under HWE, h<1. Unlike other HWD measures such as the disequilibrium parameter D[21] and the inbreeding coefficient f[22], h does not assume symmetric deviations from the two homozygote frequencies, which is useful for our analysis because the effect of a CNV on the two homozygote frequencies is not always symmetric.

2. Behavior of h with respect to r, p1 and p2.

As seen in Figure 2, h monotonically increases with r, regardless of p1 and p2. This indicates that the ectopic site contributes to increasing the number of observed heterozygotes relative to homozygotes. Based on the assumption of no other causes of HWD such as SNP genotyping errors, h never goes below 1 (log(h) is always ≥0). Thus, duplication always results in excessive heterozygotes.

3. Estimation of r, given h and an observed minor allele frequency.

Given the observed minor allele frequency, the possible values of r vary widely depending on the assumption of p2. The plots in Figure 3 were drawn based on the simulation described above. A larger h always indicates a larger r, given $\hat{p}_A$. A higher $\hat{p}_A$ may indicate a larger or a smaller r, depending on p2.

4. Range of p1, given an observed allele frequency and r.

Figure 4 shows the relationship between the true and the observed allele frequencies given r. When r is large and the minor allele frequency is large, the deviation of observed allele frequency from true allele frequency p1 can be very large. Thus, in this case the observed allele frequency cannot serve as a substitute for the true allele frequency. In the majority of the cases, the minor allele frequency is overestimated. Figure S1 shows the range of true allele frequency given pooled sample allele frequencies.
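The behavior of h with respect to r can be checked numerically. The sketch below is an illustration under the biallelic model above, not the authors' simulation code; the function names, the grid of r values and the choice p1=0.3 are assumptions made for the example.

```python
import math


def observed_genotype_freqs(p1, p2, r):
    """Observed genotype frequencies (AA, CC, AC) under the biallelic CNV model."""
    p_aa = (p1 * (1.0 - r + r * p2)) ** 2
    p_cc = ((1.0 - p1) * (1.0 - r * p2)) ** 2
    return p_aa, p_cc, 1.0 - p_aa - p_cc


def hwd_measure(p_aa, p_cc, p_ac):
    """h = pAC^2 / (4 pAA pCC); returns infinity when a homozygote class is absent."""
    denom = 4.0 * p_aa * p_cc
    if denom == 0.0:
        return math.inf
    return p_ac ** 2 / denom


def log2_h_curve(p1, p2, r_values):
    """log2(h) along a grid of r, as in the axes of Figure 2."""
    return [math.log2(hwd_measure(*observed_genotype_freqs(p1, p2, r)))
            for r in r_values]


if __name__ == "__main__":
    r_grid = [i / 10 for i in range(10)]  # r = 0.0 ... 0.9
    for p2 in (0.0, 1.0):                 # the two panels of Figure 2
        curve = log2_h_curve(0.3, p2, r_grid)
        print(f"p2={p2:.0f}: " + " ".join(f"{v:.2f}" for v in curve))
```

For 0<p1<1, log2(h) starts at 0 for r=0 and rises with r in both panels, in line with the statement that duplication only adds apparent heterozygotes.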

III. Probability that an HWE-violating SNP is in a CNV

P(CNV|HWD), or the probability that a SNP is in a CNV (i.e. r∈(0,1)), given that the SNP is in HWD, was computed at different observed allele frequencies ($\hat{p}_A$), significance levels for HWD testing (α), sample sizes (n) and SNP genotyping error rates (e_g). Several hypothesis tests for HWE have been proposed, including the most commonly used chi-square goodness-of-fit test[23]. Here we used a chi-square test. We used two different prior distributions for the true CNV allele frequency r: uniform and beta distributions. The uniform prior assumes equal probability density for all allele frequencies, whereas the beta distribution assumes higher probability towards a smaller r (more detail can be found in the discussion section and Method S1).
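For reference, the chi-square goodness-of-fit test for HWE can be sketched as follows. This is the standard textbook form with one degree of freedom and no continuity correction, which may differ in detail from the procedure used for the figures; the function names are chosen here for illustration.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def hwe_chi_square_test(n_aa, n_ac, n_cc):
    """Chi-square goodness-of-fit test for HWE from genotype counts.

    Returns (statistic, p_value). Expected counts use the sample allele
    frequency, leaving one degree of freedom.
    """
    n = n_aa + n_ac + n_cc
    p_a = (2 * n_aa + n_ac) / (2.0 * n)
    expected = (n * p_a ** 2, 2.0 * n * p_a * (1.0 - p_a), n * (1.0 - p_a) ** 2)
    if min(expected) == 0.0:
        raise ValueError("monomorphic sample: HWE test is undefined")
    observed = (n_aa, n_ac, n_cc)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, chi2_sf_1df(stat)


if __name__ == "__main__":
    # Hypothetical genotype counts (AA, AC, CC) for n=100.
    for counts in ((9, 42, 49), (0, 60, 40)):
        stat, p = hwe_chi_square_test(*counts)
        print(f"counts={counts}  chi2={stat:.3f}  p={p:.3g}")
```

A SNP is declared to be in HWD when the p-value falls below the chosen significance level α.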

As seen in Figure 5, at α=0.05 and n=100, under the assumption of no genotyping error and a beta prior, segmental duplication (r=1) was the most probable cause of HWD. Interestingly, when the observed minor allele frequency is small (<0.2), duplicons happen to generate allele frequencies that mimic apparent HWE, and random variation is the most important cause of HWD at these small minor allele frequencies. Under the beta prior with 5% genotyping error, the contribution from SD or CNV becomes minor, except at $\hat{p}_A$>0.4. Under a very large genotyping error, the probability of the SNP not being in a CNV or SD is 60–80%. In general, a 1% genotyping error made little difference compared to the case of no genotyping error. For n=1000 and α=0.01, with 0–1% genotyping errors, the most likely cause of HWD was CNV or SD, depending on the observed allele frequency. CNV and SD tend to counterbalance one-way genotyping errors, as seen clearly in the case of a 25% error rate.
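The posterior probabilities discussed above combine a prior over the three classes of region with the probability that the HWE test rejects in each class. As a minimal sketch of that Bayes' rule structure, with conditioning on the observed allele frequency, n, α and the genotyping error rate left implicit: the class weights $\pi_0$, $\pi_{\mathrm{CNV}}$, $\pi_{\mathrm{SD}}$ and the density $f(r)$ are notation introduced here for illustration only, and the exact priors and the treatment of p1 and p2 are those of Method S1, which is not reproduced here.

```latex
P(\mathrm{CNV}\mid\mathrm{HWD})
  = \frac{\pi_{\mathrm{CNV}}\,P(\mathrm{HWD}\mid\mathrm{CNV})}
         {\pi_{0}\,P(\mathrm{HWD}\mid r=0)
          + \pi_{\mathrm{CNV}}\,P(\mathrm{HWD}\mid\mathrm{CNV})
          + \pi_{\mathrm{SD}}\,P(\mathrm{HWD}\mid r=1)},
\qquad
P(\mathrm{HWD}\mid\mathrm{CNV}) = \int_{0}^{1} P(\mathrm{HWD}\mid r)\,f(r)\,dr
```

In this form, with no genotyping error the term for regular regions, P(HWD|r=0), is approximately the significance level α, so a stricter α lowers the share of HWD attributed to random variation.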

Figure 1. Possible cases of a SNP in a biallelic CNV. All possible cases of observed SNPs on a biallelic, duplication-type CNV. Each gray box

represents an individual. Two parallel lines are homologous chromosomes. The left homologous pair represents the original site (L1) and the right

pair represents the ectopic site (L2). The ectopic site may not exist or exist in only one of the homologous chromosomes in some individuals.

doi:10.1371/journal.pone.0003906.g001


The relative contribution by duplication is quite different depending on the stringency of HWD testing (Figure 5, No genotyping error). At α=0.05, theoretically about 5% of SNPs in the regular regions must be determined to be in HWD, whereas at α=0.01, only about 1% are. Also, at α=0.05 and n=100, SNPs in duplicons (CNV/SD) often do not generate a sufficient deviation from HWE to be detected by the testing, whereas at α=0.01 and n=1000, the likelihood of HWD given CNV or SD becomes so much larger (Figure 6, No genotyping error) that the posterior probabilities point to CNVs and SDs as major contributors to HWD.
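The dependence on sample size can be made concrete with a rough calculation. For fixed observed genotype frequencies, the systematic part of the chi-square statistic grows linearly with n, so the same deviation that goes undetected at n=100 can be clearly significant at n=1000. The sketch below is an illustration, not the computation behind Figures 5 and 6: it compares this systematic shift with the 1-degree-of-freedom critical values and ignores sampling variation, so it is only a crude indicator of power. The parameter values are hypothetical.

```python
def observed_genotype_freqs(p1, p2, r):
    """Observed genotype frequencies (AA, CC, AC) under the biallelic CNV model."""
    p_aa = (p1 * (1.0 - r + r * p2)) ** 2
    p_cc = ((1.0 - p1) * (1.0 - r * p2)) ** 2
    return p_aa, p_cc, 1.0 - p_aa - p_cc


def hwe_effect_size(p_aa, p_cc, p_ac):
    """Per-individual chi-square contribution of the deviation from HWE."""
    p_a = p_aa + p_ac / 2.0
    expected = (p_a ** 2, (1.0 - p_a) ** 2, 2.0 * p_a * (1.0 - p_a))
    observed = (p_aa, p_cc, p_ac)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Chi-square critical values for 1 degree of freedom.
CRITICAL_1DF = {0.05: 3.841, 0.01: 6.635}

if __name__ == "__main__":
    # Hypothetical SNP in a CNV: p1=0.2, ectopic site fixed for C, r=0.5.
    w2 = hwe_effect_size(*observed_genotype_freqs(p1=0.2, p2=0.0, r=0.5))
    for n, alpha in ((100, 0.05), (1000, 0.01)):
        shift = n * w2
        verdict = "above" if shift > CRITICAL_1DF[alpha] else "below"
        print(f"n={n}, alpha={alpha}: n*w2={shift:.2f} is {verdict} "
              f"the critical value {CRITICAL_1DF[alpha]}")
```

For these example values the shift is below the α=0.05 threshold at n=100 but well above the stricter α=0.01 threshold at n=1000.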

The uniform model (Figure S2, S3) tends to conclude a higher

contribution of CNV to HWD compared to the beta model, which

is intuitive because the uniform model assumes more CNVs whose

allele frequencies are close to SD than to regular regions.

The computation by sampling directly from priors converged,

as suggested by one of the cases shown here (Figure 6). The

computation was done by summing the probabilities of different

Figure 2. r vs log(h), given true allele frequency. A. p2=0, B. p2=1. Log base 2.

doi:10.1371/journal.pone.0003906.g002

Figure 3. log(h) vs r, given observed allele frequency. A. p2=0, B. p2=1. Log base 2. Observed allele frequencies are derived from computed

observed genotype frequencies.

doi:10.1371/journal.pone.0003906.g003


cases of r, p1 and p2. Some individual cases failed to converge but

did not affect the overall summation, because their values were

negligibly small (Figure S4).

Discussion

Effect of allele frequency parameters on HWD

Our simulation shows that, in the absence of experimental errors, the HWD measure h only increases

with r, supporting the view that

duplication acts in the direction of increasing observed heterozygotes.

Probability that an HWE-violating SNP is in a CNV

Our results suggest that copy number variation can be a major

contributor to HWD, even assuming the tendency towards small

variant frequencies of CNV, especially at a low observed SNP

minor allele frequency and large sample size. Segmental

duplication is a major effect at a higher observed SNP minor

allele frequency. About 1% genotyping errors did not make much

difference to P(CNV|HWD). At a 5% or higher genotyping error,

CNV or SD is less likely to be the cause of HWD.

Our results show that the probability of a SNP being in a

duplicated region given HWD depends on the observed allele

frequency. In the case of a high observed minor allele frequency,

HWD tends to be due to duplication, whereas in the case of a small

$\hat{p}_A$, HWD is mainly due to SNP genotyping error and random

variation. This is mainly because the effect of duplication can be

buffered for low observed minor allele frequencies.

Hosking et al.[24] analyzed 36 HWE-violating SNPs and

concluded that 58% of these cases were due to genotyping errors.

This is an average that does not depend on observed minor allele

Figure 4. Range of p1, given r and $\hat{p}_A$. The black diagonal line is the case where the true frequency p1 is identical to the observed frequency. Red

and blue curves represent p2=0 and p2=1, respectively.

doi:10.1371/journal.pone.0003906.g004
