High-resolution mapping and analysis of copy
number variations in the human genome: A data
resource for clinical and research applications
Tamim H. Shaikh,1,2,11 Xiaowu Gai,3,11 Juan C. Perin,3 Joseph T. Glessner,4 Hongbo Xie,3
Kevin Murphy,5 Ryan O’Hara,3 Tracy Casalunovo,4 Laura K. Conlin,1 Monica D’Arcy,5
Edward C. Frackelton,4 Elizabeth A. Geiger,1 Chad Haldeman-Englert,1
Marcin Imielinski,4 Cecilia E. Kim,4 Livija Medne,1 Kiran Annaiah,4 Jonathan P. Bradfield,4
Elvira Dabaghyan,4 Andrew Eckert,4 Chioma C. Onyiah,4 Svetlana Ostapenko,3
F. George Otieno,4 Erin Santa,4 Julie L. Shaner,4 Robert Skraban,4 Ryan M. Smith,4
Josephine Elia,6,7 Elizabeth Goldmuntz,2,9 Nancy B. Spinner,1,2 Elaine H. Zackai,1,2
Rosetta M. Chiavacci,4 Robert Grundmeier,2,3,8 Eric F. Rappaport,3
Struan F.A. Grant,1,2,4 Peter S. White,2,3,5,12 and Hakon Hakonarson1,2,4,10,12
1Division of Genetics, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA; 2Department of Pediatrics,
University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104, USA; 3Center for Biomedical Informatics, The
Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA; 4Center for Applied Genomics, The Children’s Hospital of
Philadelphia, Philadelphia, Pennsylvania 19104, USA; 5Division of Oncology, The Children’s Hospital of Philadelphia, Philadelphia,
Pennsylvania 19104, USA; 6Department of Child and Adolescent Psychiatry, The Children’s Hospital of Philadelphia, Philadelphia,
Pennsylvania 19104, USA; 7Department of Psychiatry, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania
of Cardiology, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA; 10Division of Pulmonary Medicine, The
Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA
We present a database of copy number variations (CNVs) detected in 2026 disease-free individuals, using high-density,
SNP-based oligonucleotide microarrays. This large cohort, comprised mainly of Caucasians (65.2%) and African-
Americans (34.2%), was analyzed for CNVs in a single study using a uniform array platform and computational process.
We have catalogued and characterized 54,462 individual CNVs, 77.8% of which were identified in multiple unrelated
individuals. These nonunique CNVs mapped to 3272 distinct regions of genomic variation spanning 5.9% of the genome;
51.5% of these were previously unreported, and >85% are rare. Our annotation and analysis confirmed and extended
previously reported correlations between CNVs and several genomic features such as repetitive DNA elements, segmental
duplications, and genes. We demonstrate the utility of this data set in distinguishing CNVs with pathologic significance
from normal variants. Together, this analysis and annotation provide a useful resource to assist with the assessment of
CNVs in the contexts of human variation, disease susceptibility, and clinical molecular diagnostics.
[Supplemental material is available online at http://www.genome.org. The CNV data reported here are available at
http://cnv.chop.edu. These data are also available in the Database of Genomic Variants (DGV) (http://projects.tcag.ca/variation).
The individual level intensity data from the Illumina arrays are available in dbGaP (http://www.ncbi.nlm.nih.gov/dbgap)
under accession phs000199.v1.p1.]
Copy number variation (CNV) in the human genome significantly
influences human diversity and predisposition to disease (Sebat
et al. 2004, 2007; Sharp et al. 2005; Conrad et al. 2006; Feuk et al.
2006; Hinds et al. 2006; McCarroll et al. 2006; Redon et al. 2006;
Kidd et al. 2008; Perry et al. 2008; Walsh et al. 2008). CNVs arise
from genomic rearrangements, primarily owing to deletion, du-
plication, insertion, and unbalanced translocation events. The
pathogenic role of CNVs in genetic disorders has been well docu-
mented (Lupski and Stankiewicz 2005), yet the extent to which
CNVs contribute to phenotypic variation and complex disease
predisposition remains poorly understood. CNVs have been
known to contribute to genetic disease through different mecha-
nisms, resulting in either imbalance of gene dosage or gene dis-
ruption in most cases. In addition to their direct correlation with
that can be deleterious (Feuk et al. 2006; Freeman et al. 2006).
11These authors contributed equally to this work.
E-mail firstname.lastname@example.org; fax (215) 590-3020.
E-mail email@example.com; fax (267) 426-0363.
Article published online before print. Article and publication date are at
1682 Genome Research
19:1682–1690 © 2009 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/09; www.genome.org
Recently, several studies have reported an increased burden of
rare or de novo CNVs in complex disorders such as autism, ADHD,
and schizophrenia as compared to normal controls, highlighting
the potential pathogenicity of rare or unique CNVs (Sebat et al.
2007; International Schizophrenia Consortium 2008; Stefansson
et al. 2008; Walsh et al. 2008; Xu et al. 2008; Elia et al. 2009). Thus,
more thorough analysis of genomic CNVs is necessary in order to
determine their role in conveying disease risk.
Several approaches have been used to examine CNVs in the
genome, including array CGH and genotyping microarrays
(Albertson and Pinkel 2003; Iafrate et al. 2004; Sebat et al. 2004;
Sharp et al. 2005; Redon et al. 2006; Wong et al. 2007). Results
from more than 30 studies comprising 21,000 CNVs have been
reported in public repositories (Iafrate et al. 2004). However,
a majority of these studies have been performed on limited num-
bers of individuals using a variety of nonuniform technologies,
reporting methods, and disease states. In addition, these data are
both substantially reiterative and enriched in CNV events that are
frequently observed in one or more populations. Thus, extreme
care is needed in determining whether a particular structural var-
iant plays a role in disease susceptibility or progression. To address
these challenges, we identified and characterized the constella-
tion of CNVs observed in a large cohort of healthy children and
their parents, when available. This study uses uniform measures to
detect and assess CNVs within the context of genomic and func-
tional annotations, as well as to demonstrate the utility of this
information in assessing their impact on abnormal phenotypes.
the assessment of structural variants in the contexts of human
variation, disease susceptibility, and clinical molecular diagnostics.
Assessment of copy number variation
in 2026 healthy individuals
DNA samples analyzed in our study
were obtained from the whole blood of
healthy subjects routinely seen at pri-
mary care and well-child clinic practices
within the Children’s Hospital of Philadelphia
(CHOP) Health Care Network. All
samples were uniformly genotyped using
the Illumina HumanHap 550 BeadChip.
Genotype data were analyzed for CNVs
using Illumina’s BeadStudio software in
combination with CNV detection meth-
odologies developed by our group. Data
from 2026 individuals were used for CNV
analysis, comprising 1320 Caucasians
(65.2%), 694 African-Americans (34.2%),
and 12 Asian-Americans (0.6%). Overall,
we detected a total of 54,462 CNVs,
with an average of 26.9 CNVs per in-
dividual (range 4–79) (Supplemental
Table 1). Collectively, these CNVs span-
ned 551,995,356 unique base pairs, or
~19.4% of the total human genome.
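The count of unique base pairs treats each genomic position once, however many CNV calls cover it. A minimal sketch of that bookkeeping follows, assuming CNVs are supplied as (chromosome, start, end) tuples with half-open coordinates; the tuple layout and the function name `unique_bases` are illustrative and are not taken from the study's pipeline.

```python
from collections import defaultdict


def unique_bases(cnvs):
    """Return the number of distinct base pairs covered by a set of CNV calls.

    cnvs: iterable of (chrom, start, end) with half-open coordinates [start, end).
    This input layout is an assumption made for the example.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in cnvs:
        by_chrom[chrom].append((start, end))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent call: extend the current merged block.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current block and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Toy calls: two overlapping calls on chr1, one separate call, one on chr2.
calls = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 450), ("chr2", 100, 200)]
covered = unique_bases(calls)
```

Dividing such a total by an assumed genome length gives the covered fraction; the 551,995,356 bp and ~19.4% quoted above imply a denominator of roughly 2.85 Gb.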
A majority of the CNVs detected
(77.8%) were classified as nonunique
CNVs as they were observed in more than
one unrelated individual (Table 1). Although
it is likely that some nonunique CNVs may represent false-positives
due to platform-specific artifacts, a vast majority of them
are hypothesized to be real as they were detected independently in
more than one unrelated individual. This is supported by our ex-
perimental validation of nonunique CNVs using quantitative PCR
(see below). We selected nonunique CNVs sharing at least 80%
overlap in SNP content for further analysis and annotation. Mean
and median sizes of nonunique CNVs were 38.3 kb and 7.2 kb,
respectively. A vast majority (93.8%) of these nonunique events
The remaining 22.2% of events were classified as unique
CNVs since each event was detected in just one individual. The
unique CNV set likely includes rare, individual-specific variants
as well as potential false-positives. The unique and nonunique
data sets are available for download at http://cnv.chop.edu.
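The unique/nonunique split can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: overlap is measured as shared SNPs divided by the SNP count of the larger call, relatedness is reduced to a hypothetical `family` label, and the 80% threshold from the text is applied at the classification step.

```python
def snp_overlap(a, b):
    """Fraction of shared SNPs relative to the larger of the two calls (assumed definition)."""
    return len(a & b) / max(len(a), len(b))


def classify_calls(calls, min_overlap=0.8):
    """Label each call 'nonunique' if an unrelated individual carries a
    sufficiently overlapping call, otherwise 'unique'.

    calls: list of dicts with keys 'family' (hypothetical relatedness group)
    and 'snps' (set of SNP identifiers spanned by the call).
    """
    labels = []
    for i, call in enumerate(calls):
        shared = any(
            other["family"] != call["family"]
            and snp_overlap(call["snps"], other["snps"]) >= min_overlap
            for j, other in enumerate(calls)
            if j != i
        )
        labels.append("nonunique" if shared else "unique")
    return labels


calls = [
    {"family": "F1", "snps": {"rs1", "rs2", "rs3", "rs4", "rs5"}},
    {"family": "F2", "snps": {"rs1", "rs2", "rs3", "rs4"}},  # shares 4 of 5 SNPs with the first call
    {"family": "F1", "snps": {"rs10", "rs11"}},
    {"family": "F1", "snps": {"rs10", "rs11"}},  # same family, so it does not count as independent
]
labels = classify_calls(calls)
```

The pairwise comparison is quadratic in the number of calls; a data set of this size would need calls indexed by position first.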
We used a combination of experimental methods to provide
validation for a representative set of CNVs detected in our population,
including CNVs of different size classes (Table 2). Methods
included cross-platform validation with the Affymetrix 6.0 array,
quantitative PCR, fluorescent in situ hybridization (FISH), multi-
plex ligation-dependent probe amplification (MLPA), and com-
parison with reported fosmid end-sequencing results (Table 2;
Methods; Supplemental Methods). The array-based comparison
suggested an overall validation rate of 72.7% (Table 2). For CNVs
represented by more than 10 probes on the Illumina platform, our
validation rate was >96%, with validation declining gradually
as the number of probes decreased. This analysis provides a conservative
estimate of the true positive rate of CNVs, categorized by
probe content, detected using our methods. The validation rate
[Table 1. Summary characteristics of nonunique CNVs, by number of SNPs, with columns for deletions, duplications, and all events.]
Total 4679 (84.0%)
Copy number variation mapping in healthy controls
for nonunique CNVs, spanning two to nine probes, as measured
by quantitative PCR, was 80%. All deletions (12/12) spanning
two to nine probes were validated, while duplications span-
ning two to nine probes had a much lower validation rate of 50%.
This combined with the array-based comparison results yields
a conservative false discovery rate upper bound of 50% for CNVs
spanning two to nine probes.
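The arithmetic behind these rates can be written out explicitly. In the sketch below only the 12/12 deletion result and the 50% and 80% rates come from the text. The 4-of-8 split for duplications is an assumption, chosen because it reproduces the pooled 80% rate. Treating the upper bound as the worst-performing class is our reading of "conservative".

```python
def false_discovery_rate(validated, tested):
    """FDR estimate as the fraction of tested calls that failed validation."""
    return 1 - validated / tested


# Deletions: 12 of 12 validated (stated in the text).
fdr_deletions = false_discovery_rate(12, 12)

# Duplications: only the 50% rate is stated; 4 of 8 is an ASSUMED split,
# chosen because it reproduces the pooled 80% validation rate.
fdr_duplications = false_discovery_rate(4, 8)
fdr_pooled = false_discovery_rate(12 + 4, 12 + 8)

# A conservative bound takes the worst-performing class (our interpretation).
fdr_upper_bound = max(fdr_deletions, fdr_duplications, fdr_pooled)
```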
Generation of CNV database and web-based resource
All CNVs identified in this study are available at http://cnv.chop.edu.
A database and query engine allow users to search for and
sort CNVs by a variety of criteria. Results are presented in a web-
based tabular format and as a set of study-wide file downloads for
all CNV determinations. The CNV database can be queried for all
CNVs within a selected region defined either by chromosomal
coordinates or individual gene names (Fig. 1). The user can visu-
alize all CNVs within a given interval or just focus on either the
nonunique or unique CNVs. Additionally, the web browser allows
further classification of the CNVs by ethnicity, size, number of
SNPs within, and individual variation types, which comprise
duplications and both homozygous and heterozygous deletions.
Resulting CNVs can be displayed in a tabular, graphical, or
combined format (Fig. 1; Supplemental Fig. 1). Furthermore, the
‘‘Map it’’ link allows the visualization of a particular CNV in the
context of all available annotations within the UCSC Genome
Browser (http://genome.ucsc.edu), while the ‘‘Toronto DB’’ link
accesses the corresponding CNV data in the Database of Genomic
Variants (DGV) (http://projects.tcag.ca/variation) (Fig. 1; Supple-
mental Fig. 1). A link for ‘‘downloads’’ of all CNV data from a given
display is available at the bottom of the web page.
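For readers working from the downloaded files rather than the web interface, a region query of the kind described above might look like the following sketch. The `Cnv` record, its field names, and the `query_region` function are hypothetical; they do not describe the schema or API of the database at cnv.chop.edu.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cnv:
    # Hypothetical record layout for one CNV call.
    chrom: str
    start: int
    end: int
    kind: str      # "dup", "het del", or "hom del"
    n_snps: int
    unique: bool   # True if seen in only one individual


def query_region(cnvs, chrom, start, end, kind=None, include_unique=True, min_snps=1):
    """Return CNVs overlapping chrom:start-end (inclusive), with optional filters."""
    hits = []
    for cnv in cnvs:
        if cnv.chrom != chrom or cnv.end < start or cnv.start > end:
            continue
        if kind is not None and cnv.kind != kind:
            continue
        if cnv.unique and not include_unique:
            continue
        if cnv.n_snps < min_snps:
            continue
        hits.append(cnv)
    return sorted(hits, key=lambda c: (c.start, c.end))


records = [
    Cnv("chr1", 500_000, 650_000, "het del", 12, False),
    Cnv("chr1", 1_900_000, 2_100_000, "dup", 5, True),
    Cnv("chr1", 2_500_000, 2_600_000, "dup", 8, False),
    Cnv("chr2", 100_000, 200_000, "hom del", 3, False),
]
in_window = query_region(records, "chr1", 1, 2_000_000)
nonunique_only = query_region(records, "chr1", 1, 2_000_000, include_unique=False)
```

A CNV is returned if it overlaps the window at all, so calls that straddle a window boundary are included rather than clipped.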
The contemporary Database of Genomic Variants serves as
a valuable repository of CNVs, with more than 21,000 CNVs from
31 studies represented currently. Overall comparison with this
public variant set revealed that 73.1% of our nonunique CNVs
overlapped with CNVs reported in DGV. In addition, the frequen-
cy of overlap increased as a function of population frequency:
54.9% of CNV blocks with <1% frequency overlapped with DGV
CNVRs, compared to 98.8% overlap with DGV for CNV blocks
with frequencies >10% (Supplemental Table 2). Conversely, only
together, these results indicate that the CNVs we have identi-
fied are more likely to be rare events
in comparison with previously reported
structural variant collections. This is
consistent with the notion that platform
and methodological variations may con-
tribute significantly to these differentials.
We have also examined whether the
genomic distributions of various classes
of structural and functional elements
were correlated with the presence or
absence of CNV regions. Our results ex-
tended upon previously reported corre-
lations and are available in the Supple-
annotations’’ and in Supplemental Tables
3–7 and Supplemental Figure 3. Ethnic-specific
CNV analysis was also performed
for samples of Caucasian and African
ancestries, the results of which are avail-
able in the Supplemental material (Sup-
plemental Results; Supplemental Tables 8–10; Supplemental
Interpretation of CNVs
Differences in genome coverage, resolution, technologies, cohort
prove challenging for successfully interpreting the biological sig-
nificance of particular events. In comparing our results with pre-
viously reported CNVs, data from the latter often appeared to
overstate the genomic extent of actual variation, as well as to un-
derestimate variation among individuals. One typical illustration
of these effects is represented by CNVs encompassing the putative
tumor suppressor gene CSMD1 (Fig. 2). Studies from DGV collec-
tively report 49 CNVs within this gene (mean size: 347 kb; median
size: 9560 bp), including seven duplications spanning large
stretches of the gene (all derived from HapMap cell lines) and an
additional five CNVs predicted to disrupt one or more CSMD1
exons (12/49, 24.5%). Interpretation of these results might lead to
the conclusion that genomic alterations of this gene are frequent
and do not necessarily predispose to disease risk. However, while
our CNV set identifies 507 CNVs within this region, the mean and
four of our CNVs (0.8%) in this region are predicted to disrupt
exonic sequence, and we did not detect any of the large duplica-
tions previously reported, suggesting the possibility that these are
genomic regions with CNV distributions similar to the CSMD1
example. Thus, our data set should facilitate further delineation of
the true extent of structural variation within a given genomic re-
gion, leading to improved interpretation of the biological signifi-
cance of particular events.
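The exon-disruption proportions quoted for CSMD1 come from counting CNVs that overlap at least one exon. A minimal sketch of that count follows, using toy coordinates rather than real CSMD1 positions; the function names are illustrative.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def exon_disrupting_fraction(cnvs, exons):
    """Fraction of CNVs (start, end) that overlap any exon (start, end)."""
    hits = sum(
        any(overlaps(c_start, c_end, e_start, e_end) for e_start, e_end in exons)
        for c_start, c_end in cnvs
    )
    return hits / len(cnvs)


# Toy coordinates, not real CSMD1 positions.
exons = [(1_000, 1_200), (5_000, 5_150)]
cnvs = [(100, 900), (1_100, 1_300), (2_000, 4_000), (4_900, 6_000)]
toy_fraction = exon_disrupting_fraction(cnvs, exons)

# The proportions quoted in the text follow from the reported counts.
dgv_fraction = 12 / 49     # DGV: seven large duplications plus five exon-disrupting CNVs
cohort_fraction = 4 / 507  # this study: CNVs predicted to disrupt exonic sequence
```

Rounded to one decimal place, the two reported proportions are 24.5% and 0.8%, matching the figures in the text.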
Assessment of pathogenicity in clinical samples
A CNV data set generated from healthy controls has the potential
to be very useful in clinical applications as a comparator with
CNVs identified in diseased individuals. We demonstrate the
clinical utility of our CNV collection using the example of
a patient with multiple congenital anomalies, including global
[Table 2. Validation of CNVs. Columns: Type of CNV; Number of SNPs; Experimental technique used for validation, with one column citing Kidd et al. (2008).]
aTotal number validated/total number tested.
(NT) Not tested; (NA) not applicable.
Shaikh et al.
developmental delay and brain malformations. Interestingly, 32 of
35 CNVs identified in this individual were transmitted from
a healthy parent or had been previously detected in healthy
controls, many of them at frequencies >1% (Table 3). Of the re-
maining three CNVs, two included olfactory receptor genes and
were relatively small in size. The third unique CNV, the second
largest CNV detected (915 kb), was a deletion in 17p13 that en-
tirely encompasses 51 genes, including several genes involved in
early embryonal development. The 915-kb deletion was vali-
dated by fluorescent in situ hybridization (data not shown).
Analysis of parental samples showed that while 32 of the 35 pro-
band CNVs were found to be inherited from a parent, the 17p13
deletion was apparently de novo, providing support for the po-
tential pathogenicity of this variant based solely on control CNV
To further assess the utility of our CNV database, we exam-
ined two microdeletions recently implicated in neurological dis-
orders. A recurrent 1.5-Mb microdeletion in 15q13.3 has been
associated with a recently recognized syndrome characterized by
mental retardation and seizures (Sharp et al. 2008). This micro-
deletion contains at least six genes, including the CHRNA7 gene
that has been implicated in epilepsy (Sharp et al. 2008). An as-
30192473, hg17, NCBI build 35) yielded 36 nonunique CNVs in
this region, comprising 16 deletions and 20 duplications (Fig. 3);
five of these CNVs were unique (all duplications; available at
http://cnv.chop.edu). Most of the control CNVs were relatively
smaller, and none encompassed the entire critical region impli-
cated in the syndrome (Fig. 3), except for one unique duplication
encompassing the entire region (data not shown). The high prev-
alence of this 15q13.3 microdeletion in affected individuals along
deletion in the etiology of the patients’ phenotypes. Furthermore,
duplication CNVs in controls outnumbered the deletion CNVs,
were larger in size, and more frequently affected coding sequences.
This suggests that gain in copy number of genes within this
region may not be as detrimental as loss due to deletion.
In sharp contrast to the above example, CNVs seen in our
database contradict the genotype–phenotype correlation made
between a microdeletion in 15q11.2 and a patient with a neuro-
logical disorder and speech impairment (Murthy et al. 2007). In
this report, an ~400-kb deletion in 15q11.2 encompassing four
assessment of our CNV set for the region (chr15:20300000–
20800000, hg17, NCBI build 35) yielded 22 CNVs (both unique
and nonunique), including 15 deletions and seven duplications.
Figure 1. Copy number variation database web portal (http://cnv.chop.edu). This view shows the ‘‘combined’’ output of nonunique CNVs in our data
set within chromosomal ‘‘position’’ chr1:1–2,000,000. The graphical view shows the extent and type of CNVs; (het del) heterozygous deletion; (dup)
duplication. The CNVR is indicated, and the frequency graph of the CNV blocks is also shown. The tabular view lists additional information for each
individual CNV, including subject ethnicity, chromosomal band (Chr), sequence start and end positions, size in base pairs, type of event, and number of
SNPs within (SNPs). The interface also provides links to associated CNVRs and CNV Blocks, the Database of Genomic Variants (Toronto DB), genes within
or overlapping the CNV (Genes), and the UCSC Genome Browser (Map It!).
the entire critical region implicated in the syndrome (Fig. 4).
Although our data do not provide conclusive evidence for or
against a role for this microdeletion in abnormal phenotypes, they
caution against relying strictly on assessment of disease-derived
CNVs for genotype–phenotype correlations. These findings un-
derscore the utility of our CNV data set in clinical diagnostics.
We present here a data set consisting primarily of relatively rare
human genomic CNVs that were derived from 2026 healthy
individuals. This resource is intended to serve as
a reference to aid in the investigation of the clinical significance of
CNVs detected in disease cohorts. We believe that this will be
a valuable resource to other investigators for applications in clin-
ical diagnostics as well as in CNV enrichment and association
studies for particular disease cohorts. Currently, there are several
databases, including DECIPHER (https://decipher.sanger.ac.uk/)
and ECARUCA (http://agserver01.azn.nl:8080/ecaruca/ecaruca.
jsp), that provide cytogenetic and clinical information on dis-
orders known to result from CNVs. We envision a pathway in
which CNV data derived from clinical samples can be compared to
these clinical databases, DGV, and our data set for each CNV
detected. The clinical significance of CNVs detected in the sample
can then be better evaluated using several criteria, including the
occurrence and frequency in healthy controls, gene content, and
the phenotype being studied.
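The comparison pathway described above can be sketched as a simple triage. Everything in this example is hypothetical: the tier names, the dictionary layout, and the matching of patient and control CNVs by a shared identifier (in practice matching would be by genomic overlap). The 1% cutoff echoes the ">1%" frequencies mentioned for the clinical example but is not a recommended threshold.

```python
def triage(patient_cnvs, control_frequency, inherited_ids, common_cutoff=0.01):
    """Sort patient CNVs into rough review tiers.

    patient_cnvs: list of dicts with 'id' and 'genes' (list of gene symbols).
    control_frequency: dict mapping CNV id -> frequency in healthy controls
        (missing means not observed in controls).
    inherited_ids: set of CNV ids also seen in a healthy parent.
    """
    tiers = {
        "common_in_controls": [],
        "rare_in_controls_or_inherited": [],
        "absent_from_controls": [],
    }
    gene_count = {}
    for cnv in patient_cnvs:
        cnv_id = cnv["id"]
        gene_count[cnv_id] = len(cnv["genes"])
        freq = control_frequency.get(cnv_id, 0.0)
        if freq > common_cutoff:
            tiers["common_in_controls"].append(cnv_id)
        elif freq > 0 or cnv_id in inherited_ids:
            tiers["rare_in_controls_or_inherited"].append(cnv_id)
        else:
            tiers["absent_from_controls"].append(cnv_id)
    # Review the gene-rich unexplained events first.
    tiers["absent_from_controls"].sort(key=lambda cid: gene_count[cid], reverse=True)
    return tiers


# Placeholder identifiers and gene names, not real loci.
patient = [
    {"id": "cnv_a", "genes": ["GENE_A"]},
    {"id": "cnv_b", "genes": []},
    {"id": "cnv_c", "genes": ["GENE_B"]},
    {"id": "cnv_d", "genes": ["GENE_C", "GENE_D", "GENE_E"]},
]
controls = {"cnv_a": 0.05, "cnv_b": 0.002}
result = triage(patient, controls, inherited_ids={"cnv_b"})
```

The tiers only order CNVs for review; they say nothing about pathogenicity, which still depends on gene content and the phenotype under study.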
Genome-wide analyses such as ours are highly dependent on
the resolution and content of the discovery platform used. The
platform used in our study provides lower SNP coverage in regions
of known common CNVs, regions of segmental duplication, and
both the X and Y chromosomes, and as such is by no means
comprehensive. Interestingly, our nonunique CNV rate was much
higher than those reported in previous studies (Redon et al. 2006).
The higher rate of nonunique CNVs ob-
served in our study can be attributed at
least in part to our larger study cohort.
and sample size approaches a plateau as
more samples are surveyed (Supplemen-
tal Fig. 7), suggesting that the majority of
events detectable by our methods and
platform are being captured. However,
recent sequence-based analyses of CNVs,
such as the fosmid end-sequencing study
of nine HapMap individuals (Kidd et al.
2008), indicate that a large number of
as-yet-undiscovered variants are present
in the human genome. Thus, we con-
clude that although not comprehensive,
our survey is identifying a substantial
proportion of moderately common and
rare genomic variations existing in the
Caucasian and African-American pop-
ulations, and a considerably larger set of
variants than currently exists in DGV.
This observation further highlights the
utility of our CNV collection for clinical
applications, as moderately recurrent and
rare CNVs are more likely to cause erro-
neous genotype–phenotype correlations.
Furthermore, analyses such as ours are also highly dependent
on computational algorithms used for detection and platform-
specific experimental errors. As the large set of CNV predictions
has precluded exhaustive validation, we focused validation efforts
on establishing general quality guidelines for guiding users. We
have used a combination of computational and experimental
techniques to carefully evaluate selected CNVs. Our analyses pre-
dict low false discovery and false-negative rates, especially for
nonunique CNVs, deletions, and CNVs spanning four or more
SNPs. Furthermore, the fact that most of our nonunique CNVs
overlapped with those reported by DGV from multiple studies
suggests that they represent authentic CNVs. While we have pro-
vided access to all CNV predictions, we recommend particular
caution in using the unique CNV data, particularly those that are
represented by fewer than four SNPs, where independent valida-
tion using experimental methods is advised.
Our analyses largely reiterated prior associations between
genomic features and CNV distributions in a larger, more uniform
sample set. The presence of ethnic-specific CNV signatures is in
keeping with the demonstration of greater genomic diversity
among individuals of African descent from HapMap data (The
International HapMap Consortium 2003, 2007; Sebat et al. 2007).
Similarly, our results confirmed that CNV distributions are posi-
tively correlated with regions of segmental duplication (Redon
et al. 2006). The role of segmental duplications (SDs) in generating
pathogenic chromosomal rearrangements by nonallelic homolo-
support a proposed model wherein CNV generation is promoted by
close proximity to SDs (Sharp et al. 2005; Redon et al. 2006).
As CNV determinations continue to improve in depth, resolution,
and inclusion, the results will empower both biological
discovery and clinical application. Greater resolution will espe-
cially be important for precisely determining the extent of each
CNV, the frequency with which specific genomic regions are
disrupted in healthy and disease cohorts, and the biological
Figure 2. Comparison of CNVs detected in the current cohort with DGV CNVs within the CSMD1
gene. (Top row) Chromosome 8 genomic sequence coordinates for the CSMD1 gene. (Second row)
Exonic structure of the 70-exon CSMD1 gene. (Red vertical lines) Exons; (black horizontal line) the
extent of the mRNA transcript. Owing to the scale of the diagram, each exon is treated as an equivalent
size, and exons with short intervening sequences are drawn adjacent to each other. (Third row) CNVs
overlap one or more CSMD1 exons. (Bottom row) CNVs within the CSMD1 gene reported in this study.
Numbers adjacent to two CNVs (designated by asterisks) indicate the number of instances in which that
exact CNV is reported. CNVs with a lighter shade of purple overlap one or more CSMD1 exons.
part on clone-based array data, may be inflated in size consistent
with other recent studies (Kidd et al. 2008). This finding is highly
significant, especially since use of current CNV databases in clinical
applications enhances the possibility of erroneously excluding
disease-causing variation in patient samples. We envision that the
genomic studies on medical disorders with a genomic component.
Sample population and SNP genotyping
Subjects were primarily recruited from the Philadelphia region
through the Hospital’s Health Care Network, including four pri-
mary care clinics and several group practices and outpatient
practices that performed well child visits. Eligibility criteria for this
study included all of the following: (1) disease-free children and
high quality, genome-wide genotyping data from blood samples
(defined in Supplemental Methods); (2) self-reported ethnic
background; and (3) no serious underlying medical disorder, in-
cluding but not limited to neurodevelopmental disorders, cancer,
chromosomal abnormalities, and known metabolic or genetic
disorders. Genotypes from a small set of parents of the participat-
ing children were used to assess CNV heritability patterns. All
subjects and/or their parents signed an informed consent permit-
Ancestry informative markers (AIMs) available on the Human-
Hap550 BeadChip (Yang et al. 2005) were used to evaluate eligible
subjects to determine ethnicity. Where the AIMs markers contra-
dicted self-reported ethnicity, the AIMs marker status was used in
the analysis. The cohort comprised 1320 Caucasians, 694 African-
Americans, and 12 Asian-Americans. This cohort contained 80
complete mother–father–child trios. Furthermore, there were 325
mother–child, 140 father–child, 59 sibling, and 10 twin relation-
ships confirmed by genotype concordance. The remaining 1492
samples shared no relatedness with other samples in this data set.
Samples were assayed on the Illumina Infinium II Human-
Hap550 BeadChip (Gunderson et al. 2005; Steemers et al. 2006)
(Illumina), as previously described in our laboratory (Hakonarson
et al. 2007). A total of 2026 individuals passed all quality control
(QC) measures, which included >98% SNP call rate and LRR stan-
dard deviation <0.35, and qualified for the study. The version of
Illumina Infinium BeadChip is consistent for all samples in this
study. The standard Illumina cluster file was used for the analysis,
which is generated at Illumina by running 120 HapMap samples,
running the BeadStudio clustering algorithm, and reviewing SNPs
with poor performance statistics, including call frequency, cluster
separation, and Hardy-Weinberg equilibrium. We reviewed this
[Table 3. Assessment of CNVs detected in a patient with multiple congenital anomalies. Columns include SNPs, CNV type, and CNV size.]
(OR gene) Olfactory receptor gene; (SD region) region of known segmental duplication (RefSeq gene transcript overlap was used for gene assessment);
(P.T.) parental transmission. Boldface indicates the putative pathogenic CNV.
clustering in reference to our typed samples to robustly establish
a reference normal diploid state for each SNP. This optimization
was essential to establish the true baseline from which theta (ratio
of green color corresponding to genotype) and R (intensity) are
calculated into B allele frequency (BAF) and Log R ratio values
(LRRs). We reviewed the raw theta and R-values of each SNP in
called CNV regions to ensure proper clustering of normal samples
and deviation of samples with a CNV call across the region. Spu-
rious single SNP-driven signals were rejected.
CNV detection and initial analysis
The Illumina BeadStudio 3.0 software package was used for initial
CNV detection analysis. LRRs and BAFs were first exported from
BeadStudio. LRR values were used as an additional sample-wide
genotype quality control measure, and samples with an LRR standard
deviation above 0.35 were excluded from the study. In our experience,
Log R ratio standard deviation provides a robust quality
metric; as demonstrated in Supplemental Figure 8, samples with
LRR SDs <0.35 have similar numbers of CNVs detected with our
method. Furthermore, samples with LRR SDs >0.35 had signifi-
cantly higher numbers of detected CNVs, a majority of which are
expected to be false-positives resulting from background.
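The two sample-level thresholds quoted in the text (SNP call rate >98% and LRR standard deviation <0.35) can be applied as in the sketch below. It assumes the sample standard deviation over whatever LRR values are supplied; whether the study used the sample or population estimator, or restricted the calculation to particular chromosomes, is not stated here.

```python
from statistics import stdev

CALL_RATE_MIN = 0.98   # SNP call rate must exceed 98%
LRR_SD_MAX = 0.35      # LRR standard deviation must be below 0.35


def passes_qc(call_rate, lrr_values):
    """Apply the two sample-level QC thresholds quoted in the text."""
    return call_rate > CALL_RATE_MIN and stdev(lrr_values) < LRR_SD_MAX


# Toy LRR values for illustration only.
quiet_sample = [0.05, -0.10, 0.12, -0.08, 0.02, -0.03]
noisy_sample = [0.60, -0.55, 0.48, -0.70, 0.52, -0.41]

quiet_ok = passes_qc(0.995, quiet_sample)
noisy_ok = passes_qc(0.995, noisy_sample)
low_call_rate_ok = passes_qc(0.97, quiet_sample)
```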
CNV detections were then performed for the remaining
genotypes using a customized analysis workflow. Briefly, chro-
mosomes were segmented based on LRRs using the Circular Binary
Segmentation algorithm implemented in the R statistical package
module DNAcopy 1.7. Default parameters were used (i.e., nperm =
10,000; alpha = 0.01; kmax = 25; nmin = 200; eta = 0.05; overlap =
0.25; trim = 0.025; undo.splits = ‘‘none’’). Segments were then
filtered based on their average LRRs and additional devised BAF
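$$\mathrm{b2.sd} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\Bigl(\min\bigl(X_i - 0,\; 1 - X_i,\; |X_i - 0.5|\bigr)\Bigr)^{2}}$$

$$\mathrm{b3.sd} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\Bigl(\min\bigl(X_i - 0,\; 1 - X_i,\; |X_i - 0.67|,\; |X_i - 0.33|\bigr)\Bigr)^{2}}$$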
Figure 3. Copy number variation within 15q13.3. Nonunique CNVs detected in our control data set that map within 15q13.3 (chr15:28,700,577–
30,302,218, hg17, NCBI build 35) are shown as custom tracks within the UCSC Genome Browser (http://genome.ucsc.edu/). (Red rectangles) Deletions;
(blue rectangles) duplications; (green rectangle) the CNV reported by Sharp et al. (2008). The UCSC known genes and segmental duplication tracks are
also shown.
The b2.sd and b3.sd for each segment were used to measure
whether the BAF pattern of a segment fits the two-copy mode
better than a three-copy mode, or vice versa. The paucity of AB
chromosomes, the thresholds used are listed in Table 4.
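The two BAF statistics can be computed as in the following sketch, which reads X_i as the B allele frequency of the i-th SNP in a segment of n SNPs and takes the square root that the "sd" naming suggests; both readings are assumptions. Absolute distances to 0 and 1 are used, which equals X_i − 0 and 1 − X_i whenever BAFs lie in [0, 1]. The helper name `baf_mode_sd` is illustrative.

```python
import math


def baf_mode_sd(bafs, het_modes):
    """Root-mean-square distance of BAFs from the nearest expected cluster.

    Homozygous clusters at 0 and 1 are always included; het_modes lists the
    expected heterozygous BAF positions for the copy-number model.
    Requires at least two BAF values because of the n - 1 denominator.
    """
    modes = (0.0, 1.0) + tuple(het_modes)
    n = len(bafs)
    total = sum(min(abs(x - m) for m in modes) ** 2 for x in bafs)
    return math.sqrt(total / (n - 1))


def b2_sd(bafs):
    """Deviation from the two-copy pattern (heterozygotes near 0.5)."""
    return baf_mode_sd(bafs, (0.5,))


def b3_sd(bafs):
    """Deviation from the three-copy pattern (heterozygotes near 0.33 and 0.67)."""
    return baf_mode_sd(bafs, (0.33, 0.67))


# Idealized toy segments.
two_copy = [0.0, 1.0, 0.5, 0.5, 0.0, 0.5]
three_copy = [0.0, 0.33, 0.67, 1.0, 0.67, 0.33]
```

A segment whose BAFs fit the two-copy pattern gives b2.sd < b3.sd, and a three-copy segment gives the reverse, which is the comparison the thresholds rely on.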
Different LRR cutoffs were used for the X chromosome. For
males, X chromosome thresholds of −2 and 0.1 were used for
hemizygous deletions and duplications, respectively. For females,
X chromosome thresholds of −1.5, −0.1, and 0.6 were used for
homozygous deletions, heterozygous deletions, and duplications,
respectively. Female X duplications and homozygous deletions
were also required to have b2.sd ≥ b3.sd. For females, calling a
segment a heterozygous deletion also required that the percentage
of SNPs with BAFs between 0.4 and 0.6 in the segment be ≤4%.
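A sketch of how the X-chromosome rules could be applied to one segment is given below. It reads the thresholds as −2 and 0.1 for males and −1.5, −0.1, and 0.6 for females. The direction of each comparison (mean LRR at or below a deletion threshold, at or above a duplication threshold) and the order in which the rules are tested are assumptions, as is the function name.

```python
def classify_x_segment(sex, mean_lrr, b2_sd, b3_sd, pct_baf_mid):
    """Classify an X-chromosome segment from its summary statistics.

    sex: "M" or "F"; mean_lrr: mean Log R ratio of the segment;
    pct_baf_mid: percentage of SNPs with BAF between 0.4 and 0.6.
    Returns a label, or None if no rule fires.
    Comparison directions are assumed, not taken from the source.
    """
    if sex == "M":
        if mean_lrr <= -2:
            return "hemizygous deletion"
        if mean_lrr >= 0.1:
            return "duplication"
        return None
    # Female rules, tested from the most to the least extreme deletion.
    if mean_lrr <= -1.5 and b2_sd >= b3_sd:
        return "homozygous deletion"
    if mean_lrr <= -0.1 and pct_baf_mid <= 4:
        return "heterozygous deletion"
    if mean_lrr >= 0.6 and b2_sd >= b3_sd:
        return "duplication"
    return None


male_del = classify_x_segment("M", -2.5, 0.0, 0.0, 0.0)
female_het_del = classify_x_segment("F", -0.3, 0.1, 0.2, 1.0)
female_dup = classify_x_segment("F", 0.7, 0.2, 0.1, 50.0)
female_rejected = classify_x_segment("F", 0.7, 0.1, 0.2, 50.0)
```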
CNV validation was conducted by a combination of experimental
methods (experimental details are available in Supplemental
Methods). Briefly, cross-platform validation was performed on 112
HapMap samples to provide an unbiased assessment of the accu-
racy and robustness of our computational methods. Illumina
HumanHap550K genotypes of these HapMap samples were
obtained from Illumina and analyzed with our computational
methods. Affymetrix 6.0 genotyping data sets from these same
HapMap samples were obtained from Affymetrix and analyzed for
CNVs using a commercial software package (Partek Genomics
Suite; Partek Incorporated; Supplemental Table 11). Quantitative
PCR was used to validate a representative sample of nonunique
CNVs containing fewer than 10 SNPs (Supplemental Table 12).
Finally, CNV calls made by our method were compared to those
end-sequence pairs in a recently published study by Kidd and
Data availability and access
The CNV data reported here are available at http://cnv.chop.edu.
These data are also available in the Database of Genomic Variants
Figure 4. Copy number variation within 15q11.2. Nonunique CNVs detected in our control data set that map within 15q11.2 (chr15:20,300,000–
20,800,000, hg17, NCBI build 35) are shown as custom tracks within the UCSC Genome Browser (http://genome.ucsc.edu/). (Red rectangles) Deletions;
(blue rectangles) duplications; (green rectangle) the CNV reported by Murthy et al. (2007). The UCSC known genes and segmental duplication tracks
are also shown.
[Table 4. Thresholds used for autosomal chromosomes. Columns: Type of CNV; Mean LRRs; Percentage of SNPs with BAFs between 0.6 and 0.4; b2.sd and b3.sd (two rows require b2.sd ≥ b3.sd).]
Copy number variation mapping in healthy controls
will be available in dbGaP under accession phs000199.v1.p1.
This work was supported in part by NIH grant GM081519 (to
T.H.S.), Pennsylvania Department of Health grant SAP4100037707
(to P.S.W.), a Developmental Research Award from the Cotswold
Foundation (to H.H. and S.F.G.), funds from the David Lawrence
Altschuler Chair in Genomics and Computational Biology (to
P.S.W.), and Institutional Awards to the Center for Applied
Genomics (to H.H.) and the Center for Biomedical Informatics (to
P.S.W.) from the Children’s Hospital of Philadelphia. We thank all
participating subjects and families for making this study possible.
Alexandre Belisle, Alejandrina Estevez, Kenya Fain, Rosalie Frechette,
Alexandria Thomas, and LaShea Williams provided expert
assistance with data collection and management. We also acknowledge
Allen Ladd and Peter Witzleb of CHOP and Smari
Kristinsson, Larus Arni Hermannsson, and Asbjörn Krisbjörnsson
of Raförninn ehf for informatics support. The Children’s Hospital
of Philadelphia Institutional Review Board has approved this
Albertson DG, Pinkel D. 2003. Genomic microarrays in human genetic
disease and cancer. Hum Mol Genet 12: R145–R152.
Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK. 2006. A high-
resolution survey of deletion polymorphism in the human genome. Nat
Genet 38: 75–81.
Elia J, Gai X, Xie HM, Perin JC, Geiger E, Glessner JT, D’arcy M, deBerardinis
R, Frackelton E, Kim C, et al. 2009. Rare structural variants found in
attention-deficit hyperactivity disorder are preferentially associated
with neurodevelopmental genes. Mol Psychiatry 14: doi: 10.1038/mp.
Feuk L, Carson AR, Scherer SW. 2006. Structural variation in the human
genome. Nat Rev Genet 7: 85–97.
H, Jones KW, Tyler-Smith C, Hurles ME, et al. 2006. Copy number
variation: New insights in genome diversity. Genome Res 16: 949–961.
Gunderson KL, Steemers FJ, Lee G, Mendoza LG, Chee MS. 2005. A genome-
wide scalable SNP genotyping assay using microarray technology. Nat
Genet 37: 549–554.
Hakonarson H, Grant SF, Bradfield JP, Marchand L, Kim CE, Glessner JT,
Grabs R, Casalunovo T, Taback SP, Frackelton EC, et al. 2007. A genome-
wide association study identifies KIAA0350 as a type 1 diabetes gene.
Nature 448: 591–594.
Hinds DA, Kloek AP, Jen M, Chen X, Frazer KA. 2006. Common deletions
and SNPs are in linkage disequilibrium in the human genome. Nat Genet
Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW,
Lee C. 2004. Detection of large-scale variation in the human genome.
Nat Genet 36: 949–951.
The International HapMap Consortium. 2003. The International HapMap
Project. Nature 426: 789–796.
The International HapMap Consortium. 2007. A second generation human
haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
International Schizophrenia Consortium. 2008. Rare chromosomal
deletions and duplications increase risk of schizophrenia. Nature 455:
Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T,
Hansen N, Teague B, Alkan C, Antonacci F, et al. 2008. Mapping and
sequencing of structural variation from eight human genomes. Nature
Lupski JR. 2007. Genomic rearrangements and sporadic disease. Nat Genet
Lupski JR, Stankiewicz P. 2005. Genomic disorders: Molecular mechanisms
for rearrangements and conveyed phenotypes. PLoS Genet 1: e49. doi:
McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC,
Dallaire S, Gabriel SB, Lee C, Daly MJ, The International HapMap
Consortium, et al. 2006. Common deletion polymorphisms in the
human genome. Nat Genet 38: 86–92.
Murthy SK, Nguyen AOH, El Shakankiry HM, Schouten JP, Al Khayat AI,
Ridha A, Al Ali MT. 2007. Detection of a novel familial deletion of
four genes between BP1 and BP2 of the Prader-Willi/Angelman
disorder and speech impairment. Cytogenet Genome Res 116: 135–140.
Scheffer A, Steinfeld I, Tsang P, Yamada NA, et al. 2008. The fine-scale
and complex architecture of human copy-number variation. Am J Hum
Genet 82: 685–695.
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H,
Shapero MH, Carson AR, Chen W, et al. 2006. Global variation in copy
number in the human genome. Nature 444: 444–454.
Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa
H, Walker M, Chi M, et al. 2004. Large-scale copy number
polymorphism in the human genome. Science 305: 525–528.
Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B,
Yoon S, Krasnitz A, Kendall J, et al. 2007. Strong association of de novo
copy number mutations with autism. Science 316: 445–449.
Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM,
copy-number variation in the human genome. Am J Hum Genet 77:
Sharp AJ, Mefford HC, Li K, Baker C, Skinner C, Stevenson RE, Schroer RJ,
Novara F, De Gregori M, Ciccone R, et al. 2008. A recurrent 15q13.3
microdeletion syndrome associated with mental retardation and
seizures. Nat Genet 40: 322–328.
Steemers FJ, Chang W, Lee G, Barker DL, Shen R, Gunderson KL. 2006.
Whole-genome genotyping with the single-base extension assay. Nat
Methods 3: 31–33.
Stefansson H, Rujescu D, Cichon S, Pietilainen OP, Ingason A, Steinberg S,
Fossdal R, Sigurdsson E, Sigmundsson T, Buizer-Voskamp JE, et al. 2008.
Large recurrent microdeletions associated with schizophrenia. Nature
Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper
GM, Nord AS, Kusenda M, Malhotra D, Bhandari A, et al. 2008. Rare
structural variants disrupt multiple genes in neurodevelopmental
pathways in schizophrenia. Science 320: 539–543.
Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE,
MacAulay C, Ng RT, Brown CJ, Eichler EE, et al. 2007. A comprehensive
analysis of common copy-number variations in the human genome. Am
J Hum Genet 80: 91–104.
Yang N, Li H, Criswell LA, Gregersen PK, Alarcon-Riquelme ME, Kittles R,
Shigeta R, Silva G, Patel PI, Belmont JW, et al. 2005. Examination of
ancestry and ethnic affiliation using highly informative diallelic DNA
markers: Application to diverse and admixed populations and
implications for clinical epidemiology and forensic medicine. Hum
Genet 118: 382–392.
Xu B, Roos JL, Levy S, van Rensburg EJ, Gogos JA, Karayiorgou M. 2008.
Strong association of de novo copy number mutations with sporadic
schizophrenia. Nat Genet 40: 880–885.
Received December 1, 2008; accepted in revised form June 17, 2009.
Shaikh et al.
1690 Genome Research