Article

CNV Workshop: An integrated platform for high-throughput copy number variation discovery and clinical diagnostics

Center for Biomedical Informatics, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA.
BMC Bioinformatics (Impact Factor: 2.58). 02/2010; 11(1):74. DOI: 10.1186/1471-2105-11-74
Source: PubMed

ABSTRACT

Recent studies have shown that copy number variations (CNVs) are frequent in higher eukaryotes and associated with a substantial portion of inherited and acquired risk for various human diseases. The increasing availability of high-resolution genome surveillance platforms provides opportunity for rapidly assessing research and clinical samples for CNV content, as well as for determining the potential pathogenicity of identified variants. However, few informatics tools for accurate and efficient CNV detection and assessment currently exist.
We developed a suite of software tools and resources (CNV Workshop) for automated, genome-wide CNV detection from a variety of SNP array platforms. CNV Workshop includes three major components: detection, annotation, and presentation of structural variants from genome array data. CNV detection utilizes a robust and genotype-specific extension of the Circular Binary Segmentation algorithm, and the use of additional detection algorithms is supported. Predicted CNVs are captured in a MySQL database that supports cohort-based projects and incorporates a secure user authentication layer and user/admin roles. To assist with determination of pathogenicity, detected CNVs are also annotated automatically for gene content, known disease loci, and gene-based literature references. Results are easily queried, sorted, filtered, and visualized via a web-based presentation layer that includes a GBrowse-based graphical representation of CNV content and relevant public data, integration with the UCSC Genome Browser, and tabular displays of genomic attributes for each CNV.
To our knowledge, CNV Workshop represents the first cohesive and convenient platform for detection, annotation, and assessment of the biological and clinical significance of structural variants. CNV Workshop has been successfully utilized for assessment of genomic variation in healthy individuals and disease cohorts and is an ideal platform for coordinating multiple associated projects.
Available on the web at: http://sourceforge.net/projects/cnv.

SOFTWARE Open Access
CNV Workshop: an integrated platform for high-throughput copy number variation discovery and clinical diagnostics
Xiaowu Gai^1, Juan C Perin^1, Kevin Murphy^2, Ryan O'Hara^1, Monica Darcy^1, Adam Wenocur^1, Hongbo M Xie^1, Eric F Rappaport^3,4, Tamim H Shaikh^4,5, Peter S White^1,2*
Abstract
Background: Recent studies have shown that copy number variations (CNVs) are frequent in higher eukaryotes
and associated with a substantial portion of inherited and acquired risk for various human diseases. The increasing
availability of high-resolution genome surveillance platforms provides opportunity for rapidly assessing research
and clinical samples for CNV content, as well as for determining the potential pathogenicity of identified variants.
However, few informatics tools for accurate and efficient CNV detection and assessment currently exist.
Results: We developed a suite of software tools and resources (CNV Workshop) for automated, genome-wide CNV
detection from a variety of SNP array platforms. CNV Workshop includes three major components: detection,
annotation, and presentation of structural variants from genome array data. CNV detection utilizes a robust and
genotype-specific extension of the Circular Binary Segmentation algorithm, and the use of additional detection
algorithms is supported. Predicted CNVs are captured in a MySQL database that supports cohort-based projects
and incorporates a secure user authentication layer and user/admin roles. To assist with determination of
pathogenicity, detected CNVs are also annotated automatically for gene content, known disease loci, and gene-
based literature references. Results are easily queried, sorted, filtered, and visualized via a web-based presentation
layer that includes a GBrowse-based graphical representation of CNV content and relevant public data, integration
with the UCSC Genome Browser, and tabular displays of genomic attributes for each CNV.
Conclusions: To our knowledge, CNV Workshop represents the first cohesive and convenient platform for
detection, annotation, and assessment of the biological and clinical significance of structural variants. CNV
Workshop has been successfully utilized for assessment of genomic variation in healthy individuals and disease
cohorts and is an ideal platform for coordinating multiple associated projects.
Availability and Implementation: Available on the web at: http://sourceforge.net/projects/cnv
Background
Genome copy number changes (copy number variations,
or CNVs) include inherited, de novo, and somatically
acquired deviations from a diploid state within a parti-
cular chromosome segment. CNVs likely contribute sub-
stantially to inherited and/or acquired risk for a variety
of human diseases, including cancer and neuropsychia-
tric disorders [1,2]. In addition, CNVs are widely
distributed in the genomes of apparently healthy individuals
and thus constitute significant amounts of population-based
genomic variation [3-8]. New genotyping
technologies such as SNP-based arrays provide high-resolution
coverage of entire genomes as well as an
opportunity for rapidly determining CNV content in
sample collections of interest [4,6,7,9-11]. Accordingly,
numerous recent studies have described constellations
of structural variants in various healthy and disease
cohorts [1,2,12,13]. However, interpretation of the exact
extent, character, distribution, and effect of these CNVs
has been limited by the emerging nature of
* Correspondence: white@genome.chop.edu
Contributed equally
1 Center for Biomedical Informatics, The Children's Hospital of Philadelphia,
Philadelphia, PA, 19104, USA
Gai et al. BMC Bioinformatics 2010, 11:74
http://www.biomedcentral.com/1471-2105/11/74
© 2010 Gai et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
computational methods for accurate detection, and
further challenged by the difficulty in assessing the biological
importance of particular CNVs in context with
other genomic features and study findings.
Detection of CNVs in high-density SNP arrays requires
genotypes that yield high quality intensity and, optimally,
allelic ratio data for each locus surveyed. A number of
algorithms have been utilized for the detection of CNVs
from such genotyping data sets. Software from array vendors
such as Illumina and Affymetrix provides basic CNV
calls along with graphical interfaces that allow visual
inspection of a region of interest. However, these tools
generally lack the ability to quickly manage, annotate,
and assess CNVs from a sizable number of samples.
Moreover, visual inspection becomes challenging for
interpreting small or complex rearrangements, or CNVs
predicted from genome array data of marginal quality. A
number of third-party commercial and open-source algorithms,
including QuantiSNP [14] and PennCNV [15],
employ Hidden Markov Models
[16] to predict CNVs, and these approaches have been
developed and adopted for a number of recent genome-
wide studies of structural variation. Equally promising
are segmentation algorithms such as GLAD [17] and Cir-
cular Binary Segmentation (CBS) [18] that have been suc-
cessfully applied for analysis of data from array-based
comparative genomic hybridization (aCGH) platforms.
These segmentation approaches are particularly attractive
as they have been shown to outperform certain HMM-based
approaches for aCGH data [19,20]. Regardless of
the approach, these algorithms typically overcall CNV
events [12,15,21,22], thus requiring post-prediction
methods that consider data quality metrics for distinguishing
true events from false positives. Currently,
researchers interested in analyzing genotypes for CNV
content for the first time, or in setting up production systems
for high-throughput analysis and interpretation, are
challenged by the considerable variety and limited scope
of most existing methods and tools. This is especially
true in the use of SNP arrays for clinical diagnostic applications,
where reliability and performance are of critical
importance.
At the same time, assessing the importance of particu-
lar CNVs in context with other genomic features and
study findings is a complex task even without robust
quality assessment of CNV predictions, especially given
limited current knowledge of the distributions of CNVs
across the genome and in populations. Contextual geno-
mic and phenotypic annotations need to be considered,
while projects involving sizable cohorts also require an
infrastructure for managing, accessing, batch-processing,
and visualizing annotated CNV predictions.
To address these challenges, we describe the inte-
grated platform CNV Workshop. This package
incorporates a modified segmentation algorithm that we
have previously applied successfully for detecting patho-
genic CNVs in large-scale research and clinical projects
[12,13]. CNV Workshop includes a database layer, role-based
security and authentication schemes suitable for
clinical diagnostic environments, a web-based presenta-
tion layer providing textual and graphical visualization
of CNV predictions, and integration of CNV content
with known genomic and biomedical annotations for
rapidly determining the significance of a particular
CNV. These components are modular yet seamlessly
integrated and together provide an effective platform for
identification of high-throughput copy number variation;
discovery of inherited, de novo, and somatically acquired
pathogenic variants; and clinical diagnostics.
Implementation
Approach
Conceptually, CNV detection from genotyping data sets
consists of two major steps: 1) segmenting chromo-
some-arrayed genotypes into discrete regions, with
probes in each region presenting different signal inten-
sity patterns than adjacent regions; and 2) labeling parti-
cular segments that are inherently different in copy
number from expected. To accurately predict CNV
events, an algorithm requires sufficient sensitivity to dis-
tinguish true chromosome copy number state changes
from local signal fluctuations.
For aCGH data, these algorithms rely solely upon nor-
malized probe signal intensities, typically log2 ratios of
intensity, for segment delineation. Examples include the
GLAD and CBS algorithms [23,24]. Genotyping arrays
provide an additional useful metric, the allelic ratio,
which can be utilized for assessing the copy number
state of each segment. Allelic ratio is a measure of the
relative signal intensities for probes measuring each of
the two alleles at a SNP locus. Besides overall signal
intensity at a SNP locus, allelic ratios of a region of true
copy number change should present a pattern consis-
tently different from a diploid region. For these reasons,
we devised a generic three-step CNV detection methodology
that can be applied to all genotyping platforms,
with only slight variations required to address platform-
specific properties: segmentation, calculation of geno-
type-specific statistics, and CNV determination. We
describe here our implementation and modifications to
CBS first for the Illumina Infinium array platform and
then modifications required for use with Affymetrix,
other SNP, and aCGH arrays.
Segmentation
In the Illumina Infinium assay, two different probe sets are
used to measure the presence of the two different alleles
for a given SNP. Allele-specific signal intensities are first
normalized into R_subject values. R_expected values are then
calculated through linear interpolation of the R values for
each canonical cluster; the log2 ratio of R_subject and
R_expected is named the Log R Ratio (LRR) [25]. LRRs above
or below zero indicate possible duplication or deletion at a
locus, with the degree of deviation correlating with the
likelihood of a copy number change. To identify segments
of adjacent loci (SNPs) that display an overall LRR pattern
consistently distinct from neighboring segments, CNV
Workshop implements the segmentation algorithm CBS
as its default detection method. However, other segmenta-
tion algorithms can be used in place of CBS with only
minor source code modifications.
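The LRR definition above can be illustrated with a minimal sketch that computes LRR = log2(R_subject / R_expected) for a single SNP. The use of a normalized angle `theta` as the interpolation coordinate, the cluster positions, and all function names are assumptions made for illustration; they are not taken from CNV Workshop or from vendor software.

```python
import math

def expected_r(theta, clusters):
    """Linearly interpolate R between canonical genotype clusters.

    clusters: list of (theta, R) pairs for the AA, AB, and BB clusters,
    sorted by theta (hypothetical values supplied by the caller).
    """
    # Clamp to the outermost clusters when theta falls outside their range.
    if theta <= clusters[0][0]:
        return clusters[0][1]
    if theta >= clusters[-1][0]:
        return clusters[-1][1]
    for (t0, r0), (t1, r1) in zip(clusters, clusters[1:]):
        if t0 <= theta <= t1:
            frac = (theta - t0) / (t1 - t0)
            return r0 + frac * (r1 - r0)

def log_r_ratio(r_subject, theta, clusters):
    """LRR = log2(R_subject / R_expected)."""
    return math.log2(r_subject / expected_r(theta, clusters))

# Illustrative cluster centers only: (theta, R) for AA, AB, BB.
clusters = [(0.0, 1.0), (0.5, 1.5), (1.0, 1.0)]
print(log_r_ratio(1.25, 0.25, clusters))   # intensity matches expectation
print(log_r_ratio(0.625, 0.25, clusters))  # half the expected intensity
```

In this sketch a SNP whose intensity equals the interpolated expectation has an LRR of zero, while a SNP at half the expected intensity has an LRR of -1, in line with the interpretation of negative LRRs as possible deletions.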
Additional statistics
After the segmentation step, additional LRR and allelic
ratio (B-allele frequency, or BAF) statistics are then cal-
culated for each segment, which are critical for ensuring
high quality CNV determinations. The results are then
stored in a MySQL database along with the chromosomal
coordinates of each segment. For LRRs, two simple statistics
are calculated: standard deviation of LRRs by sample
and by chromosome, and mean LRR for each chromo-
some and each segment. Similarly for BAF, three statis-
tics are devised and calculated for each segment:
percentage of SNPs with BAFs between 0.4 and 0.6, b2.sd
(Equation 1), and b3.sd (Equation 2). These statistics are
designed as straightforward measures of the distribution
pattern of the BAFs in a segment. For b2.sd and b3.sd, X_i
represents the BAF for the ith SNP of the segment and n
represents the total number of SNPs in the segment.
$$\mathrm{b2.sd} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\min\left(X_i - 0,\; 1 - X_i,\; |X_i - 0.5|\right)^{2}} \tag{1}$$
$$\mathrm{b3.sd} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\min\left(X_i - 0,\; 1 - X_i,\; |X_i - 0.67|,\; |X_i - 0.33|\right)^{2}} \tag{2}$$
For b2.sd, the constants chosen for the equations are
expected BAF values for SNPs in a normal diploid seg-
ment for the homozygous AA alleles (0), homozygous
BB alleles (1), and heterozygous AB alleles (0.5). For the
b3.sd, the constants are the expected values of SNP
BAFs in a monoallelic duplication: AAA (0), BBB (1),
ABB (0.67), and AAB (0.33) [25]. In this way, b2.sd is
expected to be significantly smaller than b3.sd when the
segment is truly diploid, and the opposite is expected
when the segment is a duplicated or amplified region.
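The behavior of the two BAF statistics can be checked with a short sketch. It assumes that each statistic is the square root of the summed squared distance from every BAF to its nearest expected value, divided by n - 1; the function names and the toy BAF values are hypothetical.

```python
import math

def _bsd(bafs, centers):
    """Root of the summed squared distance to the nearest center, over n - 1."""
    n = len(bafs)
    if n < 2:
        raise ValueError("need at least two SNPs in the segment")
    total = sum(min(abs(x - c) for c in centers) ** 2 for x in bafs)
    return math.sqrt(total / (n - 1))

def b2_sd(bafs):
    # Expected BAFs in a diploid segment: AA (0), AB (0.5), BB (1).
    return _bsd(bafs, (0.0, 0.5, 1.0))

def b3_sd(bafs):
    # Expected BAFs in a monoallelic duplication: AAA, AAB, ABB, BBB.
    return _bsd(bafs, (0.0, 0.33, 0.67, 1.0))

# Toy segments (illustrative values, not real data).
diploid = [0.0, 0.02, 0.48, 0.5, 0.52, 0.98, 1.0]
duplicated = [0.0, 0.01, 0.32, 0.34, 0.66, 0.68, 1.0]

print(b2_sd(diploid) < b3_sd(diploid))        # diploid pattern fits b2 better
print(b3_sd(duplicated) < b2_sd(duplicated))  # duplication pattern fits b3 better
```

For BAFs within [0, 1], the absolute distances to 0 and 1 used here are the same as the signed terms X_i - 0 and 1 - X_i.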
CNV determination
The likelihood of a segment being predicted as a CNV is
determined by many attributes of the segment, especially
the BAF statistics, the mean LRR, and the number
of SNPs. Although copy number determination can be
performed directly after the segmentation step, delay of
the copy number calling step affords greater flexibility.
Subsequent use of a modified set of criteria for calling
copy number changes will not require the repeat of the
segmentation step, which is much improved but still
computationally intensive with the current implementation
of the CBS algorithm. However, the threshold values used
might need to be changed based on the goal of the analysis
and the nature of the samples of interest. We have
previously reported CNV Workshop threshold values for
calling germline heterozygous deletions, homozygous
deletions, and duplications from Illumina 550 K data
that we found effective for a wide range of samples and
genotype quality scores [12,13]. An effective way for
learning a new platform and developing appropriate
threshold values is by taking advantage of the widely
available and validated CNV contents of the 270 HapMap
samples, as well as the genotyping data of these
same samples from different platform vendors. HapMap
data for a new platform can first be processed with
CNV Workshop. Thresholds that provide desirable or
acceptable Type I and Type II error rates can then be
obtained by comparing calls derived using different
thresholds for known variants. Using this process, we
have adapted the algorithm for a number of genotyping
platforms, including Illumina 610-Quad, 660-Quad, and
Affymetrix 6.0 arrays (thresholds available at:
http://cnv.sourceforge.net/).
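Because segment statistics are stored separately from copy number calls, calls can be regenerated with new criteria without re-running segmentation. The sketch below illustrates that idea with a simple rule-based caller. Every threshold value and field name here is a hypothetical placeholder and not one of the published CNV Workshop thresholds; the rule that a duplication should show b3.sd smaller than b2.sd follows the description of those statistics above.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    mean_lrr: float
    n_snps: int
    b2_sd: float
    b3_sd: float
    pct_het: float  # fraction of SNPs with BAF between 0.4 and 0.6

# Hypothetical placeholder thresholds, NOT the published CNV Workshop values.
THRESHOLDS = {
    "min_snps": 5,
    "hom_del_lrr": -1.5,
    "het_del_lrr": -0.3,
    "dup_lrr": 0.15,
    "max_het_in_del": 0.05,
}

def call_segment(seg, t=THRESHOLDS):
    """Label one stored segment using only its precomputed statistics."""
    if seg.n_snps < t["min_snps"]:
        return "no_call"
    if seg.mean_lrr <= t["hom_del_lrr"]:
        return "homozygous_deletion"
    # A heterozygous deletion should contain essentially no heterozygous SNPs.
    if seg.mean_lrr <= t["het_del_lrr"] and seg.pct_het <= t["max_het_in_del"]:
        return "heterozygous_deletion"
    # A duplication should fit the three-copy BAF pattern better than diploid.
    if seg.mean_lrr >= t["dup_lrr"] and seg.b3_sd < seg.b2_sd:
        return "duplication"
    return "diploid"

segments = [
    Segment(-0.5, 20, 0.02, 0.02, 0.0),
    Segment(0.3, 20, 0.15, 0.02, 0.0),
    Segment(0.0, 20, 0.02, 0.15, 0.3),
]
print([call_segment(s) for s in segments])

# Re-calling with a relaxed size cutoff requires no new segmentation.
relaxed = dict(THRESHOLDS, min_snps=3)
print(call_segment(Segment(-2.0, 3, 0.0, 0.0, 0.0), relaxed))
```

Sweeping such a threshold dictionary over samples with validated CNVs, such as the HapMap samples, is one way to compare error rates across candidate settings.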
A number of variables in addition to the array platform,
including the particular samples and the reference
group used for allele calling, may influence the set of
parameters that will function optimally in a given set-
ting. For example, tumor samples are often hyper- or
hypodiploid across a genome or certain chromosomes.
Commonly employed global normalization algorithms
often assume that the majority of probe intensities
should remain at a diploid state and do not incorporate
a priori methods for inferring degree of aneuploidy.
CNV Workshop provides a convenient mechanism for
determining the existence and degree of hyper- or hypo-
diploidy. As b2.sd and b3.sd statistics are calculated for
each chromosome, highly skewed chromosome-specific
b2.sd/b3.sd ratios indicate chromosome-level aneu-
ploidy. However, normalized values of a hyper- or hypo-
diploid sample are also skewed due to global
normalization; thus, detection of copy number changes
at higher resolution requires cutoff mean intensity
values of a segment to be adjusted accordingly for
tumor samples. We advise users to experiment with
these parameters as appropriate. To assist with this process,
an advanced search function has been included
in CNV Workshop for adjusting these parameters. In
addition to these criteria, segments can be queried
based on physical size and number of SNPs.
Affymetrix arrays
Affymetrix genotyping arrays are widely used for copy
number variation detection. Similar to Illumina's LRR
metric, the Affymetrix Genotyping Console application,
as well as commercial packages such as Partek Genomics
Suite, calculate log-transformed ratios (Log2 ratios)
of summarized probe intensities for a SNP of a given
sample, as compared to the same intensities measured
in control samples [26]. Additionally, these packages
provide allelic ratios comparable to BAFs. We have successfully
used CNV Workshop to analyze Affymetrix
array data by substituting Log R Ratios and B Allele Fre-
quencies with normalized log2 ratios and allelic ratios
derived from Partek Genomics Suite. As log2 ratios
exhibit greater variance than LRRs and vary across dif-
ferent Affymetrix platforms, different threshold values
are required. Certain Affymetrix platforms such as the
6.0 platform incorporate non-SNP copy number probes
in addition to SNP probes. While the data from these
intensity-only probes is less reliable for CNV determina-
tion, the added advantage of increased resolution may
be desirable for certain applications. Inclusion of intensity-only
probe data is enabled by uploading an additional
text file containing intensity values for these
probes.
aCGH and other SNP platforms
CNV Workshop can also be adapted for use with aCGH
and other SNP platforms. For aCGH platforms, normalized
probe signal intensities, which are typically transformed
as log2 ratios of probe intensities of a sample
versus controls, are the only available metric for assessing
relative copy number. After the segmentation step,
the likelihood of a given segment representing a true
copy number loss or gain is proportional to the segment
mean signal intensity relative to other segments from
the same chromosome, or across the entire genome. For
this reason, CNV Workshop calculates and stores all
probe and segment means, medians, and standard devia-
tions. This information, even in the absence of allelic
ratio data, can be used to establish a dynamic yet robust
threshold for aCGH data. For example, a segment with
mean signal intensity deviating by three standard devia-
tions from the mean signal intensity of all segments is
likely to indicate true gain or loss.
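The three-standard-deviation example can be sketched as follows. This is a minimal illustration that assumes segment means have already been computed; the function name, the default cutoff parameter, and the toy values are hypothetical.

```python
import statistics

def flag_outlier_segments(segment_means, n_sd=3.0):
    """Label each segment as gain, loss, or neutral.

    A segment is flagged when its mean deviates from the mean of all
    segment means by at least n_sd standard deviations.
    """
    mu = statistics.mean(segment_means)
    sd = statistics.stdev(segment_means)
    labels = []
    for m in segment_means:
        if sd == 0 or abs(m - mu) < n_sd * sd:
            labels.append("neutral")
        elif m > mu:
            labels.append("gain")
        else:
            labels.append("loss")
    return labels

# Toy log2 ratio segment means: twenty near zero, one strongly negative.
means = [0.01, -0.01] * 10 + [-1.0]
print(flag_outlier_segments(means)[-1])
```

Because an extreme segment also inflates the standard deviation, a rule of this kind only becomes informative when a reasonable number of segments is available, which is one reason the threshold is described as dynamic rather than fixed.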
Algorithm performance
Direct comparison of CNV detection algorithms is chal-
lenging in the absence of a sizable evaluation standard.
However, to provide a general measure of algorithm
performance, we compared results from CNV Workshop
with PennCNV, a commonly used, HMM-based
CNV detection algorithm. A set of 112 unique HapMap
samples genotyped on the Illumina 550 kv1 platform
was analyzed for CNV content using default settings
and threshold values for each algorithm (Figure 1).
Overall, CNV Workshop and PennCNV were generally
concordant (77.5% and 69.0%, respectively). Concordance
rate increased substantially as a function of CNV
size, but considerable discordance was observed espe-
cially for CNVs spanning <5 SNPs. These results indi-
cate some combined contribution of Type II error from
CNV Workshop and Type I error from PennCNV for
smaller predicted CNVs. Notably, the number of CNVs
predicted by PennCNV per sample was inversely proportional
to sample-wide LRR standard deviation, but
this trend was not observed for CNV Workshop within
LRR standard deviation ranges we consider acceptable
for analysis (<0.35).
Architecture
CNV Workshop builds upon a number of open source
applications and libraries. The major components of
CNV Workshop are a set of scripts for processing the
genotyping data, a set of scripts for predicting copy number
variations and subsequently annotating each CNV, a
MySQL database server, a web server hosting a custom
instance of the GMOD Generic Genome Browser [27]
via CGI, and an Apache Tomcat server hosting the Java-
based CNV Workshop web application. These compo-
nents may reside on the same or different physical com-
puters running either Windows or UNIX-based
operating systems such as Linux and Macintosh OS X.
As such, the application is well suited to support a set
of investigators and projects distributed across an orga-
nization or multi-site collaboration. CNV Workshop is
best administered by bioinformaticists or computer system
administrators on behalf of biologists. However, we
also make available a pre-installed virtual machine
(Linux CentOS 5.4) to ease installation for those with a
powerful computer and virtualization software such as
Parallels, VMWare, VirtualBox, or Xen.
Data processing and management
Raw genotyping data are first processed with an R script
that automatically segments based on the SNP intensity
data, calculates additional statistics, and subsequently
inserts the segment information and these statistics into
a MySQL database. In our setting, segmentation of data
from a single Illumina 550 k array, using an Intel Xeon
3.16 GHz server running CentOS 5 with 16 GB of RAM,
required 18 minutes and less than 1 GB RAM on a single
CPU core. A Perl script then analyzes the segment
data files, predicts CNVs, and populates the database
with CNV calls. Alternatively, CNV predictions using
custom parameter values can be made on-the-fly for
specified datasets via the advanced search function in
the web application. CNV data sets established by a user
are then made visible via the CNV Workshop web appli-
cation. The database supports the ability to view and
manipulate CNVs at the event, sample, and sample
cohort levels.
CNV annotations
CNVs are automatically annotated for genotyping and
genome-derived metadata, including CNV type (e.g.,
deletion or duplication), number of SNPs in an event,
genomic sequence position, and quality metrics. A data-
base parameter specifies genome build version such that
annotations reflected in CNV Workshop are accurate
with respect to build. The default value is build hg18 as
most array platforms are currently based on this build.
Additional automated annotations include gene content,
association with known disease loci or genes, and over-
lapping public CNVs from the Database of Genomic
Variants (DGV) [4]. This is accomplished by running
programs that query remote data sources such as DGV
and UCSC Known Genes, certain of which are cached
locally for performance reasons. CNV Workshop also
comes pre-loaded with the CHOP CNV collection from
2,026 healthy controls [12]. An optional custom track is
reserved such that a set of customized annotations can
be readily incorporated. To facilitate this function, the
site administrator is able to load into the database a
mapping of annotation labels to loci. These labels are
displayed in both the graphical and tabular presentations
for CNVs that overlap the annotated loci. For instance,
the custom track might be used to flag CNVs that overlap
genes in a specific pathway or are associated with a
disease of particular interest to the research group.
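The core operation behind gene, disease locus, and custom track annotation is an interval overlap test between a CNV and a set of labeled loci. The following is a minimal sketch of that operation, assuming closed coordinate intervals on a single genome build; the function names, coordinates, and labels are hypothetical and do not reflect the CNV Workshop scripts or database schema.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def annotate(cnv, features):
    """Return the labels of all features that overlap a CNV.

    cnv: (chrom, start, end)
    features: iterable of (chrom, start, end, label)
    """
    chrom, start, end = cnv
    return [
        label
        for f_chrom, f_start, f_end, label in features
        if f_chrom == chrom and overlaps(start, end, f_start, f_end)
    ]

# Hypothetical custom track: annotation labels mapped to loci.
custom_track = [
    ("chr16", 20_350_000, 20_400_000, "pathway_gene_A"),
    ("chr16", 21_000_000, 21_050_000, "pathway_gene_B"),
    ("chr2", 20_350_000, 20_400_000, "pathway_gene_C"),
]

cnv = ("chr16", 20_300_000, 20_610_000)
print(annotate(cnv, custom_track))
```

A linear scan like this is adequate for a small custom track; for genome-wide annotation sets, an indexed database query or an interval tree would be a more natural choice.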
Administration
Analysis and loading of data sets into the database,
along with creation and updating of local mirrors of
annotation sources, is accomplished by the execution of
programs on the command line. Through the Admin
tab of the web application, an administrator can assign
role-based privileges so that access to a data set is
restricted to a group of users. This function is con-
trolled by the creation, deletion, and modification of
three entities: users, groups, and data sets. Users have
three attributes: email address, password, and group
membership. Groups are essentially many-to-many map-
pings between users and data sets. Finally,
Figure 1 Comparison of CNV Workshop and PennCNV variant prediction sets. Depicted is a composite graph showing the fraction of CNVs
predicted solely by PennCNV, CNV Workshop, and both algorithms. Each column indicates the fraction of predicted CNVs for a certain size range,
and for all CNVs combined (leftmost column).
administrators use the Admin interface to provision data
sets that have previously been loaded into the database.
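The relationship among users, groups, and data sets can be sketched with two mappings, in which a user sees a data set only if one of their groups includes it. This is an illustrative model under that assumption; the names and example entries are hypothetical and are not the CNV Workshop database tables.

```python
def visible_data_sets(email, memberships, group_data_sets):
    """Return the data sets a user may see through any of their groups.

    memberships: dict mapping user email -> set of group names
    group_data_sets: dict mapping group name -> set of data set names
    """
    visible = set()
    for group in memberships.get(email, set()):
        visible |= group_data_sets.get(group, set())
    return visible

# Hypothetical users, groups, and data sets.
memberships = {
    "analyst@example.org": {"autism_project"},
    "director@example.org": {"autism_project", "clinical_lab"},
}
group_data_sets = {
    "autism_project": {"autism_cohort"},
    "clinical_lab": {"clinical_samples", "healthy_controls"},
}

print(sorted(visible_data_sets("director@example.org", memberships, group_data_sets)))
```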
Results
CNV Workshop's web application allows users to flexibly
query CNV data sets, view annotated search results,
mark and save subsets of queries in their accounts, and
download query results in a variety of formats.
Queries
For each data set, users can choose from a basic search
function that queries CNV predictions and annotations
of these CNVs (Figure 2), or an advanced search function
that queries segmented data prior to CNV determination
(Figure 3). For both data types, supported search
parameters include genomic position (chromosome,
cytogenetic band or band range, sequence position
range, or gene name), variation type (duplication, heterozygous
deletion, or homozygous deletion), and CNV
size (base pairs or number of SNPs). For the advanced
search function, additional supported queries include
segment mean, heterozygous allele frequencies, and
copy gain filter, which allows a user to set parameters
for establishing customized CNV detection thresholds.
Presentation
Query results are presented to a user both graphically
and in a tabular format (Figure 4). The graphical image
is rendered via a GMOD Generic Genome Browser
(GBrowse) display. The GBrowse layer presents project-specific
CNVs in one or more regions of interest as a
custom track, along with default tracks for the healthy
control set, DGV content, annotations from the Genetic
Association Database (GAD) [28], and the Known
Genes and cytogenetic band tracks from the UCSC
Genome Browser [29]. Queries that yield results for
multiple, non-overlapping genomic regions are rendered
as separate visualizations, which are viewable by selecting
region-specific views from a drop-down list.
The tabular display generally reiterates the graphical
display but includes additional features of each CNV in
a row-by-row format, including variation type, sample
ID, cytogenetic band, sequence position, number of
SNPs, segment size in base pairs, and segment mean log
R ratio. To facilitate further exploration of particular
CNVs, both the graphical and tabular displays include
links to the external genomic resources DGV, GAD, and
the UCSC Genome Browser. In addition, for CNVs that
overlap genes, annotations and hyperlinks are provided
to corresponding gene content from NCBI Entrez,
Entrez Gene, and MEDLINE-mined literature through
FABLE [30,31].
Saving and downloading
Query results can be downloaded in a variety of formats,
including Excel, CSV, XML, and BED. The BED format
is especially useful as it is compatible with visualization
in the UCSC Genome Browser as well as additional analysis
tools such as Galaxy [32]. CNV Workshop also
supports the concept of persistent, editable clipboards
of previous search results through the MyCNV function.
Users can create multiple clipboards, each of which can
store selections from multiple queries. Clipboards persist
across logins until deleted by the user.
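To show what a BED export of CNV calls might look like, the sketch below formats calls as a 9-column BED track suitable for upload as a UCSC custom track. It assumes that CNV positions are held as 1-based inclusive coordinates, which is an assumption rather than a documented property of CNV Workshop; the function names, color scheme, and sample records are hypothetical.

```python
# itemRgb colors per CNV type (an illustrative choice).
COLORS = {"deletion": "255,0,0", "duplication": "0,0,255"}

def cnv_to_bed(chrom, start, end, name, cnv_type):
    """Format one CNV as a 9-column BED line.

    start and end are assumed to be 1-based inclusive positions;
    BED uses a 0-based start and an exclusive end.
    """
    bed_start = start - 1
    fields = [
        chrom, bed_start, end, name,
        0, ".",              # score and strand are not meaningful here
        bed_start, end,      # thickStart and thickEnd span the whole CNV
        COLORS.get(cnv_type, "0,0,0"),
    ]
    return "\t".join(str(f) for f in fields)

def bed_document(cnvs, track_name="CNVs"):
    """Build a BED custom track, one line per CNV, with a track header."""
    lines = ['track name="%s" itemRgb="On"' % track_name]
    lines.extend(cnv_to_bed(*cnv) for cnv in cnvs)
    return "\n".join(lines) + "\n"

cnvs = [
    ("chr16", 20_300_001, 20_610_000, "sample1_del", "deletion"),
    ("chr2", 1_000_001, 1_250_000, "sample2_dup", "duplication"),
]
print(bed_document(cnvs), end="")
```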
Discussion
Structural variants are increasingly recognized as crucial
contributors to genome diversity and disease risk. While
many studies exploring associations between structural
Figure 2 CNV Workshop query interface. Screenshot of the simple query interface in annotated mode for the CHOP CNV map database.
Annotated mode enables a user to query by sample. Positional queries supported are chromosome(s), cytogenetic band(s), sequence position or
range(s), and gene name(s). Most non-standard gene names are recognized and normalized to HGNC gene symbols. For the CHOP CNV map,
additional searchable fields are CNV type (CNV, CNV region, or CNV block), ethnicity, and uniqueness (unique or non-unique/observed in multiple
unrelated individuals).
Figure 3 Screenshot of the advanced query interface in raw mode for a database of autism samples. Raw mode enables a user to
query by segment mean, minor (AB) allele frequency, and to filter results based upon allelic ratio statistics.
Figure 4 Presentation of query results in CNV Workshop. Depicted are results for a chromosome 16 query of sequence position range
20,300,000-20,610,000 in an autism cohort. Top panel: graphical display. Layers (top to bottom) represent the sequence position (top),
cytogenetic bands, CNVs observed in the autism cohort, CNVs in the CHOP CNV map, CNVs in the Database of Genomic Variants (labels indicate
the study), phenotypes of Genetic Association Database studies, and UCSC Known Genes. All glyphs hyperlink to corresponding database
records. Bottom panel: tabular display for a subset of CNVs. Sortable column headers are colored red or (for current sort order) green. Each row
value colored red denotes a hyperlink to a corresponding external database record. Checkboxes at far left allow a user to save certain CNVs in
the MyCNV clipboard.
variants and individual disorders have recently emerged,
most human diseases with a genetic component have
yet to be systematically investigated. Analysis of new
and existing genotype data generated for association studies
or clinical purposes will require more robust tools
to facilitate these numerous and often large-scale stu-
dies. Accordingly, our design of CNV Workshop
attempted to address both current impediments to rapid
analysis and the need to accommodate a variety of
approaches. An additional objective was to incorporate
features to facilitate workflow, data management, and
data interpretation tasks that are often underappreciated
in CNV studies. Finally, we aimed to create a platform
that was compatible with both discovery and diagnostic
needs.
Current methods for analyzing structural variants are
diverse, including only moderately compatible
approaches for genotyping, CNV calling, and analysis
requirements. This diversity has created challenges for
groups or consortia interested in combining data sets
derived from multiple platforms or analytical
approaches. While CNV Workshop cannot o vercome
these challenges, there are several features that can
assist. First, CNV Workshop supports both Illumina and
Affymetrix SNP arrays that constitute the majority of
current S NP array data, and it can be readily adapted to
other SNP and aCGH platforms with relatively minor
effort. Moreover, data sets with pre-exist ing CNV calls
can be uploaded into CNV Workshop for integrated
annotation, visualization, and cross-comparison. This
feature also provides flexibility for users who wish to
use other detection algorithms, although the CNV
Workshop architecture enables additional algorithms to
be incorporated di rectly into the workflow with modest
effort. Moreover, as CNV calls are locally stored, parti-
cular CNVs or samp les can be qui ckly and conveniently
re-queried or re-analyzed with differing attribute or
parameter settings, especially as new data sets are added
incrementally. F inally, for use in diagnostic settings, we
have in corporated features such as role-based access
control and the ability to view and store CNV content
relative to healthy controls.
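To make the idea of re-querying stored calls concrete, the following is a minimal sketch in Python, used here only for brevity; it is not the CNV Workshop implementation, which is built on Java, R, Perl, and MySQL. The record fields, function names, thresholds, and example values are all hypothetical. It shows a region query repeated with a stricter marker-count setting, followed by a check of each hit against CNVs seen in healthy controls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CNVCall:
    """One stored CNV call; field names are hypothetical, not the CNV Workshop schema."""
    sample: str
    chrom: str
    start: int        # first base of the call (inclusive)
    end: int          # last base of the call (inclusive)
    seg_mean: float   # mean Log R Ratio of the segment
    n_markers: int    # number of SNP markers supporting the call


def query_region(calls, chrom, start, end, min_markers=1, min_abs_mean=0.0):
    """Return calls overlapping chrom:start-end that pass the given settings."""
    return [
        c for c in calls
        if c.chrom == chrom
        and c.start <= end and start <= c.end      # closed-interval overlap test
        and c.n_markers >= min_markers
        and abs(c.seg_mean) >= min_abs_mean
    ]


def seen_in_controls(call, controls, min_fraction=0.5):
    """True if a single control CNV covers at least min_fraction of the call."""
    length = call.end - call.start + 1
    for ctl in controls:
        if ctl.chrom != call.chrom:
            continue
        shared = min(call.end, ctl.end) - max(call.start, ctl.start) + 1
        if shared > 0 and shared / length >= min_fraction:
            return True
    return False


# Invented example data, for illustration only.
calls = [
    CNVCall("S1", "16", 20_350_000, 20_400_000, -0.60, 25),
    CNVCall("S2", "16", 20_600_000, 20_700_000, 0.30, 8),
    CNVCall("S3", "16", 21_000_000, 21_100_000, -0.50, 30),
    CNVCall("S4", "15", 20_350_000, 20_400_000, -0.60, 25),
    CNVCall("S5", "16", 20_500_000, 20_550_000, 0.35, 15),
]
controls = [CNVCall("C1", "16", 20_340_000, 20_380_000, -0.50, 20)]

# The same region queried twice with different parameter settings.
lenient = query_region(calls, "16", 20_300_000, 20_610_000)
strict = query_region(calls, "16", 20_300_000, 20_610_000, min_markers=10)

# Keep only the strict hits that are not substantially covered by a control CNV.
novel = [c for c in strict if not seen_in_controls(c, controls)]
```

Because the calls are held locally, changing a threshold only requires re-running the query; no re-segmentation of the raw array data is needed in this sketch.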
We have predominantly used the CBS algorithm for
the segmentation step, although we have tested all segmentation
algorithms described by Lai and colleagues
[19] on the Log R Ratios of Illumina genotypes. In
terms of sensitivity and specificity, the Lai study found
CBS to be one of the better-performing algorithms.
CBS was also the only algorithm that could consistently
segment chromosomes correctly for all samples
with known CNVs that we tested. Our work also led us to
appreciate the value of post-segmentation analyses
that incorporate quality metrics into the CNV determination
process.
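As an illustration of what a post-segmentation quality step might look like, the sketch below scores segments that some segmentation algorithm (such as CBS) has already produced. It is an assumption-laden example in Python, not the authors' modified CBS process: the noise estimate (scaled median absolute deviation), the z-like score, the thresholds min_markers and min_z, and the Log R Ratio values are all hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import mean, median


def robust_noise(values):
    """Scaled median absolute deviation, a robust stand-in for the standard deviation."""
    centre = median(values)
    return 1.4826 * median([abs(v - centre) for v in values])


def score_segments(lrr, boundaries, min_markers=5, min_z=3.0):
    """Attach simple quality metrics and a call to each segment.

    lrr        : per-marker Log R Ratios along one chromosome
    boundaries : non-empty (start, end) index pairs, end exclusive,
                 as produced by any segmentation algorithm
    """
    noise = robust_noise(lrr)
    if noise == 0:
        raise ValueError("noise estimate is zero; cannot scale segment means")
    results = []
    for start, end in boundaries:
        values = lrr[start:end]
        n = len(values)
        seg_mean = mean(values)
        # Segment mean relative to its expected standard error under pure noise.
        z = seg_mean / (noise / sqrt(n))
        if n < min_markers:
            call = "low_confidence"   # too few markers to trust the segment
        elif z <= -min_z:
            call = "loss"
        elif z >= min_z:
            call = "gain"
        else:
            call = "neutral"
        results.append({"start": start, "end": end, "n": n,
                        "mean": seg_mean, "z": z, "call": call})
    return results


# Invented Log R Ratios: a six-marker drop flanked by copy-neutral markers.
lrr = [0.02, -0.03, 0.01, 0.04, -0.02,
       -0.60, -0.55, -0.65, -0.58, -0.62, -0.60,
       0.00, 0.03, -0.01, 0.02, -0.04]

segments = score_segments(lrr, [(0, 5), (5, 11), (11, 16)])
```

Recording the marker count and the noise-scaled score alongside each call lets a downstream query filter on quality without re-running segmentation.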
We have successfully applied our modified CBS process
to analyze over 15,000 research-derived genotypes
spanning more than a dozen pediatric disorders,
along with nearly 2,000 clinical samples for diagnostic
purposes [12,13]. These efforts have included validation
trials using a variety of experimental approaches. In future
versions of CNV Workshop, we plan to exploit newer detection
methods, possibly including the simultaneous application
of multiple algorithms, as well as approaches that
consider additional genomic features.
Conclusions
As disease-oriented genomic analysis continues to
evolve, large-scale array- and sequence-based studies
will become increasingly possible. This evolution will
likely necessitate more sophisticated analytical, workflow,
and data infrastructure elements. CNV Workshop
provides a first-generation platform for managing many
of the complex tasks required to productively process
and assess structural variation content from high-resolution
genomic array data. Currently, we are formulating
strategies for further accommodating these needs within
the CNV Workshop framework. Possible extensions
include features to more directly allow cross-cohort
comparisons and to assist with clinical diagnostic applications
via automated disease labeling and report generation.
In addition, we are developing features for
viewing regions of homozygosity and labeling potential
mosaic CNVs. Finally, we are exploring methods for
both expert- and machine-ranking of CNVs to assist with the
considerable challenge of assessing pathogenicity for
structural variants in disease settings.
Availability and requirements
Project name: CNV Workshop
Project home page: http://sourceforge.net/projects/cnv
Operating systems: Linux or Mac OS X
Programming languages: Java, R, Perl
Other requirements: Maven 2, Java JDK 6, Perl 5.8.6+, Apache or other web server, Apache Tomcat 6.0, MySQL client and server 4.1 or 5.0, Generic Genome Browser 1.X, R 2.8, GNU Make
License: GNU Affero GPL v3 or any later version
Acknowledgements
This work was supported in part by Pennsylvania Department of Health
grant SAP 4100037707 (PSW), NIH grant GM081519 (THS), and the David
Lawrence Altschuler Endowed Chair Fund (PSW). We would like to thank
members of the Shaikh, Cancer Genetics, Cytogenomics, Deardorff, and
Goldmuntz Laboratories at the Children's Hospital of Philadelphia for design
specification input and testing.
Author details
1 Center for Biomedical Informatics, The Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
2 Division of Oncology, The Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
3 Children's Hospital of Philadelphia Research Institute, Philadelphia, PA, 19104, USA.
4 Department of Pediatrics, University of Pennsylvania School of Medicine, Philadelphia, PA, 19104, USA.
5 Division of Genetics, The Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
Authors' contributions
XG and JCP planned, wrote and tested most of the first iteration of software,
including algorithm development. KM, RO, AW, and MD substantially
contributed code, subsequent aspects of design, testing, and deployment.
HX, EFR, and THS contributed user requirements, functionality input, and
expert testing. PSW oversaw project development. XG, KM, and PSW wrote
and revised the manuscript. All authors read and approved the final
manuscript.
Received: 1 September 2009
Accepted: 4 February 2010
Published: 4 February 2010
References
1. Cook EH Jr, Scherer SW: Copy-number variations associated with
neuropsychiatric conditions. Nature 2008, 455(7215):919-923.
2. Henrichsen CN, Chaignat E, Reymond A: Copy number variants, diseases
and gene expression. Hum Mol Genet 2009, 18(R1):R1-8.
3. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK: A high-
resolution survey of deletion polymorphism in the human genome. Nat
Genet 2006, 38(1):75-81.
4. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW,
Lee C: Detection of large-scale variation in the human genome. Nat
Genet 2004, 36(9):949-951.
5. McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC,
Dallaire S, Gabriel SB, Lee C, Daly MJ, et al: Common deletion
polymorphisms in the human genome. Nat Genet 2006, 38(1):86-92.
6. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H,
Shapero MH, Carson AR, Chen W, et al: Global variation in copy number
in the human genome. Nature 2006, 444(7118):444-454.
7. Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B,
Yoon S, Krasnitz A, Kendall J, et al: Strong association of de novo copy
number mutations with autism. Science 2007, 316(5823):445-449.
8. Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E,
Hayden H, Albertson D, Pinkel D, et al: Fine-scale structural variation of
the human genome. Nat Genet 2005, 37(7):727-732.
9. Albertson DG, Pinkel D: Genomic microarrays in human genetic disease
and cancer. Hum Mol Genet 2003, 12(Spec No 2):R145-152.
10. Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM,
Clark RA, Schwartz S, Segraves R, et al: Segmental duplications and copy-
number variation in the human genome. Am J Hum Genet 2005,
77(1):78-88.
11. Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE,
MacAulay C, Ng RT, Brown CJ, Eichler EE, et al: A comprehensive analysis
of common copy-number variations in the human genome. Am J Hum
Genet 2007, 80(1):91-104.
12. Shaikh TH, Gai X, Perin JC, Glessner JT, Xie H, Murphy K, O'Hara R,
Casalunovo T, Conlin LK, D'Arcy M, et al: High-resolution mapping and
analysis of copy number variations in the human genome: A data
resource for clinical and research applications. Genome Res 2009,
19(9):1682-90.
13. Elia J, Gai X, Xie HM, Perin JC, Geiger E, Glessner JT, D'Arcy M,
Deberardinis R, Frackelton E, Kim C, et al: Rare structural variants found in
attention-deficit hyperactivity disorder are preferentially associated with
neurodevelopmental genes. Mol Psychiatry 2009.
14. Colella S, Yau C, Taylor JM, Mirza G, Butler H, Clouston P, Bassett AS,
Seller A, Holmes CC, Ragoussis J: QuantiSNP: an Objective Bayes Hidden-
Markov Model to detect and accurately map copy number variation
using SNP genotyping data. Nucleic Acids Res 2007, 35(6):2013-2025.
15. Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M:
PennCNV: an integrated hidden Markov model designed for high-
resolution copy number variation detection in whole-genome SNP
genotyping data. Genome Res 2007, 17(11):1665-1674.
16. Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov
models approach to the analysis of array CGH data. J Multivar Anal 2004,
90(1):132-153.
17. Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E: Analysis of array CGH
data: from signal ratio to gain and loss of DNA regions. Bioinformatics
2004, 20(18):3413-3422.
18. Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary
segmentation for the analysis of array-based DNA copy number data.
Biostatistics 2004, 5(4):557-572.
19. Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of
algorithms for identifying amplifications and deletions in array CGH
data. Bioinformatics 2005, 21(19):3763-3770.
20. Willenbrock H, Fridlyand J: A comparison study: applying segmentation to
array CGH data for downstream analyses. Bioinformatics 2005,
21(22):4084-4091.
21. Fiegler H, Redon R, Andrews D, Scott C, Andrews R, Carder C, Clark R,
Dovey O, Ellis P, Feuk L, et al: Accurate and reliable high-throughput
detection of copy number variation in the human genome. Genome Res
2006, 16(12):1566-1574.
22. Itsara A, Cooper GM, Baker C, Girirajan S, Li J, Absher D, Krauss RM,
Myers RM, Ridker PM, Chasman DI, et al: Population analysis of large copy
number variants and hotspots of human genetic disease. Am J Hum
Genet 2009, 84(2):148-161.
23. Hupe P, La Rosa P, Liva S, Lair S, Servant N, Barillot E: ACTuDB, a new
database for the integrated analysis of array-CGH and clinical data for
tumors. Oncogene 2007, 26(46):6641-6652.
24. Venkatraman ES, Olshen AB: A faster circular binary segmentation
algorithm for the analysis of array CGH data. Bioinformatics 2007,
23(6):657-663.
25. Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F, Haden K, Li J,
Shaw CA, Belmont J, et al: High-resolution genomic profiling of
chromosomal aberrations using Infinium whole-genome genotyping.
Genome Res 2006, 16(9):1136-1148.
26. Komura D, Shen F, Ishikawa S, Fitch KR, Chen W, Zhang J, Liu G, Ihara S,
Nakamura H, Hurles ME, et al: Genome-wide detection of human copy
number variations using high-density DNA oligonucleotide arrays.
Genome Res 2006, 16(12):1575-1584.
27. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E,
Stajich JE, Harris TW, Arva A, et al: The generic genome browser: a
building block for a model organism system database. Genome Res 2002,
12(10):1599-1610.
28. Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic association
database. Nat Genet 2004, 36(5):431-432.
29. Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR,
Rhead B, Raney BJ, Pohl A, Pheasant M, et al: The UCSC Genome Browser
Database: update 2009. Nucleic Acids Res 2009, 37(Database issue):D755-761.
30. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V,
Church DM, DiCuccio M, Edgar R, Federhen S, et al: Database resources of
the National Center for Biotechnology Information. Nucleic Acids Res 2006,
34(Database issue):D173-180.
31. Fang HW, Murphy K, Jin Y, Kim J, White P: Human gene name
normalization using text matching with automatically extracted
synonym dictionaries. BioNLP'06: June 8 2006, New York, New York 2006.
32. Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y,
Blankenberg D, Albert I, Taylor J, et al: Galaxy: a platform for interactive
large-scale genome analysis. Genome Res 2005, 15(10):1451-1455.
doi:10.1186/1471-2105-11-74
Cite this article as: Gai et al.: CNV Workshop: an integrated platform for
high-throughput copy number variation discovery and clinical
diagnostics. BMC Bioinformatics 2010 11:74.