The American Journal of Human Genetics, Volume 93
of Small CNVs in Simplex Autism
Niklas Krumm, Brian J. O’Roak, Emre Karakoc, Kiana Mohajeri, Ben Nelson, Laura Vives, Sebastien
Jacquemont, Jeff Munson, Raphe Bernier, and Evan E. Eichler
Figure S1. Data Processing and CNV Calling Details
A. Previously generated FASTQ data from
four exome sequencing studies (Iossifov et
al., 2012; O'Roak et al., 2012; Sanders et
al., 2012) were used in this study. In
addition, we generated sequence for the
unaffected sibling in 20 published trios
(O'Roak et al., 2011) for a complete set of
412 quads. Data were processed and
analyzed using CoNIFER (Krumm, 2011, as
previously described). SVD cutoff values
were set to either 12 or 15 for each dataset.
We excluded one family (12154) on the
basis of significant contamination between
members, resulting in 411 families that passed QC.
B. We used DNAcopy and CGHcall to
segment and assign deletion or duplication
probabilities to SVD-ZRPKM values.
Parameters for DNAcopy were as follows:
alpha = 0.01, using the undo.split="sdundo"
option with undo.SD = 2.
C. Next, we grouped individual CNV calls
into similar CNV Regions (CNVRs) using
pairwise distances between all CNVs
based on a modified reciprocal overlap
(RO) heuristic that incorporates the size of
the CNV as well as RO percentage.
D. We reduced false-negative calls for
inherited CNVs by applying a family-based
genotyping method which uses a metric
based on Mutual Information between the
raw CoNIFER of each family member at a
particular locus in order to determine
E. CNVs were filtered based on overlap
>50% with known polymorphic sites,
processed pseudogenes, segmental duplications, and other non-unique portions of the exome.
F. Our final set of calls was created by requiring an absolute median SVD-ZRPKM score (i.e., signal
strength) of ≥0.5 for calls with 5 or more probes, ≥1.0 for calls 3-5 probes in length, and ≥1.0 for calls
2 probes in length. We excluded any calls on the X or Y chromosomes for all analyses in this work.
Details of these methods are available upon request.
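The signal-strength filter in panel F can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and because the text lists 5-probe calls under both the ≥0.5 and the ≥1.0 rule, the sketch assumes the ≥0.5 rule applies from 5 probes upward.

```python
from statistics import median

SEX_CHROMOSOMES = {"X", "Y", "chrX", "chrY"}


def min_signal(n_probes):
    """Return the minimum |median SVD-ZRPKM| for a call of n_probes probes.

    ASSUMPTION: 5-probe calls use the 0.5 cutoff; the source text is
    ambiguous on this point. Single-probe calls are not covered by the text.
    """
    if n_probes >= 5:
        return 0.5
    if n_probes >= 2:
        return 1.0
    return None


def passes_signal_filter(svd_zrpkm, chrom):
    """svd_zrpkm: per-probe SVD-ZRPKM values for one call (hypothetical input)."""
    if chrom in SEX_CHROMOSOMES:
        return False  # X and Y calls are excluded from all analyses
    cutoff = min_signal(len(svd_zrpkm))
    if cutoff is None:
        return False
    return abs(median(svd_zrpkm)) >= cutoff
```

Taking the absolute value of the median lets one cutoff serve both deletions (negative signal) and duplications (positive signal).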
Figure S1: CNV calling flowchart. Flow chart for inherited CNV detection. See Methods and Supplemental Methods for details.
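To illustrate the grouping step in panel C, the sketch below computes plain reciprocal overlap (RO) between two calls and groups calls into CNVRs by single linkage. The size-dependent modification of the heuristic is not specified in the text and is therefore not reproduced; the 0.5 RO cutoff, the single-linkage rule, and all names are assumptions made for illustration.

```python
def reciprocal_overlap(a, b):
    """a, b: (chrom, start, end) intervals. Returns the smaller of the two
    overlap fractions, or 0.0 if the calls do not overlap."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def group_into_cnvrs(calls, min_ro=0.5):
    """Group calls whose pairwise RO is >= min_ro (ASSUMED cutoff),
    using union-find so that linked calls end up in one CNVR."""
    parent = list(range(len(calls)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(calls)):
        for j in range(i + 1, len(calls)):
            if reciprocal_overlap(calls[i], calls[j]) >= min_ro:
                parent[find(i)] = find(j)

    groups = {}
    for i, call in enumerate(calls):
        groups.setdefault(find(i), []).append(call)
    return list(groups.values())
```

A pairwise distance for clustering can be taken as 1 - RO, so identical calls have distance 0 and non-overlapping calls have distance 1.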
Figure S2. Mapped Coverage between Probands/Siblings and by Data Source
Figure S2: Mapped coverage between probands/siblings and by data source.
X-axis (all panels): total 36-mer reads (×10^8) mapped to the human exome by the mrsFAST
alignment program. (a) Histograms of probands (left) and siblings (center) and their overlap
(right) show no significant difference in coverage levels (paired t-test, p = 0.09).
(b) Same as in (a), but by dataset, revealing that the Iossifov dataset had lower
coverage than the O'Roak or Sanders datasets.
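The proband-versus-sibling comparison in (a) is a paired test because each proband is matched to the sibling from the same family. A minimal sketch of the paired t statistic is given below, using only the standard library; the function name and inputs are illustrative. Converting the statistic to a p-value requires the t distribution with n - 1 degrees of freedom (for example via scipy.stats.ttest_rel), which is omitted here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Paired t statistic for matched samples x[i], y[i].

    Returns (t, degrees_of_freedom). Here x and y would hold the mapped
    read counts of the proband and sibling of each family, in the same order.
    """
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Standard error of the mean within-pair difference
    se = stdev(diffs) / sqrt(n)
    return mean(diffs) / se, n - 1
```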
Figure S3. Array-CGH Validation of CNVs
Figure S3: Receiver operating characteristic (ROC) curves used to determine deletion and
duplication thresholds in array-CGH validation. ROC curves are based on 60 true-positive
deletions (a) and duplications (b) from Sanders et al., 2011 in these samples. Arrows indicate
the chosen optimal operating point (OOP), which was used as the threshold for validation of
unknown calls.
We designed a custom Agilent SurePrint G3 4x180k CGH microarray to confirm CNVs, using variable-density
probe spacing, ranging from one probe per 125 bp for calls smaller than 10 kbp to one probe per 5 kbp for
large calls up to 500 kbp, in order to ensure at least 10 probes per call. (Note: Due to the high density of
probes required for validation of small CNVs, some of the probes were of lower quality (based on the
manufacturer's quality score), and their performance was accordingly lower.) Test and reference DNA (we
used DNA from HapMap sample NA18507) for each sample was labeled with Cy3 and Cy5 dye using a
NimbleGen array labeling kit according to the manufacturer's instructions. Five micrograms of labeled test
and reference DNA were hybridized to the microarray slide for 24 hours using Agilent reagents, and slides
were washed according to the manufacturer's directions. Slides were scanned using an Agilent Microarray
Scanner and analyzed using Agilent Feature Extraction v10.5.1.1. Arrays with a per-sample standard
deviation of logR values > 0.5 were repeated.
In order to reduce systematic and batch noise between probes and samples, we employed a normalization
strategy similar to that of the CoNIFER pipeline and used SVD to remove the three strongest components of
variance. We determined minimum logR thresholds for the validation arrays by leveraging the logR values
across the 60 previously identified CNVs (from Sanders et al., 2011), each found in at least one of our
validation samples. We calculated receiver operating characteristic (ROC) curves for (a) duplications (39
calls) and (b) deletions (21 calls), using the samples without the previously identified CNVs as the "true
negatives". Next, we individually picked the optimal operating point (OOP) for deletions (median logR
OOP ≤ -0.178) and duplications (median logR OOP ≥ 0.24), such that we maximally discerned our known
true positives from true negatives. Both OOPs had a false-positive rate (FPR) of ~1% and a recall >90%,
indicating that our array was highly specific and sensitive to true events. These logR cutoff values were
used to assess whether novel CNVs were true positives: if the mean logR across all probes in the call
interval was greater than the duplication threshold (or lower than the deletion threshold), we considered
the call validated.
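The threshold selection and validation rule described for the array-CGH data can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the text does not say which criterion defined the optimal operating point, so the sketch assumes Youden's J (TPR minus FPR); the function names are hypothetical; only the two cutoff values (-0.178 and 0.24) come from the text.

```python
from statistics import mean

DUP_THRESHOLD = 0.24    # duplication logR cutoff reported in the text
DEL_THRESHOLD = -0.178  # deletion logR cutoff reported in the text


def roc_points(positives, negatives):
    """Sweep thresholds for the rule 'score >= threshold' (duplications).
    For deletions, negate the logR values before calling.
    Returns a list of (threshold, FPR, TPR) tuples."""
    points = []
    for thr in sorted(set(positives) | set(negatives)):
        tpr = sum(v >= thr for v in positives) / len(positives)
        fpr = sum(v >= thr for v in negatives) / len(negatives)
        points.append((thr, fpr, tpr))
    return points


def optimal_operating_point(positives, negatives):
    """ASSUMPTION: pick the threshold maximizing Youden's J = TPR - FPR."""
    return max(roc_points(positives, negatives), key=lambda p: p[2] - p[1])


def is_validated(probe_logr, call_type):
    """Apply the validation rule to the mean logR across a call's probes.
    Inclusive comparisons are used here, matching the stated OOP bounds."""
    m = mean(probe_logr)
    if call_type == "dup":
        return m >= DUP_THRESHOLD
    if call_type == "del":
        return m <= DEL_THRESHOLD
    raise ValueError("call_type must be 'dup' or 'del'")
```

With well-separated known positives and negatives, the chosen point sits at the lowest threshold that still excludes every negative, which corresponds to the low FPR and high recall reported above.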
Figure S4. CNV Size and Copy Number
Figure S4: CNV Size, inheritance, and copy number
Inherited CNVs in probands and siblings, binned by size in exons (a) or estimated
genomic size (b). As expected, larger CNVs are more likely to be duplications, an
effect we found true for both probands and siblings.