www.cmj.hr
Aim To investigate the information about Y-structural variants (SVs) in the general population that could be obtained by low-coverage whole-genome sequencing.
Methods We investigated SVs on the male-specific portion of the Y chromosome in the 70 individuals from Africa, Europe, or East Asia sequenced as part of the 1000 Genomes Pilot project, using data from this project and from additional studies on the same samples. We applied a combination of read-depth and read-pair methods to discover candidate Y-SVs, followed by validation using information from the literature, independent sequence and single nucleotide polymorphism-chip data sets, and polymerase chain reaction experiments.
Results We validated 19 Y-SVs, 2 of which were novel. Non-reference allele counts ranged from 1 to 64. The regions richest in variation were the heterochromatic segments near the centromere or the DYZ19 locus, followed by the ampliconic regions, but some Y-SVs were also present in the X-transposed and X-degenerate regions. In all, 5 of the 27 protein-coding gene families on the Y chromosome varied in copy number.
Conclusions We confirmed that Y-SVs were readily detected from low-coverage sequence data and were abundant on the chromosome. We also reported both common and rare Y-SVs that are novel.
Structural variation on the human Y chromosome from population-scale resequencing
Jose Rodrigo Flores Espinosa1,2, Qasim Ayub1, Yuan Chen1, Yali Xue1, Chris Tyler-Smith1
1The Wellcome Trust Sanger Institute, Hinxton, UK
2Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
Received: November 27, 2014
Accepted: May 24, 2015
Correspondence to: Chris Tyler-Smith, The Wellcome Trust Sanger Institute, Hinxton, Cambs. CB10 1SA, UK, cts@sanger.ac.uk
FORENSIC SCIENCE
Croat Med J. 2015;56:194-207
doi: 10.3325/cmj.2015.56.194
Espinosa et al: Structural variation on the human Y chromosome from resequencing
Structural variation, here considered as genetic variation that affects more than a single base pair of DNA, accounts for the majority of the nucleotide differences between individuals (1). Furthermore, structural variants (SVs) have substantial functional impact. For example, individual SVs are more likely than individual single nucleotide polymorphisms (SNPs) to lead to phenotypic differences such as changes in gene expression (2), are responsible for more loss-of-function in protein-coding genes (3), and underlie many disorders, indeed leading to the recognition of a novel class of “genomic disorder” (4). There has therefore been considerable interest among human geneticists in cataloguing SVs in both patients and samples from the general population. Technological advances have allowed the resolution of genome surveys to increase from ~5 Mb in cytogenetic studies, via ~100 kb in comparative genomic hybridization (CGH) experiments based on bacterial artificial chromosome clones (5), to ~0.5 kb using high-resolution oligonucleotide arrays (6), and finally to base-pair resolution in sequence-based studies (7).
The Y chromosome (here, “Y chromosome” always refers to the male-specific portion excluding the pseudoautosomal regions) stands out in SV studies for two reasons. First, it is unusually rich in SVs (8). Cytogeneticists reported visible variation in the length of the Y long arm, pericentric inversions, and translocations of nucleolus organizer regions (NORs, containing ribosomal RNA gene clusters) to the Y in the general population (9). Early molecular studies focusing mainly on individual loci identified variable microsatellites (10), minisatellites (11,12), major satellites (13,14), tandemly-arranged genes (15), and also large duplications, deletions, and inversions (16,17) in the general population. The enrichment of copy number variants (CNVs) on the Y was confirmed by genome-wide surveys (5). More recently, however, the second reason for the Y standing out has come to the fore: it has been neglected in high-resolution studies, either because these are focused on women (6) or, in the most comprehensive sequence-based study (7), because of a combination of lower depth of coverage of the sequence reads (ie, less sequence information because of its haploid status), the use of SV discovery algorithms optimized for autosomes, and the complexities of mapping the short sequence reads produced by next-generation sequencing to the repeated regions that make up most of the Y.
There are, nevertheless, good reasons to study Y-SVs at high resolution. Several independent >100 kb deletions on Yq (18), and variation of the copy number of the TSPY array on Yp (19), contribute to spermatogenic variation and failure, but together account for only a small proportion of male infertility. Do smaller Y-SVs account for additional cases of male infertility? In forensic genetics, the AMELY locus forms the basis for the most commonly-used sex test, but is unreliable in a minority of men (20) because of deletions of this locus, which make men difficult to distinguish from women, and have several independent origins (21). More generally, the absence of recombination over most of the length of the Y provides an opportunity to investigate the density, size distribution, and mutational origins of SVs in this different environment, aided by a well-established phylogeny (22).
We therefore used a previously-generated population-scale resequencing data set to investigate Y-SVs at high resolution. We chose the 70 male samples used in the 1000 Genomes Pilot Project (23) because, despite their potential, only three validated Y-SVs have been reported in them (7), leaving scope for further discoveries, and additional high-coverage sequence data from 8 of these men have been released. We performed further Y-SV discovery in both the original low-coverage and newer high-coverage sequence data, combined these Y-SVs with those reported by Mills et al, Complete Genomics, and Phase 1 of the 1000 Genomes project in the same samples (7,24), and performed extensive validation and functional prediction on the combined set.
MATERIALS AND METHODS
1000 Genomes data
We analyzed 70 Y chromosome sequences released by the 1000 Genomes Pilot 1 Project. These chromosomes were sequenced using ~36 bp Illumina (Illumina, Inc. San Diego, CA, USA) reads. Most have an average depth of sequence coverage of 2.3×; that is, each Y base pair in each individual was sequenced on average 2.3 times. However, two Y chromosomes were sequenced more deeply, to an average coverage of 26.2×. The samples come from four worldwide populations representing three continental regions: 1) Yoruba in Ibadan, Nigeria from sub-Saharan Africa, abbreviated YRI; 2) CEPH (Centre d’Etude du Polymorphisme Humain)-Utah residents with ancestry from northern and western Europe, abbreviated CEU; 3) Han Chinese in Beijing, China, from East Asia, abbreviated CHB; and 4) Japanese in Tokyo, Japan, also from East Asia and abbreviated JPT. At the classification level used, ten different Y haplogroups were represented (Supplementary Table 1). Aligned data in the form of BAM files were downloaded from the FTP site of the 1000 Genomes Project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/.
TABLE 1. Summary of the 19 validated structural variants (SVs)*

| SV ID | Chromosome | Region start | Region end | Type of event | Length (bp) | # Samples | MSY Class | SDs present in the region | Main genes | Data set | Comment | P1T | CGT | CGR | P1R | Ph1R | OCT | L | PCR | # Sources of validation | Sources of validation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SV 01 | Y | 3.109.266 | 3.111.300 | Gains:0, Deletions:4 | 2.034 | 4 | X-Transposed | + | NA | Phase 1 report | NA | - | - | - | - | + | - | - | + | 2 | Platform(1), PCR |
| SV 02 | Y | 6.543.750 | 6.573.750 | Gains:22, Deletions:0 | 30.000 | 22 | X-Transposed | + | NA | This work | New | + | + | - | - | - | - | - | - | 2 | Platforms(2+) |
| SV 03 | Y | 7.761.250 | 8.001.250 | Gains:1, Deletions:0 | 240.000 | 1 | Ampliconic | - | NA | This work | New | + | - | - | - | - | + | - | - | 2 | Platforms(2+) |
| SV 04 | Y | 9.172.875 | 9.236.625 | Gains:4, Deletions:1 | 63.750 | 5 | Ampliconic | + | TSPY | This work | TSPY array | + | + | - | - | - | + | + | - | 4 | Platforms(2+), Literature |
| SV 05 | Y | 9.300.125 | 9.367.625 | Gains:6, Deletions:0 | 67.500 | 6 | Ampliconic | + | TSPY | This work | TSPY array | + | - | - | - | - | + | + | - | 3 | Platforms(2+), Literature |
| SV 06 | Y | 9.639.875 | 9.650.375 | Gains:0, Deletions:8 | 10.500 | 8 | Ampliconic | + | TTTY22 | This work | NA | + | + | + | - | - | + | - | - | 4 | Platforms(2+) |
| SV 07 | Y | 10.016.250 | 10.041.250 | Gains:1, Deletions:36 | 25.000 | 37 | Heterochromatic | + | NA | This work | Alphoid Repeats | + | - | - | - | - | - | + | - | 2 | Platform(1), Literature |
| SV 08 | Y | 10.083.750 | 10.104.553 | Gains:0, Deletions:13 | 20.803 | 13 | Heterochromatic | - | NA | This work | Alphoid Repeats | + | + | - | - | - | + | + | - | 4 | Platforms(2+), Literature |
| SV 09 | Y | 13.104.553 | 13.126.250 | Gains:0, Deletions:17 | 21.697 | 17 | Heterochromatic | - | NA | This work | Repeats | + | + | - | - | - | + | + | - | 4 | Platforms(2+), Literature |
| SV 10 | Y | 13.136.250 | 13.143.954 | Gains:0, Deletions:9 | 7.704 | 9 | Heterochromatic | - | NA | This work | Repeats | + | - | - | - | - | - | + | - | 2 | Platform(1), Literature |
| SV 11 | Y | 13.446.250 | 13.688.750 | Gains:4, Deletions:38 | 242.500 | 40 | Heterochromatic | + | NA | This work | Repeats | + | - | + | - | - | - | + | - | 3 | Platforms(2+), Literature |
| SV 12 | Y | 14.208.831 | 14.208.912 | Gains:0, Deletions:4 | 81 | 4 | X-Degenerate | - | NA | Pilot 1 report | NA | - | - | - | + | - | - | - | + | 2 | Platform(1), PCR |
| SV 13 | Y | 17.306.559 | 17.311.584 | Gains:0, Deletions:3 | 5.025 | 3 | X-Degenerate | + | NA | This work | NA | + | - | - | - | + | + | - | + | 4 | Platforms(2+), PCR |
| SV 14 | Y | 22.223.737 | 22.434.987 | Gains:25, Deletions:39 | 211.250 | 64 | Heterochromatic | - | NA | This work | DYZ19 | + | + | + | - | + | - | - | - | 4 | Platforms(2+) |
| SV 15 | Y | 22.464.987 | 22.471.737 | Gains:11, Deletions:3 | 6.750 | 14 | Heterochromatic | - | NA | This work | DYZ19 | + | - | + | - | + | - | - | - | 3 | Platforms(2+) |
| SV 16 | Y | 24.875.619 | 26.526.445 | Gains:3, Deletions:9 | 1.650.826 | 12 | Ampliconic | + | BPY2, DAZ1, DAZ2, PRYP3, CDY1B | This work | gr/gr deletion | + | + | + | - | - | + | + | + | 6 | Platforms(2+), Literature, PCR |
| SV 17 | Y | 25.299.362 | 25.424.362 | Gains:14, Deletions:9 | 125.000 | 19 | Ampliconic | + | DAZ1, DAZ2 | This work | DAZ 1-2 | + | + | + | - | - | - | + | - | 4 | Platforms(2+), Literature |
| SV 18 | Y | 25.505.069 | 27.435.593 | Gains:0, Deletions:3 | 1.930.524 | 3 | Ampliconic | + | PRYP3, CDY1B, BPY2B, DAZ3, DAZ4 | This work | b2/b3 (g1/g3) deletion | - | - | - | - | - | - | + | + | 2 | Platform(1), Literature, PCR |
| SV 19 | Y | 26.929.362 | 27.046.862 | Gains:15, Deletions:10 | 117.500 | 19 | Ampliconic | + | DAZ3, DAZ4 | This work | DAZ 3-4 | + | + | + | - | - | - | + | - | 4 | Platforms(2+), Literature |

*Type of event: total number of deletions and duplications that carry the variant. MSY Class: type of Y sequence class in which the SVs were located. Segmental duplications (SDs) associated in the Region: + = overlapping or flanking segmental duplications around the SV. Genes: protein-coding genes that overlap each SV. Data set: source of each SV. Comment: relevant prior information for each SV; variants that have not been previously reported elsewhere are labeled as “New”. P1T: supporting evidence from the analysis of Pilot 1 Data as part of this work. CGT: supporting evidence from the analysis of Complete Genomics Data as part of this work. CGR: supporting evidence from previously reported data for Complete Genomics samples. P1R: supporting evidence from previously reported data for Pilot 1 samples. Ph1R: supporting evidence from previously reported data on 1000 Genomes Phase 1 samples. OCT: supporting evidence from the analysis of OMNI SNP arrays as part of this work. L: supporting evidence from information in published literature. Polymerase chain reaction (PCR): supporting evidence from PCR experiments performed as part of this work. # Sources of Validation: total number of independent sources of validation. Sources of Validation: “Platform(1)” corresponds to the presence of evidence from one genomic technology and “Platform(2+)” indicates the presence of supporting evidence from two or more genomic technologies; literature and PCR evidence are considered to be separate. For all columns containing “+” or “-” symbols, the former indicates “presence” and the latter “absence”.
In addition to the Y-CNVs discovered during the current study, in our final data set we included the only Y-deletion reported and validated in the original study of these genomes (24), and 4 Y-CNVs reported in a more recent study that used these and additional samples (7). Most of the analyses were performed from September to December 2010, and from January to July 2012, with follow-up and manuscript preparation from August 2012 to November 2014.
Complete Genomics data
A total of 8 male samples that are in common between 1000 Genomes Pilot 1 data and the Complete Genomics Public Data set (v36 v2.0.0) were analyzed. These samples were sequenced at an average depth of 25.4× using 33 bp reads (Supplementary Table 1). Sequencing data in the form of mapping.*.tsv.bz2 and read.*.tsv.bz2 files were downloaded and converted into BAM files using cgatools-1.4.0.15. We also included insertions and deletions (>50 bp) reported in the high-quality release of Y-SVs on the same samples using the Complete Genomics Analysis Pipeline Version 2.0. All data were downloaded from the FTP site of Complete Genomics: ftp://ftp2.completegenomics.com/.
OMNI SNP-chip data
Normalized SNP intensity data from Illumina HumanOmni2.5-8 arrays generated for the entire set of 1000 Genomes Project samples were analyzed (Supplementary Table 1). Data were downloaded from the FTP site of the 1000 Genomes Project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/.
DNA samples for experimental validation
Experimental validation of some SVs was carried out as described below using DNA samples obtained from the Coriell Institute for Medical Research, http://www.coriell.org/.
Depth of coverage CNV analysis
CNVs were called using three different data sets: 1) low-coverage (2.3×) samples with Illumina sequence data, 2) low-to-high coverage (4-23×) pools that were built by merging low-coverage samples from the same haplogroup (23), and 3) high-coverage (25.4×) samples from Complete Genomics. A read depth-of-coverage (DOC) analysis was implemented (25) on a total of 70, 10, and 8 samples for each data set, respectively (Supplementary Table 1). This method follows a rationale derived from aCGH experiments: first, instead of hybridizing DNA against a microarray, sequence reads from two different samples are aligned to the same genome template. Second, instead of measuring levels of fluorescence, the number of reads in each sample is directly counted using a sliding-window approach. Finally, instead of searching for significant differences using intensity levels, differences in coverage between the two samples are calculated and then transformed into log2-ratios. Differences exceeding a significance threshold are indicative of relative gains or losses of genetic material.
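The read-depth comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the CNV-seq implementation used in the study: it uses non-overlapping windows rather than a true sliding window, and the function names, the pseudocount, the library-size scaling, and the toy read positions are all assumptions made for the example.

```python
import math
from collections import Counter


def window_counts(read_starts, window_size):
    """Count reads per fixed-size window, keyed by window index."""
    return Counter(pos // window_size for pos in read_starts)


def log2_ratios(test_starts, ref_starts, window_size, pseudocount=0.5):
    """Per-window log2(test/reference) after scaling for total read number.

    The pseudocount (an assumption of this sketch) avoids division by zero
    in windows with no reads.
    """
    test = window_counts(test_starts, window_size)
    ref = window_counts(ref_starts, window_size)
    scale = len(ref_starts) / len(test_starts)  # normalise library size
    ratios = {}
    for w in sorted(set(test) | set(ref)):
        t = test.get(w, 0) * scale + pseudocount
        r = ref.get(w, 0) + pseudocount
        ratios[w] = math.log2(t / r)
    return ratios


def call_windows(ratios, threshold):
    """Label windows whose |log2 ratio| reaches the threshold as gain or loss."""
    return {
        w: ("gain" if value > 0 else "loss")
        for w, value in ratios.items()
        if abs(value) >= threshold
    }


# Toy data: the reference has 10 reads in each of four 1 kb windows;
# the test sample has no reads in the third window (index 2).
reference_reads = [w * 1000 + i * 100 for w in range(4) for i in range(10)]
test_reads = [p for p in reference_reads if p // 1000 != 2]
print(call_windows(log2_ratios(test_reads, reference_reads, 1000), 1.0))
```

In this toy case only the window with missing reads is called, as a relative loss. The other windows show a small positive ratio caused by the library-size scaling, which is one reason a threshold is needed.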
The individual NA12891 (CEU trio father) was used as the reference for comparisons against all the other samples. A threshold (T) and a sliding-window size (WS) that best fitted each of our samples were chosen based on the following reasoning: we sub-sampled the reference individual NA12891 to a series of different coverage levels representative of the samples in each data set. We then performed DOC analysis between each of the sub-samples and the complete NA12891 reference. Since this last analysis compares samples from the same individual, we expected no CNVs to be detected. Combinations of T and WS values producing this expected result were chosen as the best for each data set (Supplementary Table 2).
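The calibration logic can be illustrated with a small grid search: sub-sample the reference reads, compare the sub-sample with the full reference, and keep the (WS, T) pairs that yield no calls. The sketch below is an assumption-laden toy, not the procedure recorded in Supplementary Table 2; the random sub-sampling scheme, the seed, the pseudocount, and the grids of window sizes and thresholds are all hypothetical.

```python
import math
import random
from collections import Counter


def count_calls(test_starts, ref_starts, window_size, threshold):
    """Number of windows whose |log2(test/reference)| reaches the threshold."""
    test = Counter(p // window_size for p in test_starts)
    ref = Counter(p // window_size for p in ref_starts)
    scale = len(ref_starts) / len(test_starts)  # normalise library size
    calls = 0
    for w in set(test) | set(ref):
        ratio = (test.get(w, 0) * scale + 0.5) / (ref.get(w, 0) + 0.5)
        if abs(math.log2(ratio)) >= threshold:
            calls += 1
    return calls


def calibrate(ref_starts, fraction, window_sizes, thresholds, seed=0):
    """Return (WS, T) pairs that give zero calls in a self-comparison.

    The reference reads are randomly sub-sampled to `fraction` of their
    number to mimic a lower-coverage sample from the same individual.
    """
    rng = random.Random(seed)
    sub = [p for p in ref_starts if rng.random() < fraction]
    if not sub:
        raise ValueError("sub-sample is empty; increase the fraction")
    return [
        (ws, t)
        for ws in window_sizes
        for t in thresholds
        if count_calls(sub, ref_starts, ws, t) == 0
    ]


# Hypothetical reference: one read start every 10 bp over 100 kb.
reference_reads = list(range(0, 100_000, 10))
print(calibrate(reference_reads, 0.1, [1_000, 10_000], [0.5, 1.0, 2.0]))
```

Larger windows and higher thresholds suppress false calls at low coverage, at the cost of resolution and sensitivity, which is the trade-off the calibration is meant to balance.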
A problem often encountered when using DOC strategies is the fragmentation of CNVs into two or more adjacent segments that actually correspond to the same variant. This is due to local minima in the log2-ratio signal that fail to meet the threshold established. To account for this, we joined into single calls all segments with an upstream or downstream neighbor separated by a distance of 5 kb or less (see Supplementary Table 3 and Supplementary Table 4 for a description of SV calls after and before merging, respectively). Using a modified version of the cnv R-package present in the CNV-seq method (25), we generated coverage plots for each of the detected CNV regions in all the samples. Finally, all CNVs showing weak evidence based on a visual inspection of these plots were discarded. A total of 16 CNVs were identified using this strategy (Supplementary Material, pages 1-26).
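The merging rule lends itself to a short example. The sketch below joins segments whose gap to the preceding segment is 5 kb or less; the function name and the coordinates are hypothetical, and the sketch assumes that all segments belong to one sample and one event type, a detail the text does not specify.

```python
def merge_segments(segments, max_gap=5_000):
    """Merge (start, end) segments separated by a gap of max_gap bp or less."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous call: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical fragmented calls: the first two are 4 kb apart and are joined;
# the third is 18 kb further downstream and stays separate.
calls = [(1_000, 4_000), (8_000, 12_000), (30_000, 35_000)]
print(merge_segments(calls))  # [(1000, 12000), (30000, 35000)]
```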
Paired-end analysis
Paired-end methodologies are typically used for the discovery of insertions and deletions via the identification of read pairs whose mapping position deviates significantly from the expected distance relative to a reference (1). This indicates that the read pair was derived from an individual whose structure differs from the reference sequence. We performed an analysis of discordant paired-end reads (26) in the three data sets previously described. Because not all the 1000 Genomes Pilot 1 Project samples were sequenced using paired-end reads, we only used samples and libraries that were sequenced using paired-end reads. All the Complete Genomics samples were sequenced using paired-end reads. Paired-end approaches rely heavily on the correct mapping of reads in the genome. We filtered out the reads with a mapping quality below 35 in all samples using samtools-0.1.15 (27). Overall this resulted in the use of 58 samples from the 1000 Genomes Pilot 1 Project, 8 from Complete Genomics, and 10 haplogroup pools. The average depth for the three data sets was 0.8×, 26.1×, and 4.1×, respectively (Supplementary Table 1). Given these differences in read depth, a minimum of 2 read pairs supporting an SV was required in the case of the 1000 Genomes Pilot 1 samples and 4 in the cases of Complete Genomics and haplogroup pools. For the rest of the parameters, default settings were used. Individual images of the SVs detected were created using the IGV viewer (28) and visually analyzed to filter out possible false positives. This strategy resulted in 5 deletions that were selected for polymerase chain reaction (PCR) validation.
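The principle behind the paired-end analysis can be shown with a toy deletion caller. It is not the tool cited as (26): the `ReadPair` record, the clustering distance, the insert-size tolerance, and the example coordinates are assumptions for illustration. Only the mapping-quality cutoff of 35 and the minimum support of 2 or 4 read pairs come from the text.

```python
from collections import namedtuple

# Simplified read-pair record (hypothetical): end of the left read,
# start of the right read, and the lower mapping quality of the two reads.
ReadPair = namedtuple("ReadPair", "left_end right_start mapq")


def deletion_candidates(pairs, expected_insert, tolerance,
                        min_mapq=35, min_support=2, cluster_dist=500):
    """Cluster pairs that map further apart than expected.

    Returns (start, end, n_pairs) tuples, where start and end are the
    innermost coordinates supported by every pair in the cluster.
    """
    discordant = sorted(
        p for p in pairs
        if p.mapq >= min_mapq
        and (p.right_start - p.left_end) > expected_insert + tolerance
    )
    clusters = []
    for pair in discordant:
        if clusters and pair.left_end - clusters[-1][-1].left_end <= cluster_dist:
            clusters[-1].append(pair)
        else:
            clusters.append([pair])
    return [
        (max(p.left_end for p in c), min(p.right_start for p in c), len(c))
        for c in clusters
        if len(c) >= min_support
    ]


pairs = [
    ReadPair(1_000, 6_200, 60),    # discordant, well mapped
    ReadPair(1_050, 6_250, 40),    # discordant, well mapped
    ReadPair(1_020, 6_220, 10),    # discordant but below the MAPQ cutoff
    ReadPair(20_000, 20_300, 60),  # concordant
]
print(deletion_candidates(pairs, expected_insert=300, tolerance=100))
```

With a minimum support of 2 the two well-mapped discordant pairs produce one candidate; raising `min_support` to 4, as used for the deeper data sets, removes it.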
Validation using OMNI SNP-chip data
In order to search for additional supporting evidence for all the variants detected, we used normalized SNP intensity data from Illumina Omni 2.5 SNP-chip arrays, which are available for all the samples analyzed. We could thus investigate whether or not a candidate CNV was supported by independent copy number (ie, intensity) data. Samples not present in our analysis, as well as SNPs not present on the Y chromosome, were filtered out. This resulted in a total of 1953 SNPs available for all the 70 samples previously analyzed. Since intensity data were present for both alleles of each SNP and we were interested in the overall intensity (indicative of copy number) for each SNP position, intensities for both alleles were summed into single values. As in the previous analyses, we used the sample NA12891 as a reference and calculated log2-ratios between SNP intensities in this sample and all the other samples. Individual plots for all samples and variants were generated using R (29) and deviations from 0 (no copy number difference) were compared with the log2-ratio plots of the DOC analysis on the same variants.
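The intensity-based check can be sketched as follows. The direction of the ratio (sample over reference), the dictionary layout, the function names, and the toy intensities are assumptions of this example; the original plots were produced in R.

```python
import math


def snp_log2_ratios(sample, reference):
    """log2(sample/reference) of summed allele intensities per SNP.

    Both arguments map a SNP position to a pair (intensity_A, intensity_B).
    SNPs missing from either input or with zero total intensity are skipped.
    """
    ratios = {}
    for pos in sorted(sample.keys() & reference.keys()):
        sample_total = sum(sample[pos])
        reference_total = sum(reference[pos])
        if sample_total > 0 and reference_total > 0:
            ratios[pos] = math.log2(sample_total / reference_total)
    return ratios


def mean_ratio_in_region(ratios, start, end):
    """Average log2 ratio of the SNPs inside [start, end], or None if empty."""
    values = [v for pos, v in ratios.items() if start <= pos <= end]
    return sum(values) / len(values) if values else None


# Toy intensities: the sample has half the total signal at two SNPs.
reference = {100: (0.5, 0.5), 200: (0.75, 0.25), 300: (1.0, 0.0)}
sample = {100: (0.5, 0.5), 200: (0.25, 0.25), 300: (0.25, 0.25)}
ratios = snp_log2_ratios(sample, reference)
print(mean_ratio_in_region(ratios, 150, 350))  # -1.0
```

A mean ratio near 0 inside a candidate region suggests no copy number difference from the reference sample, whereas a consistent negative or positive shift supports a relative loss or gain.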
Experimental validation of large partial deletions in the AZFc region
Single PCR reactions using the sequence-tagged site (STS) markers sY1291, sY1191, sY1161, and a multiplex reaction including sY1206 and sY1201, were performed on all samples. The absence of the sY1291 product and the presence of the rest of the markers indicated the presence of a gr/gr deletion (30), whereas the absence of sY1191 and the presence of the rest indicated a b2/b3 (g1/g3) deletion (31,32). All samples shown to carry either of these deletions were tested a second time using single PCR reactions for all the markers and also using the combination of singleplex and multiplex reactions. All results were successfully confirmed in this way. Singleplex and multiplex PCR reactions were performed using 50 ng genomic DNA template, 2-8 pmol of each primer in 50 mM KCl, 10 mM Tris-HCl (pH 8.3), 1.5 mM MgCl2, 0.1% Triton-X100, 200 µM of each dNTP, and 1 unit of Taq polymerase (Promega, Madison, WI, USA) in a final volume of 20 µL. Amplification cycles consisted of an initial denaturation step at 94°C for 4 min, plus 35 cycles at 94°C for 30 s, annealing at 57°C, 61°C, 58°C (each for 45 s) and 72°C for 45 s, and a final extension of 72°C for 5 min. Reaction products were analyzed by agarose gel electrophoresis.
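The marker-based decision rule can be written as a small lookup. The function name and the labels for the "no deletion" and "other pattern" outcomes are assumptions of this sketch; the two deletion patterns follow the rule stated above.

```python
MARKERS = ("sY1291", "sY1191", "sY1161", "sY1206", "sY1201")


def classify_azfc(present):
    """Classify a sample from a dict mapping each marker to True/False.

    True means the PCR product was observed for that marker.
    """
    missing = {marker for marker in MARKERS if not present[marker]}
    if not missing:
        return "no deletion detected"
    if missing == {"sY1291"}:
        return "gr/gr deletion"
    if missing == {"sY1191"}:
        return "b2/b3 (g1/g3) deletion"
    return "other pattern"


all_present = dict.fromkeys(MARKERS, True)
print(classify_azfc({**all_present, "sY1291": False}))  # gr/gr deletion
print(classify_azfc({**all_present, "sY1191": False}))  # b2/b3 (g1/g3) deletion
```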
Experimental validation of deletions <10 kb
Single PCR reactions were performed on all deletions smaller than 10 kb that had no previous experimental validation. Primer3 (33) was used for designing the primers; these were located at a distance of 200-1000 bp outside the detected start and end of the variants. In-silico PCR and RepeatMasker tools from the UCSC Browser (34) were used to detect cases in which primers were predicted to generate more than one amplicon and/or were placed in highly repeated regions of the genome; such primers were avoided.
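As a rough illustration of the primer placement rule, the sketch below computes the two flanking windows in which primers would be searched and the approximate product sizes expected with and without the deletion. The function names are hypothetical, primer lengths are ignored, and the coordinates are those listed for SV 12 in Table 1; the actual primer design was done with Primer3.

```python
def primer_windows(sv_start, sv_end, min_dist=200, max_dist=1_000):
    """Return the left and right primer search windows as (start, end) tuples."""
    left = (sv_start - max_dist, sv_start - min_dist)
    right = (sv_end + min_dist, sv_end + max_dist)
    return left, right


def amplicon_size_range(sv_start, sv_end, deleted, min_dist=200, max_dist=1_000):
    """Approximate (smallest, largest) product size for a deletion assay.

    If the deletion is present, the product contains only the two flanks;
    otherwise it also contains the sequence between sv_start and sv_end.
    """
    sv_length = 0 if deleted else sv_end - sv_start
    return 2 * min_dist + sv_length, 2 * max_dist + sv_length


# SV 12 (81 bp deletion) as listed in Table 1.
start, end = 14_208_831, 14_208_912
print(primer_windows(start, end))
print(amplicon_size_range(start, end, deleted=False))  # (481, 2081)
print(amplicon_size_range(start, end, deleted=True))   # (400, 2000)
```

For a given primer pair the two alleles differ by the deletion length, so the size shift on the gel is what distinguishes carriers from non-carriers.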
PCRs were performed according to the length of the deletion in two different ways. 1) For deletions shorter than 1 kb, we used 10 ng of genomic DNA template, 5 pmol of each primer in 50 mM KCl, 10 mM Tris-HCl (pH 8.3), 3.5 mM MgCl2, 0.01% (w/v) gelatin, 250 µM of dNTPs, and 0.45 units of Taq polymerase (Applied Biosystems, Life Technologies, Waltham, MA, USA) in a final volume of 20 µL. Amplification cycles consisted of an initial denaturation step at 94°C for 4 min, plus 35 cycles at 94°C for 30 s, annealing at 57°C, 61°C, 58°C (each for 45 s) and 72°C for 45 s, and a final extension of 72°C for 5 min. 2) For deletions greater than 1 kb, we used 50 ng genomic DNA template, 10 pmol of each primer in 50 mM KCl, 10 mM Tris-HCl (pH 8.3), 1.5 mM MgCl2, 0.1% Triton-X100, 200 µM of each dNTP, and 1 unit of Taq polymerase in a final volume of 20 µL. Amplification consisted of an initial denaturation step at 94°C for 15 min, plus two rounds of 13 cycles each. The first round consisted of 94°C for 30 s, annealing at 68°C for 30 s and decreasing 0.5°C each cycle, and extension at 68°C for 10 min. The second round consisted of 94°C for 30 s, annealing at 58°C for 30 s, and extension at 68°C for 10 min. Reaction products were analyzed using agarose gel electrophoresis. SVs that failed validation are also reported (Supplementary Table 3 and Supplementary Table 4).
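The annealing temperatures of the two-round program for deletions greater than 1 kb can be tabulated with a short helper. The sketch assumes that the first cycle anneals at 68°C and that the 0.5°C decrement applies from the second cycle onward, which the text does not state explicitly; the function name is hypothetical.

```python
def annealing_schedule(start_temp=68.0, step=0.5, touchdown_cycles=13,
                       final_temp=58.0, final_cycles=13):
    """Annealing temperature (°C) for every cycle of the two-round program."""
    touchdown = [start_temp - step * i for i in range(touchdown_cycles)]
    constant = [final_temp] * final_cycles
    return touchdown + constant


schedule = annealing_schedule()
print(schedule[:13])  # first round: 68.0 down to 62.0 in 0.5 °C steps
print(schedule[13:])  # second round: 13 cycles at 58.0
```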
Data integration
Evidence from all analyses and data sources (8 in total) was integrated into a highly curated data set. All SVs reported in this analysis were supported by at least two different lines of evidence (Table 1). A full description of all SVs, including information from the multiple sources of evidence, is provided in Supplementary Table 3. Y-chromosomal haplogroup analysis was based on the ISOGG Y-DNA Haplogroup Tree 2015 (http://www.isogg.org/tree/ISOGG_YDNATreeTrunk.html) and sub-branches.
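The "at least two lines of evidence" criterion can be expressed as a count over the eight evidence columns of Table 1. The sketch below is illustrative only: the function names are hypothetical, and the flag strings are copied from the Table 1 rows for SV 01 and SV 16, plus one invented single-source candidate to show a rejection.

```python
EVIDENCE_COLUMNS = ("P1T", "CGT", "CGR", "P1R", "Ph1R", "OCT", "L", "PCR")


def supporting_sources(flags):
    """Names of the evidence columns marked '+' in a Table 1 style flag string."""
    symbols = flags.split()
    if len(symbols) != len(EVIDENCE_COLUMNS):
        raise ValueError("expected one symbol per evidence column")
    return [name for name, symbol in zip(EVIDENCE_COLUMNS, symbols) if symbol == "+"]


def passes_filter(flags, min_sources=2):
    """True if the candidate has at least min_sources supporting sources."""
    return len(supporting_sources(flags)) >= min_sources


print(supporting_sources("- - - - + - - +"))  # SV 01: ['Ph1R', 'PCR']
print(len(supporting_sources("+ + + - - + + +")))  # SV 16: 6
print(passes_filter("+ - - - - - - -"))  # invented single-source candidate: False
```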
RESULTS
We applied a combination of read-depth and read-pair methods to discover SVs in 70 Y chromosomes from Africa, Europe, and East Asia with both low and high sequence coverage levels. Our strategy included the pooling of closely-related samples by haplogroup in order to increase the number of high coverage samples in the data set. We then used comparisons between Illumina and Complete Genomics sequencing, SNP-chip data, established and novel PCRs, and compiled literature reports in order to validate and support our set of calls. This combined strategy resulted in a set of 19 highly-curated and validated SVs (Table 1; Supplementary Material, pages 1-26).
The SVs are numbered SV 01 to SV 19 according to their location on the chromosome (Figure 1, Table 1). Of this total, our methodology was able to detect 16 SVs, compared to 6, 1, and 4 that were reported in the same samples by Complete Genomics, Pilot 1 (23) and Phase 1 of the 1000 Genomes Project (24), respectively. Likewise, 8 SVs are unique to our analysis, compared to 0, 1, and 1 to the other reports (Table 1). Importantly, 2 of the 16 SVs detected by our methodology have not been reported and validated before. These numbers highlight the low level of attention paid to the Y chromosome by previous studies of these samples, coupled with the difficulty of correctly identifying large CNVs in the repeat-rich Y chromosome.
Our data set contains SVs with sizes ranging from 81 bp to 1.9 Mb. Of the 19 SVs detected, 14 were larger than 10 kb, 4 were between 5 kb and 10 kb, and one was below 100 bp. A bias toward identifying large SVs is evident, and the approaches based on read-depth that are more sensitive to large events stand out for their efficacy in identifying such events, with 16 SVs being detected, in comparison to only one using paired-end approaches. In terms of the nature of the events, a slight bias toward deletions was observed: 12 of the SVs identified showed a higher proportion of deletions vs duplications among samples, with 8 SVs showing no duplication signal at all. Non-reference allele counts ranged between one and 64 among the 70 Y chromosomes. SVs with the highest observed non-reference allele frequencies were largely but not entirely concentrated in the heterochromatic segments near the centromere and DYZ19 locus. In contrast, low-frequency SVs, among which one is novel, were more associated with X-degenerate and ampliconic segments. Since the haplogroup assignments of all the Y chromosomes studied here have previously been identified (23) (Figure 2), this information is used below, along with location, size, and frequency, in presenting each SV.
Two SVs were found to overlap with the X-transposed regions of the Y: SV 01 was reported in the Phase 1 release and corresponded to a high-frequency 2 kb deletion present only in haplogroup O3a-P199 (Figure 2). Although we did not discover this variant using our approaches, we successfully validated it using PCR (Supplementary Material, page 2). SV 02 corresponded to a large (30 kb) and previously undescribed duplication present in 22 individuals from the CEU, CHB, and YRI populations within haplogroups O and R. This duplication is located at the breakpoints of two segmental duplications with high levels of sequence identity (>99%) and has evidence from both low and high coverage samples as well as from the pooling of individuals in haplogroup R1-P234 (Figure 3). The specific Y-haplogroup distributions of these two SVs provide strong evidence for their location on the Y-chromosomal, rather than X-chromosomal, copy of this transposed region.
Eight SVs were found in the ampliconic segments: SV 03 corresponded to a large (240 kb) and previously undescribed SV present only in one individual belonging to haplogroup N-M231. SVs 04 and 05 corresponded to variation in the TSPY arrays on the Y. Only the TSPY array overlapping SV 05 was previously known to be variable (13,14,17), but since the two arrays are composed of the same repetitive sequences, read mapping is expected to occur equally well at both (“shadowing”), thus making the array overlapping SV 04 appear variable as well. SV 06 overlapped the non-coding RNA TTTY22 and corresponded to a 10 kb deletion present only in YRI individuals within haplogroup E, and was previously reported only in the Complete Genomics release but not validated by other sources until now. This deletion is also located at the breakpoints of two segmental duplications with high levels of sequence identity (>99%) and is supported by evidence from both low and high coverage samples as well as from the pooling of individuals in haplogroup E1b1a1a1-P182; validation by SNP-chip data was also observed (Figure 4). SVs 16 and 18 corresponded to the well-known large gr/gr and b2/b3 (g1/g3) deletions, respectively (30-32). SVs 17 and 19 mapped to the well-known DAZ-repeat regions within the DAZ 1-2 and DAZ 3-4 genes (35), respectively. These four SVs (16-19) were present at high frequency across all populations studied, with the exception of SV 18, which was present at low frequency in the CEU and CHB populations only. They were all present in multiple haplogroups, and most were variable within each haplogroup, indicating their multiple origins, although the previous observation of fixation of the gr/gr deletion in haplogroup D (30) was replicated here.

FIGURE 1. Schematic representation of the Y chromosome and the structural variants (SVs) detected. Y-chromosome left bar: the two extreme tips in green correspond to the two pseudoautosomal regions, and the rest of the chromosome, made up of two blue (euchromatic) and dark-gray (heterochromatic) sections, to the male-specific region. (A) Gaps/centromere: all dark-gray bars indicate gaps in the Y reference including the long centromeric region at ~10-13 Mb. Sequence classes: pseudoautosomal (green), X-degenerate (light-yellow), X-transposed (light-pink), ampliconic (light-blue), and heterochromatic (dark-gray) regions are indicated along the chromosome. (B) SV locations: approximate locations of the 19 SVs described in this work. (C) Genomic features: relevant previous information available about some SVs; SVs not previously described are labeled as “New”. (D) The eight tracks shown in this section indicate the different sources of evidence for each of the SVs. In all cases (A, B, C, and D) the regions beyond ~28 Mb (shown in panel A as gray and green blocks) correspond to heterochromatic regions variable in size, and the pseudoautosomal region in the long arm of the Y, respectively.
Seven SVs were present in the heterochromatic regions forming three different groups: the first group (SVs 07 and 08) was located in the periphery of the centromeric region on the short arm of the chromosome. These variants corresponded to regions rich in alphoid repeats and were present at high frequency. The second group (SVs 09, 10, and 11) was located at the opposite edge of the centromere, on the long arm. These regions corresponded to repeats of variable length and also showed high-frequency variation. The third group (SVs 14 and 15) was located on the long arm of the chromosome and corresponded to the highly variable region DYZ19. SV 14 in fact was the most variable region in this study, with non-reference structures detected in 64 samples.

FIGURE 2. Phylogenetic framework for the study of Y-structural variants (SVs). (A) Branches of the Y phylogeny present in this work. Haplogroup names and number of samples within each branch are shown at the bottom of the panel. Background colors represent the population that each of the haplogroups belongs to; red indicates JPT, purple YRI, blue CEU, and green CHB (acronyms are explained in the Methods). Haplogroups O2b and O3a contain samples that belong to both JPT and CHB, indicated by the green and red stripes. (B) Each row represents one of the 19 SVs reported in this work. There are four sections of information for each row. From left to right: 1) ID of the variant. 2) Relevant information available for the variant. 3) Pie charts are shown for all haplogroups that carry the variant. Black areas within these pie charts represent the proportion of samples containing the variant compared with the total number of individuals in each haplogroup. 4) Horizontal black bars on the right of the panel show the total number of individuals that carry the variant.
Finally, 2 SVs were found in the X-degenerate regions of the Y, both at very low frequency. SV 12 is an 81 bp deletion present only in 4 individuals from the European I1-M307 haplogroup and corresponded to the only region reported and validated using PCR by the Pilot 1 release (24). It was also the only SV in our validated set detected using paired-end approaches. SV 13 is a 5 kb deletion present in all three samples from the Japanese haplogroup C-M216 (including the pooled sample for this haplogroup), and is specific to this haplogroup. PCR validation was successfully conducted on this variant (Supplementary Material, page 18). This variant was not associated with segmental duplications at the breakpoints, and the one SNP overlapping the region was found to support the deletion in all deleted samples (Figure 5).
FIGURE 3. Evidence supporting the duplication structural variant (SV) 02. From top to bottom: 1) Read-depth analysis. Read-depth plots for Pilot 1 sample NA18561 (blue), Pilot 1 pooled sample Haplogroup R1 (red), and Complete Genomics sample NA18940 (green). 2) Segmental duplications (from the UCSC genome browser). Bars colored in gray, dark-yellow, and dark-orange correspond to duplications with 90%-98%, 98%-99%, and >99% sequence identity, respectively. All comparisons in the read-depth plots are expressed as log2-ratios and use the reference individual NA12891. Red vertical dotted lines indicate the approximate start and end positions of the SV (see Table 1).

At the population level, the highest variability was found in YRI samples, whereas the lowest was in individuals from the CHB population. We also found that 12 (63%) of the SVs detected overlapped segmental duplications (SDs). Roughly 36% of the Y chromosome is composed of SDs (36), so the proportion of SVs observed to overlap SDs is larger than expected for a random distribution.
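The comparison of 63% observed against roughly 36% expected can be made concrete with a one-sided binomial calculation. This is only a sketch and not a test reported by the authors: it assumes each SV independently overlaps an SD with probability equal to the SD fraction of the chromosome, and it ignores SV length (large SVs are more likely to touch an SD by chance). The function name `binomial_upper_tail` is introduced here for illustration.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_svs = 19          # validated SVs (from the text)
n_overlap = 12      # SVs overlapping segmental duplications (from the text)
sd_fraction = 0.36  # approximate SD fraction of the Y chromosome (from the text)

expected_overlap = n_svs * sd_fraction
p_value = binomial_upper_tail(n_overlap, n_svs, sd_fraction)

print(f"expected overlaps under the null: {expected_overlap:.2f}")
print(f"observed overlaps: {n_overlap}")
print(f"one-sided binomial P(X >= {n_overlap}): {p_value:.4f}")
```

Under these assumptions about 6.8 of the 19 SVs would be expected to overlap SDs, compared with the 12 observed.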
The Y chromosome codes for only 27 distinct proteins, although the genes for several of these are present in multiple copies (36). Despite this very low gene density, 6 of the 19 SVs (SVs 04-05, 16-19) overlapped with 5 of the gene families (BPY, CDY, DAZ, PRY, TSPY), and a seventh SV (SV 06) with the long non-coding RNA TTTY22.

FIGURE 4. Evidence supporting the deletion structural variant (SV) 06. From top to bottom: 1) Read-depth analysis. Read-depth plots for Pilot 1 sample NA18501 (blue), Pilot 1 pooled sample Haplogroup E1b1a1a1 (red), and Complete Genomics sample NA18501 (green). 2) Segmental duplications. Bars colored in gray, dark-yellow, and dark-orange correspond to duplications with 90%-98%, 98%-99%, and >99% sequence identity, respectively. 3) OMNI Chip data. Single nucleotide polymorphism (SNP) intensities of sample NA18501 at the variant location. Blue horizontal lines are positioned at log2-ratios -0.6 and 0.6. All comparisons in the read-depth and SNP intensity plots are expressed as log2-ratios and use the reference individual NA12891. Red vertical dotted lines indicate the start and end positions of the SV (Table 1).
FIGURE 5. Evidence supporting the deletion structural variant (SV) 13. From top to bottom: 1) Read-depth analysis. Read-depth plots for Pilot 1 sample NA18974 (blue) and Pilot 1 pooled sample from Haplogroup C (red). 2) Segmental duplications. Bars colored in gray correspond to duplications with 90%-98% sequence identity. 3) OMNI Chip data. Single nucleotide polymorphism (SNP) intensities of sample NA18501 at the variant location. Blue horizontal lines are positioned at log2-ratios -0.6 and 0.6. All comparisons in the read-depth and SNP intensity plots are expressed as log2-ratios and use the reference individual NA12891. Red vertical dotted lines indicate start and end positions of the variant (Table 1).

DISCUSSION
The Y chromosome has been under-represented in recent studies of human SVs, particularly those based on sequence data, where a total of just 5 Y-SVs were reported in two large studies (7,24); moreover, all those Y-SVs already known to be present at high frequencies in some or all of the populations investigated (5,13,15,16) were not called in the sequence-based studies, although their large sizes should have made them readily detectable. Here, we showed that all of these common known SVs can be successfully identified using current data when appropriate analysis methods are applied.
Our ndings emphasize the major roles of both highly-
repeated heterochromatic regions and also segmental
duplications in providing a sequence environment for SV
generation. The surrounding sequence and haplogroup
distribution can provide insights into the mutational mech-
anisms that may have generated these Y-SVs (8). If the sur-
rounding sequences are repeated, non-allelic homologous
recombination (NAHR) may occur, duplicating or deleting
the region between the repeats; if they are not repeated,
SVs may be generated by a number of other mechanisms,
the most relevant of which here is non-homologous end
joining (NHEJ ). NAHR-generated SVs are quite likely to re-
cur, and duplications can revert, while NHEJ-generated SVs
are less likely to do either. The well-established Y-chromo-
somal phylogeny (22) allows us to ask whether SV haplo-
group distributions are consistent with a single mutational
event, or require more than one event, such as recurrence
of the same mutation or reversion, to explain their distri-
bution. For example, SV 01, where all four variant alleles
lie within haplogroup O3a-P199 (Figure 2) is consistent
with a single mutational origin. Moreover, this origin can
be placed in time after the mutation that created the SNP
dening haplogroup O3a-P199. In contrast, SV 15, where
both reference and non-reference alleles are found in all
haplogroups examined here (Figure 2) must have experi-
enced at least 10 mutational events. Applying this reason-
ing to all 19 SVs shows that just four (SVs 01, 03, 12, and
13) could have arisen by single mutations, while all the
others have more complex multi-mutational origins. Fur-
thermore, these four variants do not show any segmental
duplication structure surrounding the breakpoints, in con-
cordance with a single mutational origin associated with
NHEJ. In contrast, 11 of the remaining 15 SVs are associated with SDs and are thus likely to have arisen by NAHR.
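The counting argument above (a single event when the variant is confined to one clade, at least one event per haplogroup when every haplogroup carries both alleles) is a parsimony argument. The paper does not state a specific procedure, so the following is only an illustration: Fitch's small-parsimony algorithm, a standard technique swapped in here, returns the minimum number of state changes on a fixed rooted binary tree. The toy trees and the labels "ref" and "var" are hypothetical and do not reproduce the phylogeny in Figure 2.

```python
def fitch(node):
    """Return (possible_states, minimum_changes) for a rooted binary tree.

    A leaf is a string state ("ref" or "var"); an internal node is a
    2-tuple of subtrees.
    """
    if isinstance(node, str):
        return {node}, 0
    left, right = node
    left_states, left_changes = fitch(left)
    right_states, right_changes = fitch(right)
    shared = left_states & right_states
    if shared:
        # Children can agree on a state, so no extra change is needed here.
        return shared, left_changes + right_changes
    # Children disagree: one additional change on a branch below this node.
    return left_states | right_states, left_changes + right_changes + 1


# Hypothetical toy phylogenies: three haplogroups with two samples each.
confined = ((("var", "var"), ("ref", "ref")), ("ref", "ref"))
polymorphic_everywhere = ((("ref", "var"), ("ref", "var")), ("ref", "var"))

print(fitch(confined)[1])
print(fitch(polymorphic_everywhere)[1])
```

In the first toy tree the variant is confined to one haplogroup and one change suffices; in the second, every haplogroup is polymorphic and the minimum equals the number of haplogroups, which mirrors the reasoning applied to SV 01 and SV 15. Fitch parsimony weights gains and reversions equally, which is a simplification given that the text argues duplications revert more readily than they arise.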
In some cases we can infer the likely sequence of mutations in more detail. For example, SV 02 occurs only in the related haplogroups O2b-SRY465, O3a-P199, R1-P234, and R1b1a2a1a2b-U152, yet both reference and duplicated alleles occur in all four haplogroups. Its absence from the 42 chromosomes belonging to other haplogroups suggests that its origin (ie, the duplication event) may be rare, and have occurred only once, in the common ancestor of haplogroups O and R. But the occurrence of reference alleles in haplogroup N and each of the four O and R haplogroups suggests that reversions to the reference state are more frequent, and have occurred at least five times, ie, once in each haplogroup.
The observation that 6 (SVs 04, 05, 16-19) of the 19 SVs affect the copy number of 5 gene families is particularly striking in the context of the very low gene density on the Y chromosome. These SVs are relatively common, with non-reference allele counts ranging from 3 (SV 18) to 19 (SVs 17 and 19) of the 70 Y-chromosomes. Several factors contribute to this situation. First, several of the SVs are large, two duplicating or deleting 1.6 and 1.9 Mb of the 24 Mb of the male-specific Y euchromatin, so there is an increased chance of them affecting genes. Second, each of the SVs probably affects only some members of any gene family, although for SVs 17 and 19, which both reflect differences in copy number of the DAZ repeat domain, this is uncertain because of the shadowing effect mentioned earlier. Third, the known functions of all of the 5 genes are linked to spermatogenesis, a phenotype which is variable in the population and where duplication or deletion of a few of the contributing genes may have only minor effects; indeed, only the gr/gr deletion (SV 16) and TSPY copy number variation (SVs 04 and 05) have been linked to a slightly increased risk of spermatogenic failure in some populations (19,30), while the b2/b3 (g1/g3) deletion (SV 18) and DAZ repeat variation (SVs 17 and 19) appear to be neutral (31,32,35). The phenotypes of the 1000 Genomes donors are unknown, but our findings suggest that the 5 gr/gr deletion carriers outside haplogroup D and the individuals with the lowest TSPY copy number may be at increased risk of spermatogenic impairment.
This survey of Y-chromosomal SVs is likely to be complete for large common SVs in the haplogroups included in these population samples and the regions of the chromosome accessible to current sequencing technology. Nevertheless, populations carry far more small rare SVs (7), and so the survey is likely to be very incomplete for this class. SVs smaller than 5 kb were not detected by the major discovery approach used here, read depth, although 2 that were discovered in other ways (7,23) are included in the data set; 5 more candidates were detected by our paired-end analysis, but all turned out to be false positives after PCR testing. Despite the limitation in discovering small CNVs from the low-coverage sequence data used, this work demonstrates the power of using sequence data for SV discovery and analysis on the Y chromosome, and points to the need for larger and more comprehensive surveys. Indeed, the 1000 Genomes Project has itself increased the sequence coverage and sample size considerably, so such improved studies are now possible.
Acknowledgments We thank the donors of the 1000 Genomes Project samples for making this work possible, Jan Korbel for helpful comments and suggestions regarding the SNP-chip analysis, and Andrea Massaia for comments on the manuscript. We also thank The Wellcome Trust (grant 098051) and in particular the Wellcome Trust Trustees Committee for providing a Sanger Prize Fellowship to JRFE.
Funding received from The Wellcome Trust Sanger Institute Sanger Prize committee (to JRFE); The Wellcome Trust grant number 098051.
Ethical approval provided for the 1000 Genomes Project (23,24). The data
and samples were then fully open access, and no additional ethical review
or approval was required.
Declaration of authorship Study design: JRFE, YX, CTS; Data analysis: JRFE, QA, YC, YX; Experimental work: JRFE, QA; Manuscript drafting: JRFE, CTS; Manuscript approval: All.
Competing interests All authors have completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organization for the submitted work; no financial relationships with any organizations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.
References
1 Alkan C, Coe BP, Eichler EE. Genome structural variation
discovery and genotyping. Nat Rev Genet. 2011;12:363-76.
Medline:21358748 doi:10.1038/nrg2958
2 Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne
N, et al. Relative impact of nucleotide and copy number variation
on gene expression phenotypes. Science. 2007;315:848-53.
Medline:17289997 doi:10.1126/science.1136678
3 MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris
J, Walter K, et al. A systematic survey of loss-of-function variants
in human protein-coding genes. Science. 2012;335:823-8.
Medline:22344438 doi:10.1126/science.1215040
4 Lupski JR, Stankiewicz PT, eds. Genomic disorders: the genomic
basis of disease. Totowa, NJ, USA: Humana Press; 2006.
5 Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, et al.
Global variation in copy number in the human genome. Nature.
2006;444:444-54. Medline:17122850
6 Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, et al.
Origins and functional impact of copy number variation in the
human genome. Nature. 2010;464:704-12. Medline:19812545
doi:10.1038/nature08516
7 Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C,
et al. Mapping copy number variation by population-scale
genome sequencing. Nature. 2011;470:59-65. Medline:21293372
doi:10.1038/nature09708
8 Jobling MA. Copy number variation on the human Y chromosome.
Cytogenet Genome Res. 2008;123:253-62. Medline:19287162
doi:10.1159/000184715
9 Sandberg AA, ed. The Y chromosome. Part A: basic characteristics
of the Y chromosome. New York, NY, USA: Alan R. Liss inc; 1985.
10 Kayser M, Kittler R, Erler A, Hedman M, Lee AC, Mohyuddin A, et al.
A comprehensive survey of human Y-chromosomal microsatellites.
Am J Hum Genet. 2004;74:1183-97. Medline:15195656
doi:10.1086/421531
11 Jobling MA, Bouzekri N, Taylor PG. Hypervariable digital DNA
codes for human paternal lineages: MVR-PCR at the Y-specific
minisatellite, MSY1 (DYF155S1). Hum Mol Genet. 1998;7:643-53.
Medline:9499417 doi:10.1093/hmg/7.4.643
12 Bao W, Zhu S, Pandya A, Zerjal T, Xu J, Shu Q, et al. MSY2: a slowly
evolving minisatellite on the human Y chromosome which
provides a useful polymorphic marker in Chinese populations.
Gene. 2000;244:29-33. Medline:10689184 doi:10.1016/S0378-
1119(00)00021-4
13 Oakey R, Tyler-Smith C. Y chromosome DNA haplotyping suggests
that most European and Asian men are descended from one
of two males. Genomics. 1990;7:325-30. Medline:1973137
doi:10.1016/0888-7543(90)90165-Q
14 Mathias N, Bayes M, Tyler-Smith C. Highly informative compound
haplotypes for the human Y chromosome. Hum Mol Genet.
1994;3:115-23. Medline:7909247 doi:10.1093/hmg/3.1.115
15 Tyler-Smith C, Taylor L, Muller U. Structure of a hypervariable
tandemly repeated DNA sequence on the short arm of the human
Y chromosome. J Mol Biol. 1988;203:837-48. Medline:3210241
doi:10.1016/0022-2836(88)90110-6
16 Jobling MA, Samara V, Pandya A, Fretwell N, Bernasconi B, Mitchell
RJ, et al. Recurrent duplication and deletion polymorphisms on the
long arm of the Y chromosome in normal males. Hum Mol Genet.
1996;5:1767-75. Medline:8923005 doi:10.1093/hmg/5.11.1767
17 Repping S, van Daalen SK, Brown LG, Korver CM, Lange J,
Marszalek JD, et al. High mutation rates have driven extensive
structural polymorphism among human Y chromosomes. Nat
Genet. 2006;38:463-7. Medline:16501575 doi:10.1038/ng1754
18 Vogt PH, Edelmann A, Kirsch S, Henegariu O, Hirschmann P,
Kiesewetter F, et al. Human Y chromosome azoospermia factors
(AZF) mapped to different subregions in Yq11. Hum Mol Genet.
1996;5:933-43. Medline:8817327 doi:10.1093/hmg/5.7.933
19 Giachini C, Nuti F, Turner DJ, Laface I, Xue Y, Daguin F, et al.
TSPY1 copy number variation influences spermatogenesis and
shows differences among Y lineages. J Clin Endocrinol Metab.
2009;94:4016-22. Medline:19773397 doi:10.1210/jc.2009-1029
20 Santos FR, Pandya A, Tyler-Smith C. Reliability of DNA-based sex
tests. Nat Genet. 1998;18:103. Medline:9462733 doi:10.1038/
ng0298-103
21 Jobling MA, Lo IC, Turner DJ, Bowden GR, Lee AC, Xue Y, et al.
Structural variation on the short arm of the human Y chromosome:
recurrent multigene deletions encompassing Amelogenin Y. Hum
Mol Genet. 2007;16:307-16. Medline:17189292 doi:10.1093/hmg/
ddl465
22 Jobling MA, Tyler-Smith C. The human Y chromosome: an
evolutionary marker comes of age. Nat Rev Genet. 2003;4:598-612.
Medline:12897772 doi:10.1038/nrg1124
23 The 1000 Genomes Project Consortium. A map of human
genome variation from population-scale sequencing. Nature.
2010;467:1061-73. Medline:20981092
24 The 1000 Genomes Project Consortium. An integrated map
of genetic variation from 1,092 human genomes. Nature.
2012;491:56-65. Medline:23128226
25 Xie C, Tammi MT. CNV-seq, a new method to detect copy number
variation using high-throughput sequencing. BMC Bioinformatics.
2009;10:80. Medline:19267900 doi:10.1186/1471-2105-10-80
26 Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS,
et al. BreakDancer: an algorithm for high-resolution mapping
of genomic structural variation. Nat Methods. 2009;6:677-81.
Medline:19668202 doi:10.1038/nmeth.1363
27 Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The
Sequence Alignment/Map format and SAMtools. Bioinformatics.
2009;25:2078-9. Medline:19505943 doi:10.1093/bioinformatics/
btp352
28 Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander
ES, Getz G, et al. Integrative genomics viewer. Nat Biotechnol.
2011;29:24-6. Medline:21221095 doi:10.1038/nbt.1754
29 R Development Core Team. R: a language and environment for
statistical computing. 2008. Available from: http://www.R-project.org.
Accessed: May 26, 2015.
30 Repping S, Skaletsky H, Brown L, van Daalen SK, Korver CM,
Pyntikova T, et al. Polymorphism for a 1.6-Mb deletion of the
human Y chromosome persists through balance between
recurrent mutation and haploid selection. Nat Genet. 2003;35:247-
51. Medline:14528305 doi:10.1038/ng1250
31 Fernandes S, Paracchini S, Meyer LH, Floridia G, Tyler-Smith C, Vogt
PH. A large AZFc deletion removes DAZ3/DAZ4 and nearby genes
from men in Y haplogroup N. Am J Hum Genet. 2004;74:180-7.
Medline:14639527 doi:10.1086/381132
32 Repping S, van Daalen SK, Korver CM, Brown LG, Marszalek
JD, Gianotten J, et al. A family of human Y chromosomes has
dispersed throughout northern Eurasia despite a 1.8-Mb deletion
in the azoospermia factor c region. Genomics. 2004;83:1046-52.
Medline:15177557 doi:10.1016/j.ygeno.2003.12.018
33 Koressaar T, Remm M. Enhancements and modifications of
primer design program Primer3. Bioinformatics. 2007;23:1289-91.
Medline:17379693 doi:10.1093/bioinformatics/btm091
34 Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler
AM, et al. The human genome browser at UCSC. Genome Res.
2002;12:996-1006. Medline:12045153 doi:10.1101/gr.229102.
Article published online before print in May 2002
35 Saxena R, de Vries JW, Repping S, Alagappan RK, Skaletsky H,
Brown LG, et al. Four DAZ genes in two clusters found in the AZFc
region of the human Y chromosome. Genomics. 2000;67:256-67.
Medline:10936047 doi:10.1006/geno.2000.6260
36 Skaletsky H, Kuroda-Kawaguchi T, Minx PJ, Cordum HS, Hillier
L, Brown LG, et al. The male-specific region of the human Y
chromosome is a mosaic of discrete sequence classes. Nature.
2003;423:825-37. Medline:12815422 doi:10.1038/nature01722
... 5,18,19 Later studies, bolstered by developing technology, described many types of CNVs in larger numbers of men. 11,[20][21][22][23][24] Amplicon CNVs have recently been discovered on the Y chromosomes of chimpanzees, macaques, gorillas, and mice. [25][26][27][28] Some amplicon CNVs have been implicated in spermatogenic failure, sex reversal, Turner syndrome, and testis cancer. ...
... [5][6][7][8][9][10][11][12] However, the amplicon CNVs with well-described phenotypes represent only a small part of the spectrum of amplicon variation; the vast majority of amplicon CNVs that have been discovered have no known effect on spermatogenesis or any other trait. [20][21][22][23] Even though amplicon CNVs have been the subject of intense investigation, most previous studies made only nominal attempts to reconstruct the evolution of the amplicons, instead focusing on documenting amplicon variation. Here, we present a detailed reconstruction of Y chromosome amplicon evolution in humans. ...
... 58 The true breadth of amplicon copy number variation has been revealed by recent surveys. [20][21][22][23] In accordance with these studies, we found that most amplicon CNVs in the general population do not fall within the small set of CNVs with confirmed phenotypes, and that duplications are more common than deletions. Our results suggest that most or all amplicon CNVs have phenotypic effects that cause selection to remove them from the population. ...
Article
Amplicons-large, highly identical segmental duplications-are a prominent feature of mammalian Y chromosomes. Although they encode genes essential for fertility, these amplicons differ vastly between species, and little is known about the selective constraints acting on them. Here, we develop computational tools to detect amplicon copy number with unprecedented accuracy from high-throughput sequencing data. We find that one-sixth (16.9%) of 1,216 males from the 1000 Genomes Project have at least one deleted or duplicated amplicon. However, each amplicon's reference copy number is scrupulously maintained among divergent branches of the Y chromosome phylogeny, including the ancient branch A00, indicating that the reference copy number is ancestral to all modern human Y chromosomes. Using phylogenetic analyses and simulations, we demonstrate that this pattern of variation is incompatible with neutral evolution and instead displays hallmarks of mutation-selection balance. We also observe cases of amplicon rescue, in which deleted amplicons are restored through subsequent duplications. These results indicate that, contrary to the lack of constraint suggested by the differences between species, natural selection has suppressed amplicon copy number variation in diverse human lineages.
... Next generation sequencing (NGS) is now established as the prime data-generation method in genomics, and Y chromosome CNV analysis is no exception. NGS offers high throughput comparable to, and even higher than, microarray-based methods; it can potentially achieve base-pair resolution, and indeed has been employed recently as the main data source to study CNVs in the Y chromosome (Espinosa et al. 2015;Poznik et al. 2016) as well as in the rest of the genome (Hehir-Kwa et al. 2016;Mills et al. 2011;Sudmant et al. 2015). It should be noted, however, that the Y chromosome genomic context amplifies NGS's intrinsic limitations. ...
... In the same year, a study of Y-CNVs inferred from sequence data in samples from the 1000 Genomes pilot phase (Espinosa et al. 2015) was published. The sample set consisted of 70 males from four populations (YRI, CEU, CHB and JPT) sequenced at 2.3× average depth; ten samples at variable depth, obtained by merging sequencing data from subsets of the same males belonging to the same haplogroup; and eight samples from the Complete Genomics Public Data set (v36 v2.0.0), at high (25.4×) ...
Article
Full-text available
The human Y chromosome provides a fertile ground for structural rearrangements owing to its haploidy and high content of repeated sequences. The methodologies used for copy number variation (CNV) studies have developed over the years. Low-throughput techniques based on direct observation of rearrangements were developed early on, and are still used, often to complement array-based or sequencing approaches which have limited power in regions with high repeat content and specifically in the presence of long, identical repeats, such as those found in human sex chromosomes. Some specific rearrangements have been investigated for decades; because of their effects on fertility, or their outstanding evolutionary features, the interest in these has not diminished. However, following the flourishing of large-scale genomics, several studies have investigated CNVs across the whole chromosome. These studies sometimes employ data generated within large genomic projects such as the DDD study or the 1000 Genomes Project, and often survey large samples of healthy individuals without any prior selection. Novel technologies based on sequencing long molecules and combinations of technologies, promise to stimulate the study of Y-CNVs in the immediate future.
... Although beyond the scope of this work, another interesting area that requires further investigation is how different CNV calling tools perform based on various SV sizes and read coverage, both of which are known to affect detection and accuracy of SV calling [15,16,50]. In a similar manner, the distributions of SVs in biological regions (e.g., sex chromosomes) may require special attention, as specific SVs have been known to occur in a particular gender (e.g., [51,52]). Lastly, transcriptional regulation of the altered regions requires more investigations, so that the causative effect of CNVs can be elucidated, and potentially be predicted in each case. ...
Article
Full-text available
Copy-number variations (CNVs) have important clinical implications for several diseases and cancers. Relevant CNVs are hard to detect because common structural variations define large parts of the human genome. CNV calling from short-read sequencing would allow single protocol full genomic profiling. We reviewed 50 popular CNV calling tools and included 11 tools for benchmarking in a reference cohort encompassing 39 whole genome sequencing (WGS) samples paired current clinical standard—SNP-array based CNV calling. Additionally, for nine samples we also performed whole exome sequencing (WES), to address the effect of sequencing protocol on CNV calling. Furthermore, we included Gold Standard reference sample NA12878, and tested 12 samples with CNVs confirmed by multiplex ligation-dependent probe amplification (MLPA). Tool performance varied greatly in the number of called CNVs and bias for CNV lengths. Some tools had near-perfect recall of CNVs from arrays for some samples, but poor precision. Several tools had better performance for NA12878, which could be a result of overfitting. We suggest combining the best tools also based on different methodologies: GATK gCNV, Lumpy, DELLY, and cn.MOPS. Reducing the total number of called variants could potentially be assisted by the use of background panels for filtering of frequently called variants.
... Next-generation sequencing (NGS) allows the detection of variants or mutations by DNA or RNA sequencing, and the potential of base-pair resolution, and was recently the primary source of data for studying CNVs on the Y chromosome [3,11] and other parts of the genome [12,13]. However, the genomic context of Y chromosome amplifies the inherent limitations of NGS. ...
Article
Full-text available
PurposeTo provide a validated method to identify copy number variation (CNV) in regions of the Y chromosome of infertile men by next-generation sequencing (NGS).Methods Semen analysis was used to determine the quality of semen and diagnose infertility. Deletion of the azoospermia factor (AZF) region in the Y chromosome was detected by a routine sequence-tagged-site PCR (STS-PCR) method. We then used the NGS method to detect CNV in the AZF region, including deletions and duplications.ResultsA total of 326 samples from male infertility patients, family members, and sperm donors were studied between January 2011 and May 2017. AZF microdeletions were detected in 120 patients by STS-PCR, and these results were consistent with the results from NGS. In addition, of the 160 patients and male family members who had no microdeletions detected by STS-PCR, 51 cases were found to exhibit Y chromosome structural variations by the NGS method (31.88%, 51/160). No microdeletions were found in 46 donors by STS-PCR, but the NGS method revealed 11 of these donors (23.91%, 11/46) carried structural variations, which were mainly in the AZFc region, including partial deletions and duplications.Conclusion The established NGS method can replace the conventional STS-PCR method to detect Y chromosome microdeletions. The NGS method can detect CNV, such as partial deletion or duplication, and provide details of the abnormal range and size of variations.
... One of the major advantages of whole-genome sequencing is the ability to detect both small and large variants, including structural and CNVs. Such approaches have been used recently on the Ychromosome to achieve break-point resolution for CNVs [51,52], although no new causative genes have been identified to date. The ultra-repetitive nature of the Y-chromosome, which is rich in repeated elements and segmental duplications [53], makes CNV detection challenging using whole-genome sequencing data, in particular in terms of accurate mapping of short reads. ...
Article
Full-text available
Objectives: To identify the role of next-generation sequencing (NGS) in male infertility, as advances in NGS technologies have contributed to the identification of novel genes responsible for a wide variety of human conditions and recently has been applied to male infertility, allowing new genetic factors to be discovered. Materials and methods: PubMed was searched for combinations of the following terms: ‘exome’, ‘genome’, ‘panel’, ‘sequencing’, ‘whole-exome sequencing’, ‘whole-genome sequencing’, ‘next-generation sequencing’, ‘azoospermia’, ‘oligospermia’, ‘asthenospermia’, ‘teratospermia’, ‘spermatogenesis’, and ‘male infertility’, to identify studies in which NGS technologies were used to discover variants causing male infertility. Results: Altogether, 23 studies were found in which the primary mode of variant discovery was an NGS-based technology. These studies were mostly focused on patients with quantitative sperm abnormalities (non-obstructive azoospermia and oligospermia), followed by morphological and motility defects. Combined, these studies uncover variants in 28 genes causing male infertility discovered by NGS methods. Conclusions: Male infertility is a condition that is genetically heterogeneous, and therefore remarkably amenable to study by NGS. Although some headway has been made, given the high incidence of this condition despite its detrimental effect on reproductive fitness, there is significant potential for further discoveries.
Chapter
Klinefelter syndrome (KS) is the most common X chromosome abnormality encountered in men with infertility. The classic form 47,XXY accounts for 80–90% of cases. An error of nondisjunction during gametogenesis provides the extra X chromosome in KS men. The various karyotype variants of KS share the same features of hypergonadotropic hypogonadism but present with more distinct physical, medical and psychological features than the classic form. Spermatogenesis appears to be intact during infancy up to the prepubertal period; however, it progressively declines during adulthood. Learning and behavioural challenges are frequent at an earlier age, while androgen deficiency and infertility are usually encountered at an advanced age. Testosterone replacement therapy is the cornerstone to address hypogonadism to enhance the quality of life and prevent the long-term complications of the androgen-deficient state. Intracytoplasmic sperm injection (ICSI) is a major breakthrough in the treatment of infertility, particularly in KS men. Sperm can be found in the ejaculate in patients with KS and can be used for assisted reproductive technology. However, microdissection testicular sperm extraction (TESE) is the mainstay procedure of choice for sperm retrieval for ICSI, as most KS men present with non-obstructive azoospermia. This procedure provides significantly superior overall outcomes compared to other sperm retrieval techniques. Although men with sex chromosomal abnormalities have a low risk of producing offspring with the same abnormalities after ICSI, genetic counselling should be performed in a multidisciplinary approach to enhance the quality of life and overall health status of KS men.
Chapter
Male infertility affects 7% of all men worldwide, yet for the majority the underlying cause is not found. From the very first identified karyotyping abnormalities to the very recent discovery of point mutations disrupting spermatogenesis, it is clear that a substantial number of patients suffer from genetic abnormalities. However, the discovery of these causes has largely been limited by the resolutions of the technologies used for patient assessment. In recent years, the advent of better tools and more comprehensive databases of genetic variations has led to profound discoveries of genes and pathways underlying male infertility. This chapter reviews these technologies and the discoveries they have led to and sets the scene for the transformation of infertile patient care in the era of next-generation sequencing.
Article
Hypericum perforatum L. is a widely known medicinal herb used mostly as a remedy for depression because it contains high levels of naphthodianthrones, phloroglucinols, alkaloids, and some other secondary metabolites. Quantitative real-time PCR (qRT-PCR) is an optimized method for the efficient and reliable quantification of gene expression. In general, reference genes are used in qRT-PCR analysis because of their known or suspected housekeeping roles. However, their expression level cannot be assumed to remain stable under all possible experimental conditions. Thus, the identification of high-quality reference genes is essential for the interpretation of qRT-PCR data. In this study, we investigated the expression of 14 candidate genes, including nine housekeeping genes (HKGs) (ACT2, ACT3, ACT7, CYP1, EF1-α, GAPDH, TUB-α, TUB-β, and UBC2) and five potential candidate genes (GSA, PKS1, PP2A, RPL13, and SAND). Three programs (GeNorm, NormFinder, and BestKeeper) were applied to evaluate the gene expression stability across four different plant tissues, four developmental stages and a set of abiotic stress and hormonal treatments. Integrating all of the algorithms and evaluations revealed that ACT2 and TUB-β were the most stable combination in the developmental-stage samples and across all of the experimental samples. ACT2, TUB-β, and EF1-α were identified as the three most applicable reference genes in different tissues and stress-treated samples. The majority of the conventional HKGs performed better than the potential reference genes. The obtained results will aid in improving the credibility of the standardization and quantification of transcription levels in future expression studies on H. perforatum.
Preprint
Hypericum perforatum is a widely known medicinal herb used mostly as a remedy for depression because of its abundant secondary metabolites. Quantitative real-time PCR (qRT-PCR) is an optimized method for the efficient and reliable quantification of gene expression. In general, reference genes are used in qRT-PCR analysis because of their known or suspected housekeeping roles. However, their expression level cannot be assumed to remain stable under all possible experimental conditions. Thus, the identification of high-quality reference genes is essential for the interpretation of qRT-PCR data. In this study, we investigated the expression of fourteen candidate genes, including nine housekeeping genes and five potential candidate genes. Additionally, the HpHYP1 gene, belonging to the PR-10 family associated with stress control, was used for validation of the candidate reference genes. Three programs were applied to evaluate the gene expression stability across four different plant tissues, three developmental stages and a set of abiotic stress and hormonal treatments. The candidate genes showed a wide range of Ct values in all samples, indicating that they are differentially expressed. Integrating all of the algorithms and evaluations, ACT2 and TUB-β were the most stable combination overall and for the developmental-stage samples. Moreover, ACT2 and EF1-α were considered to be the two most applicable reference genes for different tissues and for stress samples. The majority of the conventional housekeeping genes performed better than the potential reference genes. The obtained results will contribute to improving the credibility of standardization and quantification of transcription levels in future expression research on H. perforatum.
Article
The properties of the human Y chromosome, namely male specificity, haploidy and escape from crossing over, make it an unusual component of the genome, and have led to its genetic variation becoming a key part of studies of human evolution, population history, genealogy, forensics and male medical genetics. Next-generation sequencing (NGS) technologies have driven recent progress in these areas. In particular, NGS has yielded direct estimates of mutation rates, and an unbiased and calibrated molecular phylogeny that has unprecedented detail. Moreover, the availability of direct-to-consumer NGS services is fuelling a rise of 'citizen scientists', whose interest in resequencing their own Y chromosomes is generating a wealth of new data.
Article
By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.
Article
Copy number variation (CNV) of DNA sequences is functionally significant but has yet to be fully ascertained. We have constructed a first-generation CNV map of the human genome through the study of 270 individuals from four populations with ancestry in Europe, Africa or Asia (the HapMap collection). DNA from these individuals was screened for CNV using two complementary technologies: single-nucleotide polymorphism (SNP) genotyping arrays, and clone-based comparative genomic hybridization. A total of 1,447 copy number variable regions (CNVRs), which can encompass overlapping or adjacent gains or losses, covering 360 megabases (12% of the genome) were identified in these populations. These CNVRs contained hundreds of genes, disease loci, functional elements and segmental duplications. Notably, the CNVRs encompassed more nucleotide content per genome than SNPs, underscoring the importance of CNV in genetic diversity and evolution. The data obtained delineate linkage disequilibrium patterns for many CNVs, and reveal marked variation in copy number among populations. We also demonstrate the utility of this resource for genetic disease studies.
Book
It is now abundantly clear that architectural features of the human genome can lead to DNA rearrangements that cause both disease and behavioral traits. In Genomic Disorders: The Genomic Basis of Disease, distinguished experts and pioneers in the field of genomics and genome rearrangements summarize and synthesize the tremendous amount of data now available in the postgenomic era on the structural features, architecture, and evolution of the human genome. The authors demonstrate how such architectural features may be important to evolution and explaining the susceptibility to those DNA rearrangements associated with disease. Technologies to assay for such structural variation of the human genome and model genomic disorders in mice are also presented. Two appendices detail the genomic disorders, providing genomic features at the locus undergoing rearrangement, their clinical features, and frequency of detection. Comprehensive and clinically relevant, Genomic Disorders: The Genomic Basis of Disease offers genome and clinical genetics researchers not only an up-to-date survey of genome architecture, but also details those rearrangements that can be the underlying cause or basis of many human traits and disorders.
Article
Structural variation of the genome is an important aspect in our understanding of human disease but has been difficult to systematically identify and genotype. Human genomes are particularly prone to structural variation when compared to other mammalian genomes due to idiosyncrasies in our genomic architectures. The recent application of massively parallel sequencing methods has led to an exponential increase in the discovery of smaller structural variation events. Significant global discovery biases remain, but the integration of experimental and computational approaches is proving fruitful for accurate characterisation of the copy, content, and structure of variable regions. I will present methods based on read-depth, split-read, and paired-end sequence approaches from both whole genome and exome sequence datasets and compare these to the utility of standard microarray approaches applied in the clinic. Next-generation sequencing platforms can assay copy and content of genes previously inaccessible by hybridisation-based methods but do not yet completely resolve the full complexity of structural variation.
Book
This book contains 23 chapters. Some of the chapter titles are: Genes of the H-Y Antigen System and Their Expression in Mammals, The Human Y Chromatin: An Overview, DNA Sequences and Analysis of the Human Y Chromosome, Staining and Banding of Characteristics of the Human Y Chromosome, Evolution of Sex Chromosomes in Mammals, The "Y" Chromosome in the Female Phenotype, and Behavior of Y Chromatin in the Neonatal Period.