Development and application of a 6.5 million feature Affymetrix GeneChip® for massively parallel discovery of single position polymorphisms in lettuce (Lactuca spp.)
BMC Genomics 2012, 13:185 doi:10.1186/1471-2164-13-185
ISSN: 1471-2164
Article type: Methodology article
Submission date: 14 November 2011
Acceptance date: 14 May 2012
Publication date: 14 May 2012
Article URL: http://www.biomedcentral.com/1471-2164/13/185
© 2012 Stoffel et al. ; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Kevin Stoffel 1,† (kmstoffel@ucdavis.edu)
Hans van Leeuwen 1,6,† (hans.vanleeuwen@bayer.com)
Alexander Kozik 2,† (akozik@atgc.org)
David Caldwell 1,7,† (David.g.caldwell@monsanto.com)
Hamid Ashrafi 1,† (ashrafi@ucdavis.edu)
Xinping Cui 4,5 (xinping.cui@ucr.edu)
Xiaoping Tan 1 (xpitan@gmail.com)
Theresa Hill 1 (tahill@ucdavis.edu)
Sebastian Reyes-Chin-Wo 1 (sreyeschinwo@ucdavis.edu)
Maria-Jose Truco 2 (mjtruco@ucdavis.edu)
Richard W Michelmore 2,3,* (rwmichelmore@ucdavis.edu)
Allen Van Deynze 1,3,* (avandeynze@ucdavis.edu)
* Corresponding author
1 Seed Biotechnology Center, University of California, Davis, CA 95616, USA
2 Genome Center, University of California, Davis, One Shields Ave., Davis, CA
95616, USA
3 Department of Plant Sciences, University of California, Davis, CA 95616, USA
4 Department of Statistics, University of California, Riverside, CA 92521, USA
5 Center for Plant Cell Biology and Institute for Integrative Genome Biology,
University of California, Riverside, CA 92521, USA
6 Nunhems Netherlands B.V., P.O. Box 4005, 6080, AA Haelen, The Netherlands
7 Monsanto, Molecular Breeding Technology, 700 Chesterfield Pkwy W, BB34,
Chesterfield, MO 63017, USA
†Major contributions were made by each of these authors at different stages of the
project.
Abstract
Background
High-resolution genetic maps are needed in many crops to help characterize the genetic
diversity that determines agriculturally important traits. Hybridization to microarrays to
detect single feature polymorphisms is a powerful technique for marker discovery and
genotyping because of its highly parallel nature. However, microarrays designed for gene
expression analysis rarely provide sufficient gene coverage for optimal detection of
nucleotide polymorphisms, which limits utility in species with low rates of polymorphism
such as lettuce (Lactuca sativa).
Results
We developed a 6.5 million feature Affymetrix GeneChip® for efficient polymorphism
discovery and genotyping, as well as for analysis of gene expression in lettuce. Probes on the
microarray were designed from 26,809 unigenes from cultivated lettuce and an additional
8,819 unigenes from four related species (L. serriola, L. saligna, L. virosa and L. perennis).
Where possible, probes were tiled with a 2 bp stagger, alternating on each DNA strand,
providing an average of 187 probes covering approximately 600 bp for each of over 35,000
unigenes and resulting in up to 13-fold redundancy in coverage per nucleotide. We developed
protocols for hybridization of genomic DNA to the GeneChip® and refined custom
algorithms that utilized coverage from multiple, high quality probes to detect single position
polymorphisms in 2 bp sliding windows across each unigene. This allowed us to detect
greater than 18,000 polymorphisms between the parental lines of our core mapping
population, as well as numerous polymorphisms between cultivated lettuce and wild species
in the lettuce genepool. Using marker data from our diversity panel comprising 52
accessions from the five species listed above, we were able to separate accessions by species
using both phylogenetic and principal component analyses. Additionally, we estimated the
diversity between different types of cultivated lettuce and distinguished morphological types.
Conclusion
By hybridizing genomic DNA to a custom oligonucleotide array designed for maximum gene
coverage, we were able to identify polymorphisms using two approaches for pair-wise
comparisons, as well as a highly parallel method that compared all 52 genotypes
simultaneously.
Background
Various types of microarrays have been used extensively for gene expression studies and,
more recently, for genotyping and marker discovery [1-5]. Affymetrix microarrays in
particular offer a massively-parallel approach to genotyping. The basis of identifying
polymorphisms, termed single feature polymorphisms (SFPs), is differential hybridization of
template RNA or DNA onto 25 bp oligonucleotide probes on the array due to the presence of
single nucleotide polymorphisms (SNPs) or small insertion/deletions (InDels). Using this
approach, thousands of genes can be queried and simultaneously analyzed allowing whole
genome approaches to mapping genes and quantitative trait loci (QTL) discovery [6], as well
as determining linkage disequilibrium (LD) [7] and population structure [8,9]. When arrays
represent coding sequences, they can also be used for genotyping closely related species [2].
Although Affymetrix expression arrays can be used for genotyping, their drawback is that all
except the most recently produced microarrays have been designed with approximately 11
perfect match probes per unigene giving only limited coverage of each gene. Gresham et al.
[10] showed that an array designed with 25 bp oligos in a 5 bp overlapping tile format had
greater sensitivity (ability to detect true SFPs) in yeast and increased specificity (reduced rate
of false positives) in calling SFPs. This overlapping tile design offers technical
reproducibility and extensive genome coverage if the number of features on the microarray is
sufficient.
Genotyping by microarray hybridization has proven to be challenging in species with
complex genomes. Microarrays have been successfully used to detect SFPs in small genomes,
for example the 13.5 Mb genome of yeast [10], the 145 Mb genome of Arabidopsis [1,8] and,
more recently, the 389 Mb genome of rice [3]. Although different algorithms were used for
each of these three species, 87% of the SFPs in yeast [10] and 75% of those in rice and
Arabidopsis [1,3] were independently validated. To identify polymorphisms in the barley
genome, complexity was circumvented by using RNA to hybridize to the microarray and
67% [4] to 80% [11] validation of SFPs was achieved. When DNA from barley (5,200 Mb
genome) was hybridized directly to the Barley1 GeneChip® a significant overlap between
SFPs identified using genomic DNA and those identified and validated using RNA was
reported [4]. The increased efficiency reported by Rostoks et al. from using RNA is,
however, complicated by incomplete and variable transcriptome representation due to tissue-
and environment-specific gene expression and false SFP discovery due to alternative splicing
or adenylation [4] associated with sampling RNA versus DNA.
Several types of analyses have been implemented for SFP detection from microarray data.
Generally, the data have been processed using expression analysis software to correct for
background signal variation using Robust Multi-array Analysis (RMA) [12] followed by
correction for overall signal variation by quantile normalization across arrays [13]. To call
SFPs, a modified T-test [1], Robustified Projection Pursuit (RPP) [11] and SFP deviation [5]
have been developed to first estimate the normalized hybridization of a reference set of
probes and then test with appropriate statistics or ratios for differential hybridization of
specific probes across genotypes. In addition, a maximum likelihood procedure using the
source of sequence on the chip as a reference was developed by Gresham et al. [10] to take
advantage of overlapping tile data to call SFPs. As each microarray and experiment design
tends to be different, new methods for analysis have been developed in attempts to gain
greater specificity and sensitivity.
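The quantile normalization step mentioned above can be illustrated with a minimal sketch. It assumes every array carries the same probes in the same order and ignores tie handling; the function name `quantile_normalize` and the toy intensities are illustrative and are not taken from the software used in this study.

```python
def quantile_normalize(arrays):
    """Force every array to share one intensity distribution.

    arrays: list of equal-length lists, one list of probe intensities per chip.
    Simplified sketch: tied values are not averaged.
    """
    n_probes = len(arrays[0])
    n_arrays = len(arrays)

    # Reference distribution: mean of the k-th smallest value across chips.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[k] for s in sorted_arrays) / n_arrays for k in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity on this chip.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by the reference quantile
        normalized.append(out)
    return normalized


chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized_chips = quantile_normalize(chips)
```

After normalization each chip contains exactly the same set of values, so remaining differences at a probe reflect rank changes rather than overall signal differences between hybridizations.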
Cultivated lettuce, L. sativa, has substantial genetic and genomic resources including
approximately 76,000 ESTs and another 20,000 to 50,000 ESTs in each of four related
species (http://compgenomics.ucdavis.edu/). Furthermore, several mapping populations have
been developed including a reference mapping population of 214 F7 recombinant inbred lines
(RILs) between L. sativa cv. Salinas and L. serriola acc. US96UC23 ([14]; RW Michelmore
et al., unpublished). This population segregates for multiple agronomic, disease resistance
and quality traits. It has approximately 1,500 mapped DNA markers comprised of
approximately 700 mapped unigenes with the remainder amplified fragment length
polymorphisms (AFLPs), restriction fragment length polymorphisms (RFLPs) and simple
sequence repeats (SSRs) [14]. The large number of available sequences and recombinant
inbred lines provide ideal resources for further marker discovery and high density mapping.
Considerable genetic resources are also available within germplasm collections of L. sativa.
Furthermore, several Lactuca species have variable cross-compatibility with L. sativa [15]
and represent a diverse genetic resource for investigations of novel alleles and population
structure.
In this paper, we describe the development of a microarray designed to provide extensive
gene coverage and maximize detection of SFPs for marker discovery and genotyping in
lettuce. We analyzed the parents of the reference L. sativa x L. serriola mapping population
to demonstrate that DNA representing complex genomes (2,639 Mb) [16] can be effectively
hybridized onto microarrays. Parameters affecting DNA hybridization and accurate detection
of polymorphism were optimized. Algorithms from West et al. [5] and Borevitz et al. [1]
were modified to take advantage of the overlapping tile design to detect polymorphisms. The
use of the multiple probes covering a single position led to the identification of single
position polymorphisms (SPPs). Additionally, we assessed SPPs in a diverse panel of
Lactuca species concentrating on the cultivated L. sativa.
Results
Identification of a non-redundant consolidated unigene set from Lactuca spp.
for design of an oligonucleotide array
A consolidated Lactuca unigene set (CLUS) was created using stringent CAP3 conditions
[17]. This set represented all the currently available genes in March 2006 that had been
identified in cDNA libraries prepared from L. sativa cv. Salinas, plus additional genes that
were not present in those libraries from four other related species of Lactuca (see Methods).
The selection of unigenes was performed reiteratively in order of increasing genetic
divergence from L. sativa; first, unigenes from L. serriola, US96UC23, were analyzed by
TBLASTX followed by unigenes from L. saligna, L. virosa and lastly L. perennis. The final
set comprised 26,809 unigenes from L. sativa plus 4,065, 1,391, 1,686 and 1,686 from L.
serriola, L. saligna, L. virosa, and L. perennis respectively, totaling 8,828 unigenes from the
four other Lactuca species (Table 1). This resulted in a final CLUS of 35,637 unigenes (Table
1). An additional 151 unique L. sativa genomic sequences possessing a TBLASTX hit (< 1e-10)
to the Arabidopsis genome and characterized mRNA sequences were then mined from
GenBank and added, resulting in a final list of 35,788 Lactuca sequences that were submitted
to Affymetrix for probe design.
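The reiterative selection described above can be sketched as follows. This is only an illustration under stated assumptions: the `has_hit` callable stands in for a TBLASTX comparison, and the toy identifiers and the prefix-based similarity rule are hypothetical, not part of the pipeline used here.

```python
def build_consolidated_set(species_unigenes, has_hit):
    """Add unigenes species by species, keeping only those without a match.

    species_unigenes: list of (species, [unigene ids]) ordered by increasing
        divergence from the reference species.
    has_hit(query, subject) -> bool: placeholder for a sequence similarity
        search such as TBLASTX (hypothetical interface).
    """
    consolidated = []  # list of (species, unigene)
    for species, unigenes in species_unigenes:
        # Compare only against unigenes accepted from earlier species.
        existing = [u for _, u in consolidated]
        novel = [u for u in unigenes if not any(has_hit(u, e) for e in existing)]
        consolidated.extend((species, u) for u in novel)
    return consolidated


def same_family(query, subject):
    """Toy similarity rule: identifiers sharing a prefix count as a hit."""
    return query.split("_")[0] == subject.split("_")[0]


toy_input = [
    ("L. sativa", ["kinase_a", "mads_a"]),
    ("L. serriola", ["kinase_b", "nbs_a"]),
    ("L. saligna", ["nbs_b", "wrky_a"]),
]
toy_set = build_consolidated_set(toy_input, same_family)
```

The counts reported in the text are consistent with this additive scheme: 26,809 + 4,065 + 1,391 + 1,686 + 1,686 = 35,637 unigenes, and adding the 151 genomic and mRNA sequences gives 35,788.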
Table 1 The number of ESTs and Unigenes for each species before and after filtering
                          L. sativa  L. serriola  L. saligna  L. perennis  L. virosa  Total
ESTs                      76043      52034        28851       28066        28335
Unigenes                  29417      22327        11990       12661        12733
Unigenes after filtering  26809      4065         1391        1686         1686       35637
Design of microarray with overlapping probes
In collaboration with Affymetrix, probes from both sense and anti-sense strands were
selected to create a 2 bp overlapping tiling path (See Additional file 1: Figure S1). The
resulting 11.4 million candidate probes designed from the 35,788 submitted sequences were
triaged down to ~6.5 million that could be accommodated on the chip through a series of
steps based on: 1) probes were retained only if their Affymetrix probe quality score was
above 0.25, except for a select set of unigenes with putative polymorphisms, for which
probes with a quality score above 0.1 were retained; 2) probes matching mitochondrial or
chloroplast genomes were discarded; 3) probes that matched more than one target were
synthesized only once on the chip and computationally associated with each corresponding
unigene for analysis.
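The relationship between the 2 bp stagger and per-nucleotide redundancy can be checked with a short sketch. It assumes an ideal, gap-free tiling path of 25 bp probes over a 600 bp sequence, that is, before any of the triage steps listed above; the function names are illustrative.

```python
PROBE_LEN = 25  # probe length in bp, as stated in the text
STAGGER = 2     # offset between consecutive probe start positions (bp)


def tile_probes(seq_len, probe_len=PROBE_LEN, stagger=STAGGER):
    """Return (start, strand) pairs for an ideal tiling path."""
    probes = []
    for i, start in enumerate(range(0, seq_len - probe_len + 1, stagger)):
        strand = "+" if i % 2 == 0 else "-"  # alternate between strands
        probes.append((start, strand))
    return probes


def coverage_per_base(probes, seq_len, probe_len=PROBE_LEN):
    """Count how many probes overlap each base of the sequence."""
    depth = [0] * seq_len
    for start, _strand in probes:
        for pos in range(start, start + probe_len):
            depth[pos] += 1
    return depth


probes = tile_probes(600)
depth = coverage_per_base(probes, 600)
print(len(probes), max(depth))
```

Under these assumptions a 600 bp sequence holds 288 probes and no base is covered by more than 13 of them, which matches the "up to 13 fold" redundancy quoted for the design. The reported average of 187 probes per unigene is lower than 288, as would be expected after probe filtering.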
In addition to lettuce probes, background and standard Affymetrix control probes [18] were
included on the microarray. In order to determine background hybridization signal, 13,567
anti-genomic (AG) background probes were synthesized on the microarray, with an average
of 1,355 probes representing each G/C bin (probes with the same number of guanines and/or
cytosines in the 25 bp probe). These AG probes represent sequences that had no BLAST hits
in GenBank at the time of chip design. The use of AG probes reduced the number of probes
included on the chip for background correction by 99% compared to the use of mismatch
probes (allocated to half of the array positions in previous designs), without substantially
compromising the ability to perform accurate background correction [19]. From 8,000
visually interrogated EST contigs, ~2,000 putatively polymorphic regions (50 to 100 bp)
were represented from 1,184 contigs.
In total, 6,410,923 lettuce probes representing 35,788 unigenes were synthesized on the
microarray. The average and median number of probes representing a unigene were both 187,
with ~80% of the unigenes being represented by 50 to 275 probes per sequence (See
Additional file 2: Figure S2). Each unigene had an average of 591 bps and median of 603 bps
covered by probes; the average number of contiguous stretches of overlapping probes per
unigene was 3.1. Due to the selection parameters described above, a contiguous overlapping tile
across the unigenes was not possible. Consequently, the average and median lengths of the
contiguous stretches of overlapping probes within probe sets were 190 and 120 bps,
respectively. Regions of high or low G/C content were sparsely covered by probes (See
Additional file 3: Figure S3) likely due to the removal of probes with low Affymetrix probe
quality scores. The total number of probes on the array was 6,482,479.
Preparation and hybridization of genomic DNA to the lettuce GeneChip
Large amounts of high quality genomic DNA were a critical prerequisite for robust
hybridization signals (see below). To meet these criteria we compared fragmented genomic
DNA to amplified DNA using whole genome amplification (WGA) from Sigma (see
Methods). Analysis of scatter plots comparing hybridization intensities resulting from
amplified and unamplified genomic DNA revealed that WGA resulted in marked biases (See
Additional file 4: Figure S4). High quality DNA was extracted for each of L. sativa cv.
Salinas and L. serriola acc. US96UC23. Each of these extractions was hybridized twice,
providing two technical replicates, and hybridization intensities were evaluated using scatter
plots of 600,000 random probes (Figure 1) or probes belonging to a set of ultra-conserved
sequences (http://compgenomics.ucdavis.edu/compositae_reference.php). Comparisons
between replicates showed a correlation of nearly 97%, while comparisons between species
showed a correlation of approximately 93%, indicating infrequent hybridization differences,
as expected given the low rate of polymorphism between the two species.
Figure 1 Pair-wise scatter plots of 600,000 probes from SAL and SER. Pair-wise scatter
plots of RMA background corrected hybridization values for 600,000 random probes for two
technical replicates of L. sativa cv. Salinas (SAL_TG_01, SAL_TG_02) and L. serriola acc.
US96UC23 (SER_TG_01, SER_TG_02). Comparisons across species show larger deviations
than those between replicates
We investigated several methods for fragmentation and labeling of genomic DNA including
the Bioprime kit (Invitrogen, Carlsbad, CA, USA). Although incorporation of biotinylated
dCTP during amplification of target sequence via random priming resulted in elevated
hybridization intensities compared to end-labeling of fragmented DNA, hybridization
intensity of both lettuce and background probes increased with GC content dramatically (See
Additional file 5: Figure S5), resulting in a decreased number of informative probes (probes
above background) with high GC content (Figure 2). We concluded that direct fragmentation
of genomic DNA by digestion with DNase I and end-labeling was the most cost and time
effective, least biased (due to the lack of amplification and selection steps), and most
informative method for preparing genomic DNA for hybridization to microarrays.
Figure 2 Frequency of probes with hybridization signals greater than background per
GC bin. Frequency (y-axis) of lettuce probes on the GeneChip®, per GC content bin
(x-axis), with a hybridization signal higher than the 90th percentile of the hybridization
signal of the anti-genomic probes with corresponding GC content. Results are shown for L.
sativa cv. Salinas hybridizations with 7.5, 30 or 39 µg of gDNA, L. sativa cv. Salinas
hybridizations with genomic DNA fragmented and labeled using BioPrime (Invitrogen,
Carlsbad, CA, USA), and L. serriola acc. US96UC23 hybridizations with 30 µg of gDNA
To determine the quantity of genomic DNA required to achieve adequate hybridization, three
different quantities of genomic DNA, 7.5 (the amount typically used in cDNA
hybridizations), 30, and 39 µg, from L. sativa cv. Salinas were sheared with DNase I and end-
labeled. The number of lettuce-specific probes with fluorescent intensities above the 90th
percentile intensity for the AG control probes in the same GC bin was determined for each
sample (e.g. Figure 2). The 30 µg sample of fragmented DNA yielded the highest percent of
probes above the 90th percentile (62%) in these GC bins when compared to the 7.5 (45%) and
39 µg (60%) DNA samples (Figure 2). DNase I fragmentation conditions were consequently
optimized for this amount to consistently provide fragments predominantly between 50 and
250 bp in length (Figure 3). Thirty micrograms was selected as the standard quantity of
genomic DNA for subsequent experiments.
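The criterion used above to count informative probes can be sketched as follows. The sketch assumes a nearest-rank definition of the percentile and represents each probe as a (sequence, intensity) pair; both choices, and all function names, are illustrative rather than a description of the scripts used in this study.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def gc_count(seq):
    """Number of G or C bases in a probe sequence (defines the GC bin)."""
    return sum(base in "GC" for base in seq.upper())


def background_thresholds(ag_probes, pct=90):
    """Per-GC-bin intensity threshold from anti-genomic (AG) probes."""
    bins = defaultdict(list)
    for seq, intensity in ag_probes:
        bins[gc_count(seq)].append(intensity)
    return {gc: percentile(vals, pct) for gc, vals in bins.items()}


def informative_fraction(probes, thresholds):
    """Fraction of probes brighter than the AG threshold of their GC bin.

    Probes whose GC bin has no AG probes are skipped.
    """
    above = total = 0
    for seq, intensity in probes:
        gc = gc_count(seq)
        if gc not in thresholds:
            continue
        total += 1
        above += intensity > thresholds[gc]
    return above / total if total else 0.0


# Toy data: ten AG probes in GC bin 2 with intensities 1..10.
ag = [("GC" + "A" * 23, float(v)) for v in range(1, 11)]
lettuce = [("CG" + "T" * 23, 10.0), ("GG" + "T" * 23, 5.0)]
cutoffs = background_thresholds(ag)
fraction = informative_fraction(lettuce, cutoffs)
```

Comparing this fraction across DNA quantities is the kind of calculation summarized in Figure 2, where the 30 µg sample gave the highest proportion of probes above background.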
Figure 3 Agarose gel electrophoresis of genomic DNA extracted from lettuce. Two µl
from a 30 µg fragmentation of lettuce genomic DNA with DNase I was separated on 2%
agarose gel and visualized by ethidium bromide staining. Lengths in bp of the O’GeneRuler™
50 bp DNA ladder (Fermentas, Glen Burnie, MD, USA) are shown. Samples were accepted
for labeling provided that the majority of fragments were within 50 to 250 bp
Improvement of the algorithm for detecting polymorphisms
The algorithms developed previously by West et al. [5] were modified to take advantage of
the tiling design of the lettuce GeneChip®. The new algorithm calculated the SFPdev value
for each of the probes that overlap a given position and then performed a sliding window
analysis to calculate an average weighted SFPdev value for each 2 bp position across the
unigene covered by at least one probe. This strategy markedly reduced background noise
relative to individual probe measurements (Figure 4). Additionally, removal of probes below
the 90th percentile of AG probes in the same GC bin increased confidence in calls while
identifying polymorphisms missed by inclusion of poorly performing probes (See Additional
file 6: Figure S6). An empirically determined weighting factor based on sensitivity of bases
within an oligo to sequence polymorphisms (Figure 5) was included in our custom algorithm
to give more significance to the 16 most centrally located bases in a probe. This weighting
factor allowed us to retain probes nearest the edges of tiling blocks where polymorphism
could be found, but reduced the emphasis given to SPPs detected by those probes rather than
disregarding them completely, allowing users to potentially filter out or retain them. As our
algorithm uses data from multiple informative features that interrogated each 2 base pair
position rather than a single feature, we designated the polymorphisms detected as single
position polymorphisms (SPPs).
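The sliding-window averaging described above can be sketched as follows. The per-probe SFPdev values are taken as given inputs, and the weighting is a hypothetical step function (full weight for 16 central bases, half weight for the flanks) standing in for the empirically determined weighting factor, which is not reproduced here.

```python
PROBE_LEN = 25


def position_weight(offset, edge_weight=0.5):
    """Hypothetical weight for a base at `offset` (0-based) within a probe.

    Offsets 4..19 are treated as the 16 central bases and get weight 1.0;
    the remaining flanking bases get `edge_weight`.
    """
    return 1.0 if 4 <= offset < 20 else edge_weight


def spp_profile(probes, seq_len, step=2):
    """Weighted mean SFPdev at every `step` bp position along a unigene.

    probes: list of (start, sfpdev, informative) tuples, where `informative`
        marks probes above background.
    Returns {position: weighted mean} for positions covered by at least one
    informative probe.
    """
    profile = {}
    for pos in range(0, seq_len, step):
        weighted_sum = weight_total = 0.0
        for start, sfpdev, informative in probes:
            offset = pos - start
            if informative and 0 <= offset < PROBE_LEN:
                w = position_weight(offset)
                weighted_sum += w * sfpdev
                weight_total += w
        if weight_total > 0:
            profile[pos] = weighted_sum / weight_total
    return profile


toy_probes = [(0, 1.0, True), (2, 0.0, True), (4, 5.0, False)]
toy_profile = spp_profile(toy_probes, seq_len=30)
```

Because each position pools several overlapping probes, a single aberrant probe shifts the position score less than it would shift a per-probe call, which is the noise reduction illustrated in Figure 4.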
Figure 4 Graphical representation of SFP vs. SPP calls along a contig. The x-axis shows
the position of a probe along a contig. The y-axis shows the difference in the average
weighted SFPdev or SPPdev values between the two genotypes. The SPP analysis detected
only the true SNPs while the SFP analysis indicated multiple false positives
Figure 5 A graphical representation of the equation used to determine a weighting
factor at each position. The graph shows the mean hybridization difference between L.
sativa cv. Salinas and L. serriola acc. US96UC23 (y-axis) plotted against the 2 bp
interrogation position relative to the central position of the probe being measured (x-axis).
The equation below was used for calculating the weighting factor used in the two-genotype
comparison algorithms
We also modified the algorithm described by Borevitz et al. [1] for detection of SFPs to
exploit the tiling array design. This modified SFP algorithm (MSA) interrogated all weighted
probes above background at every 2 bp position, and calculated a dstat similar to that
described by Borevitz et al. [1]. This, however, was done for each position rather than each
probe. A false discovery rate was calculated for each threshold cutoff value for both pairwise
analyses using permutation analysis as described by Borevitz et al. [1].
Detection of SPPs between L. sativa cv. Salinas and L. serriola acc. US96UC23
Genomic DNA from L. sativa cv. Salinas (SAL) and wild L. serriola acc. US96UC23 (SER)
were hybridized to the GeneChip® in three technical replicates. The Affymetrix .CEL file
data were background-corrected by RMA and quantile-normalized across all chips. The data
were analyzed for SPPs using both our modified SFPdev and MSA algorithms. SPPs were
filtered to require a minimum 4 bp range and two informative probes covering the
interrogated position to increase confidence in the SPPs called. SPPs were defined by the
range of positions (bp) that met a selected FDR value of 0.1 and a Delta value of 0.2 for the
SFPdev and MSA methods, respectively. Furthermore, only the SFPdev values with a ratio
(SAL/ SER) less than 1 were considered, as values above one had an actual FDR of 79%.
With these requirements the SFPdev method identified 40,462 SPPs in 19,345 contigs; while
40,960 SPPs in 18,290 contigs were identified with the MSA method. The coincidence of
reported SPPs between the two methods showed that 73.6% of SPPs detected by the SFPdev
method were found by MSA, and 81.1% of SPPs reported by MSA were found by SFPdev.
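The filters applied to SPP calls above can be expressed as a small predicate. The record layout and field names below are illustrative assumptions, as is the choice to measure the SPP range inclusively.

```python
from dataclasses import dataclass


@dataclass
class SPPCall:
    contig: str
    start: int      # first position of the SPP range (bp)
    end: int        # last position of the SPP range (bp), inclusive
    n_probes: int   # informative probes covering the interrogated position
    ratio: float    # SFPdev ratio (SAL/SER)


def passes_filters(call, min_width=4, min_probes=2, max_ratio=1.0):
    """Apply the width, probe-count and ratio filters described in the text."""
    width = call.end - call.start + 1
    return (
        width >= min_width
        and call.n_probes >= min_probes
        and call.ratio < max_ratio
    )


calls = [
    SPPCall("contig_1", 100, 105, 3, 0.6),  # passes all filters
    SPPCall("contig_1", 200, 201, 3, 0.6),  # too narrow
    SPPCall("contig_2", 50, 57, 1, 0.6),    # only one informative probe
    SPPCall("contig_3", 10, 17, 4, 1.4),    # ratio above 1
]
kept = [c for c in calls if passes_filters(c)]
```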
To provide an independent assessment of polymorphisms, Illumina mRNA-seq reads (IGA
set) from L. sativa cv. Salinas and L. serriola acc. US96UC23 were aligned to the unigenes
used for chip design to identify a set of SNPs. Identified SPPs were compared to this set as
well as SPPs identified and mapped in the reference RIL population [20] to validate the true
and false positive rate of our detection methods. Further, if the SPP regions were identified as
being duplicated in the genome they were removed as they were falsely called one third of
the time (Methods). Using the modified SFPdev method with an estimated FDR of 0.1, we
identified 23,835 SPPs with an actual FDR of 24.35%. Our MSA method at a Delta value of 0.2,
with an estimated FDR of 36.45% and filtered under the same conditions, called 23,075
SPPs with an actual FDR of 14.46%. Table 2 shows the relative numbers of SPPs identified
and their corresponding FDRs for each method.
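One way to compute an observed FDR against an independent SNP set is sketched below. It assumes that an SPP counts as validated when its range contains at least one known SNP, and that every called SPP lies in a region where SNPs could have been detected; both are simplifying assumptions and may differ from the validation procedure described in Methods.

```python
def observed_fdr(spp_ranges, known_snps):
    """Fraction of SPP calls not supported by any known SNP.

    spp_ranges: list of (contig, start, end) with inclusive coordinates.
    known_snps: dict mapping contig -> set of SNP positions.
    """
    if not spp_ranges:
        return 0.0
    false_positives = 0
    for contig, start, end in spp_ranges:
        snps = known_snps.get(contig, set())
        if not any(start <= pos <= end for pos in snps):
            false_positives += 1
    return false_positives / len(spp_ranges)


toy_spps = [
    ("contig_1", 10, 14),  # contains SNP at 12
    ("contig_1", 50, 54),  # no SNP in range
    ("contig_2", 5, 9),    # contains SNP at 9
    ("contig_3", 1, 5),    # contig has no known SNPs
]
toy_snps = {"contig_1": {12}, "contig_2": {9}}
toy_fdr = observed_fdr(toy_spps, toy_snps)
```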
Table 2 Observed FDR for each cut-off value for the three SPP prediction methods
a. MSA          Observed  Permuted
Delta 0.2       14.50%    36.45%
Delta 0.4       13.80%    17.54%
Delta 0.6       11.00%    10.85%
Delta 0.8       10.20%    8.34%
Delta 1.0       9.80%     7.27%
Delta 1.2       9.10%     6.80%
Delta 1.4       8.50%     6.58%
Delta 1.6       7.90%     6.45%
b. SFPdev       Observed  Permuted
FDR 0.10        24.30%    10.00%
FDR 0.05        21.40%    5.00%
FDR 0.01        20.10%    1.00%
c. DP Analysis  Observed  Permuted
SFPDev 1.2      2.80%     N/A
SFPDev 1.5      1.70%     N/A
SFPDev 2.0      1.10%     N/A
Table 2 shows the observed and permuted FDRs for each of the three SPP detection methods.
a) Delta values from 0.2 to 1.6 for MSA analysis show decreasing FDRs as stringency
increases. At Delta values lower than 0.6 the observed FDRs are lower than those expected by
permutation analysis. b) Three cutoff values for SFPdev were reported from 10% to 1%.
Observed values did not change drastically as permuted stringencies increased. c) Observed
FDR values in DP analysis were low and did not drastically change as cutoff values increased
The modified SFPdev and MSA methods provide pair-wise comparisons but are too
computationally intensive to assess polymorphisms between many pairs of lines. Therefore, a
third analysis was performed using the RIL algorithm method described by Truco et al. [20].
All 2 bp windows were assessed for bimodal distributions in hybridization values from a
diverse panel of genotypes composed of 52 accessions from five Lactuca species including
the parents, Salinas and US96UC23, of the reference RIL mapping population. Selection of
SPPs that were identified as polymorphic between these two parental lines using this method
identified 18,237 SPPs. This set was filtered with the same criteria as described above and
included no inconsistent data points yielding an actual FDR of only 2.83%. A Venn diagram
of contigs with SPPs was created to show the coincidence of potentially polymorphic contigs
detected using each method (Figure 6). We included polymorphic contigs rather than SNPs
detected in this figure as the SPPs reported are ranges. These ranges often covered more than
one SNP rendering it impossible to determine which polymorphism was the contributor to the
detected SPP. Overall, the majority of contigs containing SPPs identified by each of the three
methods coincide (4,707). The two methods that have the largest overlap (9,897) were the
two pair-wise comparison methods (MSA and SFPdev).
Figure 6 Venn diagram showing the overlap of contigs containing SPPs among the
three SPP identification methods. A Venn diagram of the three SPP identification methods
shows the overlap of contigs containing SPPs between each method. The two pair-wise
comparison methods had the largest overlap of identified contigs. MSA – pairwise
comparison using a modified version of the algorithm described by Borevitz et al. [1].
SFPdev – pairwise comparison using a modified version of the algorithm described by West
et al. [5]. RIL - massively parallel approach looking at the distribution of SPP calls for all
individuals in a diversity panel
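The overlap counts summarized in Figure 6 amount to set operations on the contigs called by each method. The following is a minimal sketch of that bookkeeping, not the authors' script; the contig IDs and the `venn_regions` helper are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, FrozenSet, Set


def venn_regions(calls: Dict[str, Set[str]]) -> Dict[FrozenSet[str], int]:
    """Count contigs in each exclusive Venn region.

    Each region is keyed by the exact set of methods that called the contig.
    """
    regions: Dict[FrozenSet[str], int] = {}
    for contig in set().union(*calls.values()):
        methods = frozenset(name for name, contigs in calls.items() if contig in contigs)
        regions[methods] = regions.get(methods, 0) + 1
    return regions


# Hypothetical contig IDs, for illustration only (not data from the paper).
calls = {
    "MSA": {"c1", "c2", "c3", "c5"},
    "SFPdev": {"c1", "c2", "c4"},
    "RIL": {"c1", "c4", "c6"},
}

regions = venn_regions(calls)
# Contigs called by all three methods (the centre of the diagram).
shared_by_all = regions.get(frozenset(calls), 0)
# Total overlap of the two pair-wise methods, including contigs also called by RIL.
msa_sfpdev_overlap = len(calls["MSA"] & calls["SFPdev"])
```

Keying each region by the set of methods keeps the exclusive regions of the diagram distinct from the total pair-wise overlaps, which also include contigs shared by all three methods.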
Analysis of a Diversity Panel (DP)
The massively parallel genotyping of 52 lines (41 L. sativa, 5 L. serriola, 3 L. saligna, 2 L.
virosa, and 1 L. perennis, Table 3) using the modified RIL algorithm was employed to
investigate inter- and intra-specific polymorphisms. The SPPs were filtered to allow zero
missing and zero inconsistent data, coverage by a minimum of two probes at the interrogated
position, a minimum SPP width of 4 bp and an SFPdev ratio cutoff of 1.2. This resulted in 74,034
SPPs from 23,144 contigs. The SPPs detected in the diversity panel were further filtered to
exclude any SPPs with a duplicate pattern of markers across genotypes within the same
unigene, yielding 46,237 distinct haplotypes. We were able to identify 43,464 SPPs that
showed polymorphism between species but not within the five species in the diversity panel.
Most were private to L. perennis, followed by L. virosa and L. saligna.
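The filtering steps described above can be sketched as follows. This is an illustrative outline under stated assumptions rather than the authors' implementation: the `SPP` record, its field names and the call codes ("A"/"B" for the two alleles, "-" for missing, "X" for inconsistent) are hypothetical, and a duplicate pattern is taken to mean an identical vector of calls within the same contig.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SPP:
    contig: str
    start: int                # first base of the SPP range (inclusive)
    end: int                  # last base of the SPP range (inclusive)
    min_probes: int           # fewest probes covering any interrogated position
    sfpdev_ratio: float
    calls: Tuple[str, ...]    # one call per genotype: "A", "B", "-" or "X"


def passes_filters(spp: SPP, min_probes: int = 2, min_width: int = 4,
                   min_ratio: float = 1.2) -> bool:
    """Apply the thresholds quoted in the text to one SPP."""
    no_missing_or_inconsistent = all(c in ("A", "B") for c in spp.calls)
    width = spp.end - spp.start + 1
    return (no_missing_or_inconsistent
            and spp.min_probes >= min_probes
            and width >= min_width
            and spp.sfpdev_ratio >= min_ratio)


def distinct_haplotypes(spps: List[SPP]) -> List[SPP]:
    """Keep the first SPP for each (contig, call pattern) combination."""
    seen = set()
    kept = []
    for spp in spps:
        key = (spp.contig, spp.calls)
        if key not in seen:
            seen.add(key)
            kept.append(spp)
    return kept


# Hypothetical records, for illustration only.
spps = [
    SPP("contig1", 10, 15, 3, 1.4, ("A", "A", "B")),  # passes
    SPP("contig1", 40, 45, 2, 1.3, ("A", "A", "B")),  # passes, duplicate pattern
    SPP("contig1", 70, 71, 4, 2.0, ("A", "B", "B")),  # too narrow
    SPP("contig2", 5, 12, 2, 1.2, ("A", "A", "B")),   # passes, different contig
    SPP("contig2", 30, 37, 2, 1.5, ("A", "-", "B")),  # missing data
    SPP("contig2", 50, 57, 1, 1.5, ("A", "B", "B")),  # covered by one probe only
]
filtered = [s for s in spps if passes_filters(s)]
haplotypes = distinct_haplotypes(filtered)
```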
Table 3 Individuals in the diversity panel and their relative species and horticultural class
within L. sativa
Panel ID     Species        Horticultural class
BSP001       L. sativa      Butterhead
BSP002       L. sativa      Butterhead
BSP003       L. sativa      Butterhead
BSP004       L. sativa      Butterhead
BSP005       L. sativa      Butterhead
DIANA        L. sativa      Butterhead
OLOF         L. sativa      Butterhead
UCD005       L. sativa      Butterhead
UCD007       L. sativa      Butterhead
BSP006       L. sativa      Cos
BSP007       L. sativa      Cos
BSP008       L. sativa      Cos
BSP009       L. sativa      Cos
BSP010       L. sativa      Cos
UCD002       L. sativa      Cos
BSP011       L. sativa      Batavia
BSP012       L. sativa      Batavia
BSP013       L. sativa      Batavia
BSP014       L. sativa      Batavia
BSP015       L. sativa      Batavia
BSP016       L. sativa      Batavia
BSP017       L. sativa      Iceberg
BSP018       L. sativa      Iceberg
BSP019       L. sativa      Iceberg
BSP020       L. sativa      Iceberg
GREENLAKE    L. sativa      Iceberg
SAL          L. sativa      Iceberg
UCD004       L. sativa      Iceberg
UCD006       L. sativa      Iceberg
BSP021       L. sativa      Leafy
BSP022       L. sativa      Leafy
BSP023       L. sativa      Leafy
BSP024       L. sativa      Leafy
BSP025       L. sativa      Leafy
UCD003       L. sativa      Leafy
BSP026       L. sativa      Leafy
BSP027       L. sativa      Leafy
BSP028       L. sativa      Leafy
BSP029       L. sativa      Leafy
BSP030       L. sativa      Leafy
UCD014       L. sativa      Oil
SER          L. serriola    -
UCD010       L. serriola    -
UCD011       L. serriola    -
UCD012       L. serriola    -
UCD013       L. serriola    -
LSALIGNA     L. saligna     -
UCD01        L. saligna     -
UCD017       L. saligna     -
UCD018       L. virosa      -
UCD019       L. virosa      -
UCD020       L. perennis    -
Note: Batavia and Iceberg are sub-classes of the Crisphead class.
Polymorphism within L. sativa is most pertinent to breeding efforts. To survey polymorphism
within this species, SPPs were filtered from the 74,034 previously described to include only
those polymorphic within L. sativa resulting in 8,211 SPPs on 4,412 contigs. The leafy and
Batavia crisphead plant types had the most diversity while iceberg crisphead contained the
least (Table 4). However, 2,343 SPPs were identified even within the iceberg type allowing
distinction of cultivars within this genetically narrow plant type. Several SPPs showed
diversity within one plant type, while being monomorphic in other L. sativa types.
Table 4 Number of SPPs and contigs containing SPPs identified for each class/species by
DP analysis
a)
Species            Contigs    SPPs
L. sativa              394     507
L. serriola            293     431
L. saligna           3,651   6,729
L. virosa            3,315   7,023
L. perennis         13,305  28,774
Total Markers       20,958  43,464
b)
Class              Contigs    SPPs
L. sativa types      4,412   8,211
Butterhead           1,956   3,531
Cos                  1,931   3,420
Batavia              2,135   3,840
Iceberg              1,279   2,343
Leafy                3,467   6,178
Oil only               364     557
a) The number of contigs containing SPPs and the number of SPPs private to the genotypes
within a species. These SPPs distinguish all the genotypes in a species from all others in the
panel. b) The number of contigs containing SPPs and the number of SPPs for each L. sativa
horticultural class. These markers are polymorphic within a class but not specific to that class
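The two definitions used in the Table 4 legend (SPPs private to a species, and SPPs polymorphic within a class) can be expressed as simple tests on the calls for one SPP. The sketch below is an assumption-laden illustration, not the authors' code: the function names and the biallelic "A"/"B" coding are hypothetical, and the example calls are invented, although the panel IDs are taken from Table 3.

```python
from typing import Dict, Optional


def private_species(calls: Dict[str, str], species: Dict[str, str]) -> Optional[str]:
    """Return the species whose genotypes all share one allele while every
    other genotype in the panel carries the other allele, or None."""
    for sp in sorted(set(species.values())):
        inside = {calls[g] for g in calls if species[g] == sp}
        outside = {calls[g] for g in calls if species[g] != sp}
        if len(inside) == 1 and len(outside) == 1 and inside != outside:
            return sp
    return None


def polymorphic_within(calls: Dict[str, str], groups: Dict[str, str], group: str) -> bool:
    """True if more than one allele is present among the members of `group`."""
    return len({calls[g] for g in calls if groups[g] == group}) > 1


# Panel IDs from Table 3; the calls themselves are hypothetical.
species = {
    "SAL": "L. sativa", "UCD004": "L. sativa", "SER": "L. serriola",
    "UCD018": "L. virosa", "UCD019": "L. virosa", "UCD020": "L. perennis",
}
spp_private = {"SAL": "A", "UCD004": "A", "SER": "A",
               "UCD018": "B", "UCD019": "B", "UCD020": "A"}
spp_within = {"SAL": "A", "UCD004": "B", "SER": "A",
              "UCD018": "A", "UCD019": "A", "UCD020": "A"}

private_to = private_species(spp_private, species)
within_sativa = polymorphic_within(spp_within, species, "L. sativa")
```

The same `polymorphic_within` test applies to horticultural classes if the `groups` mapping assigns each panel ID to its class instead of its species.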
A phylogenetic analysis was then performed with the PHYLIP 3.69 package [21]. Using the
filtered 46,237 marker set for the 52 genotypes, a bootstrap consensus tree was constructed
(Figure 7a). A representative phylogram (Figure 7b) estimates the genetic differences
between the genotypes in the panel. The L. sativa genotypes separated into two distinct clades
with 100% bootstrap value separating butterheads from the cos and majority of crisphead
types. The leafy lettuce genotypes showed high variability, falling within both the butterhead
and cos/crisphead clades. Iceberg varieties showed the least amount of polymorphism and
grouped together in a monophyletic clade with 100% bootstrap support. One genotype,
BSP024, which shows an L. sativa-like morphology but has a seed-shattering phenotype
characteristic of the wild species, was positioned between the wild species and the remaining
L. sativa with 100% support. Upon further analysis, branch lengths of L. sativa genotypes
indicated similar divergence from their common ancestor with the exception of BSP024 and
UCD14 (oil type) (See Additional file 7: Figure S7).
Figure 7 Phylogenetic trees estimating the relationship of all individuals in the diversity
panel. a) Dendrogram estimating the relationship of genotypes in the diversity panel.
Bootstrap values indicate the confidence in branch positioning. b) Representative phylogram
showing the relative relatedness of individuals. Each species is monophyletic
Two separate principal component analyses (PCA) were performed, one on the entire diversity
panel and one on L. sativa alone. The first three eigenvalues accounted for 71.4% of the
variation seen across all five species. Each of the three principal components (PCs) was
significant at P < 0.0001 when separated by species, and two-dimensional scatter plots of PC
values for each genotype using the first two PCs showed clear separation of species (Figure 8a).
When considering separation between types within L. sativa, the first three PCs accounted for
27.6% of the variation within the population, with the first and third PCs being significant at
P < 0.0001. A two-dimensional plot (Figure 8b) of the first and third PC values showed some
overlap of types consistent with Figure 7a. Two genotypes, UCD14 and BSP024, were again
outliers and deviated drastically from the rest of the L. sativa genotypes (See Additional file 7:
Figure S7 and Figure 8b).
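To make the "proportion of variation accounted for" concrete, the sketch below computes the first principal component of a small 0/1 marker matrix and the share of total variance it explains. It is a dependency-free stand-in (power iteration on the Gram matrix of the centered data) and not the SAS PRINCOMP procedure used in the study. The marker matrix is hypothetical, the data are centered but not standardized, and only the first component is extracted.

```python
import math
from typing import List, Tuple


def first_pc(markers: List[List[float]], iters: int = 200) -> Tuple[List[float], float]:
    """Return per-genotype scores on PC1 and the fraction of variance PC1 explains."""
    n, m = len(markers), len(markers[0])
    means = [sum(row[j] for row in markers) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in markers]
    # The n x n Gram matrix shares its nonzero eigenvalues with the scatter matrix.
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)] for i in range(n)]
    total = sum(gram[i][i] for i in range(n))
    v = [float(i + 1) for i in range(n)]  # arbitrary starting vector
    for _ in range(iters):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigval = sum(v[i] * sum(gram[i][k] * v[k] for k in range(n)) for i in range(n))
    scores = [c * math.sqrt(eigval) for c in v]
    return scores, eigval / total


# Hypothetical panel: two groups of three genotypes with opposite alleles at
# 40 markers, plus one discordant call per genotype.
base = [j % 2 for j in range(40)]
markers = ([[float(b) for b in base] for _ in range(3)]
           + [[float(1 - b) for b in base] for _ in range(3)])
for i in range(6):
    markers[i][i] = 1.0 - markers[i][i]

scores, explained = first_pc(markers)
```

In this toy panel the sign of the PC1 score separates the two groups, which is the same kind of separation the species-level scatter plot relies on.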
Figure 8 Two dimensional scatter graphs of eigenvalues from principal component
analysis. The first two significant eigenvalues from principal component analyses performed
with SAS software PRINCOMP procedure are plotted against each other to show resolution
within species or classes. a) Eigenvalues significant at P < 0.0001 from principal components
one and two are plotted against each other and show clear resolution of species. b)
Eigenvalues one and three were both significant at P < 0.0001 and were plotted against each
other. The y-axis was altered to show clearer resolution in non-oil genotypes. Batavia and
Leafy classes show distribution through the scatter plot similar to that seen in Figure 7a and
7b
Discussion
Highly parallel genotyping has become an important component of genomics. Hybridization
of genomic DNA and RNA to microarrays has been used in the past for detection of
polymorphisms between genotypes [1,4,5,10,22]. However, the previously available arrays
for complex genomes only provided limited transcriptome coverage. We developed an array
designed to maximize transcriptome coverage while maintaining the possibility of performing
other analyses. Our custom designed Lettuce GeneChip® combined the benefits of
overlapping probes across unigenes, similar to that demonstrated by Gresham et al. [10] for
yeast, with the use of anti-genomic probes to maximize the possible coverage of unigenes
while maintaining the sensitivity to detect polymorphisms and retaining appropriate controls
to normalize and correct for background noise. The tiling path design allows for multiple
assessments of hybridization differences between lines for single positions rather than single
assessments of a few positions as obtained with most expression arrays. We developed
custom scripts for analysis of our hybridization data taking into account the multiple probes
covering a single position as well as filtering out poorly performing probes. We used recent
advances in high throughput sequencing technology to validate our SPP calls as well as filter
out potentially unreliable data.
Genomic DNA and cDNA are two options for hybridization to an array for SFP detection.
The decision of which to use becomes more difficult as genome size and complexity
increase. Genomic DNA and cDNA are both viable targets for species with smaller genomes
such as Arabidopsis [1,5] and rice [3,23]. However, with larger and more complex genomes
such as barley, cDNA was indicated as a more reliable option for hybridization even with the
added difficulty of subtracting out expression effects [4]. The genome of lettuce is nearly 17
times larger than that of Arabidopsis, although it is half the size of that of barley. Given the
difficulty of accounting for spatio-temporal expression effects as seen in cDNA, we focused on
developing methods to use genomic DNA. Rostoks et al. [4] suggested that genomic DNA
may be a feasible target in larger genomes with added replication. With the redundancy of the
overlapping probes in the lettuce array, the need for additional replication was reduced
because they provide technical replicates within a chip. The need for additional replication
may also have been reduced by using elevated amounts of genomic DNA, and the use of
end-labeling rather than BioPrime may have increased the reliability of calls. The protocol used
for hybridization of lettuce genomic DNA was also subsequently highly effective for pepper
(genome size = 3,000 Mb) and other Solanaceae [24]. Furthermore, genomic DNA
is a desirable target because SFPs identified using cDNA may be a result of alternate splicing
or gene expression differences [25]. Rostoks et al. [4] indicated that 40% of the SFPs they
identified may have been falsely called and partially explained them as being mRNA
structural variants. They also reported a high predicted false positive rate of 22% (their
mid-stringency cutoff value) for SFPs detected using genomic DNA. We concluded that
fragmented, end-labeled genomic DNA provided a suitable target for detection of
polymorphisms while reducing false positive sequence polymorphisms (2 to 24% in our
experiments, Table 2).
The overlapping tile design increases the likelihood of detecting polymorphisms due to
redundancy at individual positions, coverage along the contigs and optimal position of the
SNP within a probe [1,11]. Furthermore, the number of probes and hence the possible
genome coverage was increased by substituting mismatch probes with AG probes for
background correction and normalization of data. Because the peripheral 1 to 6 bases of a 25
bp oligonucleotide are less sensitive than the central bases in terms of detecting sequence
polymorphisms [1,4,11] (Figure 4), the tiling strategy reduces the loss of coverage due to
probe position. The number and reliability of SPP calls in our experiments demonstrate that
the overlapping tiling array design has improved coverage, sensitivity and specificity to
detect polymorphisms.
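The benefit of the 2 bp stagger can be checked with simple arithmetic. The sketch below counts how many 25 bp probes cover a given position, and how many of those place it outside the peripheral bases. It assumes an unbroken tiling along one contig, ignores the strand alternation and takes the worst case of 6 insensitive bases at each probe end; the `covering_probes` function and the coordinates are hypothetical.

```python
from typing import List


def covering_probes(pos: int, n_probes: int, probe_len: int = 25,
                    stagger: int = 2, edge: int = 0) -> List[int]:
    """Start coordinates of probes that cover `pos` while keeping it at least
    `edge` bases away from either end of the probe (0-based coordinates)."""
    starts = range(0, n_probes * stagger, stagger)
    return [s for s in starts if s + edge <= pos <= s + probe_len - 1 - edge]


# A contig tiled with 100 probes starting at 0, 2, 4, ..., 198.
all_even = covering_probes(100, 100)             # every probe covering position 100
central_even = covering_probes(100, 100, edge=6)  # position 100 in the central 13 bases
all_odd = covering_probes(101, 100)
central_odd = covering_probes(101, 100, edge=6)
```

Under these assumptions an interior position is covered by 12 or 13 probes, in line with the stated maximum of 13-fold redundancy, and 6 or 7 of them interrogate it with their central bases.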
SPP calls were validated using several approaches. The data from the two pair-wise
comparisons (MSA and SFPdev) yielded 20 to 41 thousand and 27 to 40 thousand SPPs
respectively, depending on the criteria used for specificity and sensitivity. When SPPs from
MSA and SFPdev were compared to the 51,552 SNPs detected between RNAseq reads of
Salinas and US96UC23, 61.5% and 57.8% were found in or within at least 8 bp of the SPP
range respectively, similar to that described by Gresham et al. [10]. However, because of the
high FDR associated with duplicated sequences, SPPs that were found to have a duplicated
locus within the chip assembly, the gene space assembly or the genome assembly were
removed from consideration; one third of the SPPs called that had duplicated loci did not
contain a SNP in any of our validation tests. These identified SPPs likely were due to
differences between paralogs rather than alleles at a single locus. Due to the increased
redundancy (up to 107 genetic replicates) provided by the mapping population of 213 RILs
compared to the pair-wise comparison of the parents, SPPs in the SFPdev and MSA pair-wise
comparisons that coincided with SPPs mapped by Truco et al. [20] but lacked a supporting SNP
were considered real. Removal of duplicated loci and inclusion of mapped SPPs provided a
balance between false positive and false negative rates and allowed us to optimize FDR while
still discovering a high number of SPPs. Taking into consideration the lower observed FDR
(Table 2) we concluded that the MSA method performed best as a pairwise comparison;
however using multiple detection methods would yield a higher confidence in the subset of
SPPs identified by both methods.
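The comparison of SPP ranges with sequence-derived SNPs can be sketched as an interval test with an 8 bp flank. This is a minimal illustration, not the validation pipeline used in the study: it assumes SPP ranges and SNP positions are given in the same per-contig coordinates, and the function names and example values are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def has_nearby_snp(spp: Tuple[int, int], sorted_snps: List[int], flank: int = 8) -> bool:
    """True if a SNP lies inside the SPP range or within `flank` bp of either end.

    `sorted_snps` must be sorted in ascending order.
    """
    start, end = spp
    i = bisect_left(sorted_snps, start - flank)  # first SNP at or after the window start
    return i < len(sorted_snps) and sorted_snps[i] <= end + flank


def validated_fraction(spps: List[Tuple[int, int]], snps: List[int], flank: int = 8) -> float:
    """Fraction of SPP ranges supported by at least one nearby SNP."""
    ordered = sorted(snps)
    hits = sum(has_nearby_snp(spp, ordered, flank) for spp in spps)
    return hits / len(spps)


# Hypothetical SNP positions and SPP ranges on one contig.
snps = [300, 120, 512]
spps = [(100, 110), (125, 131), (280, 290), (505, 509)]
fraction = validated_fraction(spps, snps)
```

Sorting the SNP positions once and using a binary search keeps the check fast when tens of thousands of SPPs are compared against tens of thousands of SNPs.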
The SPPs identified in the diversity panel (DP) that were polymorphic between L. sativa cv.
Salinas (SAL) and L. serriola acc. US96UC23 (SER) showed a low FDR. However, as a
result of the filtering, sensitivity of this analysis (detection of true positives) was reduced
compared to the two-genotype analyses by MSA and SFPdev. Specific analysis of the DP
data for regions containing known SNPs showed that SFPdev values would have been
significant in a pair-wise comparison between SAL and SER, but due to inclusion of data
from all genotypes in the DP, the two were not called as polymorphic (data not shown). The
lack of some called SPPs in the DP may be due to larger genetic differences between L.
perennis, L. virosa, or L. saligna relative to L. serriola and L. sativa. As a result of smaller
hybridization differences between the more closely related genotypes, genotypes differing at
a locus may have been grouped together reducing the number of SPPs called between the two
genotypes. Consequently, the DP analysis showed a lower false positive rate, but a higher
false negative rate when comparing SAL and SER to sequence and mapping data.
As part of our goal was to investigate the diversity and relationships of the genotypes in the
DP, SPPs identified by the DP analysis were evaluated. Removal of SPPs in duplicated
regions with inconsistent data or missing data (see Methods) was a reasonable method of
removing unreliable data as these data may be from poorly performing probes in one or all
replicates, heterozygous loci, paralogous genes or deleted genes. There was not a large
difference in the observed FDRs for the three SFPdev cutoff values (1.2, 1.5 and 2.0) for the
DP analysis; so in order to maximize the number of markers used in our phylogenetic
analysis and principal component analysis, we used the least stringent cutoff value of 1.2. As
the assumptions for analysis with the PHYLIP [21] package were not violated with the large
number of markers, they were left as independent. To meet the constraints of the PC analysis
software, markers were limited to those that were mapped.
The markers discovered in our DP analysis were used to generate a phylogenetic tree
showing species separation with 100% bootstrap support. L. virosa and L. saligna are
sexually incompatible with L. sativa [26] and appear to be more closely related to
each other than to other species in the DP. Our data support the conclusion by Kesseli et al.
[27], that these two species are not progenitors of L. sativa. By limiting markers to those
polymorphic within cultivated lettuce we are able to separate most genotypes into classes
representing each of the plant types. The butterhead type formed a distinct clade from the
iceberg and cos types with 100% bootstrap support. However, the leafy type and the Batavia
type both showed a wide distribution across the L. sativa clade. This is not unexpected and
may reflect admixture between types during breeding programs. Alternatively, this
distribution may indicate that these types are artificial polyphyletic groups based on loose
morphological criteria [28]. The leafy types are non-heading with a broad range of leaf
morphology [28,29]. Batavia types vary from heading to non-heading phenotypes. Batavia
and iceberg cultivars are both considered crisphead types [28]; however our phylogenetic and
PC analyses showed that the two did not cluster together and are significantly different from
each other (See Additional file 8: Table S1).
Rapid advancements in sequencing technology today are changing the methods for genetic
analyses. The microarray technology presented in this paper yielded an in-depth analysis of
diversity in lettuce germplasm, separating even closely related lines such as those in the
crisphead class. It also has several other potential uses, including detection of copy number
variants, splice site identification, expression analysis and use with other species within the
Compositae.
The SPPs identified in this study were highly reproducible and showed similar false positive
results to current sequencing methods in the literature [30,31]. This technology has also been
used to create an ultra-dense, inter-specific genetic map between L. sativa cv. Salinas and L.
serriola acc. US96UC23 to dissect phenotypic traits as well as validate and align genomic
assemblies of lettuce into chromosomal linkage groups [20].
Conclusion
We designed and exploited a custom lettuce microarray that uses an overlapping tiling path and
anti-genomic probes rather than mismatch probes to provide comprehensive unigene
coverage. Our protocols for genomic DNA preparation and labeling, assisted by positional
rather than feature-based analyses, reliably identified DNA polymorphisms using both pair-wise
genotype comparisons and a highly parallel comparison within a diverse panel of
genotypes that included five species and focused on the cultivated L. sativa. The phylogenetic
and principal component analyses clearly distinguished species while the analysis of L. sativa
supports previous analyses of cultivated lettuce and revealed differences among the more
heterogeneous horticultural types as well as polymorphisms within the most genetically
narrow type.