Article

Using RNA-Seq for gene identification, polymorphism detection and transcript profiling in two alfalfa genotypes with divergent cell wall composition in stems

USDA-Agricultural Research Service, Plant Science Research Unit, St. Paul, MN 55108, USA.
BMC Genomics (Impact Factor: 3.99). 04/2011; 12(1):199. DOI: 10.1186/1471-2164-12-199
Source: PubMed
ABSTRACT
Alfalfa [Medicago sativa (L.) sativa], a widely grown perennial forage, has potential for development as a cellulosic ethanol feedstock. However, the genomics of alfalfa, a non-model species, is still in its infancy. The recent advent of RNA-Seq, a massively parallel sequencing method for transcriptome analysis, provides an opportunity to expand the identification of alfalfa genes and polymorphisms, and to conduct in-depth transcript profiling.
Cell walls in stems of alfalfa genotype 708 have higher cellulose and lower lignin concentrations compared to cell walls in stems of genotype 773. Using the Illumina GA-II platform, a total of 198,861,304 expressed sequence tags (ESTs, 76 bp in length) were generated from cDNA libraries derived from elongating stem (ES) and post-elongation stem (PES) internodes of 708 and 773. In addition, 341,984 ESTs were generated from ES and PES internodes of genotype 773 using the GS FLX Titanium platform. The first alfalfa (Medicago sativa) gene index (MSGI 1.0) was assembled using the Sanger ESTs available from GenBank, the GS FLX Titanium EST sequences, and the de novo assembled Illumina sequences. MSGI 1.0 contains 124,025 unique sequences including 22,729 tentative consensus sequences (TCs), 22,315 singletons and 78,981 pseudo-singletons. We identified a total of 1,294 simple sequence repeats (SSRs) among the sequences in MSGI 1.0. In addition, a total of 10,826 single nucleotide polymorphisms (SNPs) were predicted between the two genotypes. Out of 55 SNPs randomly selected for experimental validation, 47 (85%) were polymorphic between the two genotypes. We also identified numerous allelic variations within each genotype. Digital gene expression analysis identified numerous candidate genes that may play a role in stem development as well as candidate genes that may contribute to the differences in cell wall composition in stems of the two genotypes.
Our results demonstrate that RNA-Seq can be successfully used for gene identification, polymorphism detection and transcript profiling in alfalfa, a non-model, allogamous, autotetraploid species. The alfalfa gene index assembled in this study, and the SNPs, SSRs and candidate genes identified can be used to improve alfalfa as a forage crop and cellulosic feedstock.

RESEARCH ARTICLE Open Access
Using RNA-Seq for gene identification,
polymorphism detection and transcript profiling
in two alfalfa genotypes with divergent cell wall
composition in stems
S Samuel Yang1*, Zheng Jin Tu2, Foo Cheung3,5, Wayne Wenzhong Xu2, JoAnn FS Lamb1,4,
Hans-Joachim G Jung1,4, Carroll P Vance1,4* and John W Gronwald1,4*
* Correspondence: sam.yang@ars.usda.gov; carroll.vance@ars.usda.gov;
john.gronwald@ars.usda.gov
1 USDA-Agricultural Research Service, Plant Science Research Unit, St. Paul,
MN 55108, USA
Full list of author information is available at the end of the article
Yang et al. BMC Genomics 2011, 12:199
http://www.biomedcentral.com/1471-2164/12/199
© 2011 Yang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Background
The advent of next-generation high-throughput sequencing
has revolutionized the analysis of genomes and
transcriptomes [1-5]. When applied to the transcrip-
tome, this methodology is referred to as RNA-Seq (RNA
sequencing). RNA-Seq has been used for gene annota-
tion, expression analysis and SNP discovery [6,7]. This
methodology has also proven useful for discovery of
novel transcripts (coding and non-coding) and identifi-
cation of alternative splice variants [5,8]. It is expected
that RNA-Seq methodologies will supersede microarrays
for transcript profiling because of higher sensitivity,
base-pair resolution and the larger range of expression
values that can be detected [3,5,9]. Furthermore, in con-
trast to microarrays, RNA-Seq does not require prior
knowledge of gene sequences. However, RNA-Seq pre-
sents bioinformatic challenges because of the required
assembly of millions of short sequence reads that are
generated by the methodology.
RNA-Seq has been successfully used for annotation,
transcript profiling and/or SNP discovery in a number of
plant species. For model plant species with sequenced
genomes, sequence reads can be mapped to the reference
genome. The model species where RNA-Seq analysis has
been applied include Arabidopsis [10,11], soybean
[12,13], rice [14], maize [15] and Medicago truncatula
[16]. There are also examples of the application of RNA-Seq
to non-model plant species that lack a reference genome.
In the absence of a reference genome, de novo
assembly of sequence reads into contigs is required.
RNA-Seq has been used for transcript profiling in
Eucalyptus grandis [17], grape (Vitis vinifera L.) [18],
California poppy (Eschscholzia californica) [11], avocado (Persea
americana) [11], Pachycladon enysii [19] and Artemisia
annua [20]. In Eucalyptus grandis and rape (Brassica
napus), RNA-Seq was used for SNP discovery [17,21].
Alfalfa is the most widely cultivated forage legume in
the world and the fourth most widely grown crop in the
US [22,23]. In addition to its value as a livestock feed,
alfalfa also has potential as a cellulosic ethanol feedstock
[24,25]. Alfalfa is an allogamous autotetraploid with
complex polysomic inheritance [26-28]. Slow progress
has been made in improving the agronomic traits of this
species using traditional breeding approaches based on
phenotypic selection. For the most part, genomic
approaches for crop improvement (e.g., molecular
breeding) have not been applied to this legume because
of limited genomic resources. As of February 2010,
there were 12,371 alfalfa ESTs available in the public
database. A few SSRs have been detected but SNPs have
not yet been identified [28-30]. Recently, we reported on
the results of transcript profiling and single feature
polymorphism (SFP) detection in alfalfa using the Medicago
GeneChip as a cross-species platform [25,31]. The
Medicago GeneChip contains probe sets designed for
the model plant, Medicago truncatula, a diploid relative
of alfalfa. Using a method based on probe affinity
differences and affinity shape power, we identified over
10,000 SFPs in the stem internodes of alfalfa genotypes
252 and 1283 that differed in cellulose and lignin
concentrations in cell walls [31]. In a subsequent study
using the Medicago GeneChip for transcript profiling of
alfalfa genotypes 252 and 1283, interspecies variable
regions and SFPs were masked prior to data analysis
resulting in a 2-fold increase in the number of differentially
expressed genes detected in stem internodes of the
two genotypes [25]. Although the research of Yang et al.
[25,31] significantly advanced alfalfa genomics, the use
of a cross-species platform for microarray analysis limits
the sensitivity and specificity of transcriptome analysis
and polymorphism detection.
The stem tissue of alfalfa is important in determining
the value of this forage as a livestock feed and cellulosic
feedstock. Increasing the cellulose and decreasing the
lignin content in cell walls in stems would improve
alfalfa for both uses. In this study, we applied RNA-Seq
to gene identification, polymorphism detection and
transcript profiling of two alfalfa clonal lines (708, 773) that
differ in cell wall composition in stems. The results
were used to assemble the first gene atlas for alfalfa
(MSGI 1.0). Our research also provides the first report
of high-throughput SNP detection and digital gene
expression analysis in the alfalfa transcriptome.
Results and discussion
Cell wall composition of stems of genotypes 708 and 773
The alfalfa genotypes 708 and 773 used in this study were
selected for divergent cell wall composition in stems
under field conditions (see Methods for details). Cell wall
composition of greenhouse grown stems used for RNA
sampling in the current study is shown in Table 1. Cell
wall concentration in stems of the two clones did not differ.
In contrast, cellulose content (defined as glucose) in
the stems of genotype 708 was 5.2% greater compared to
genotype 773 (p < 0.05) (Table 1). In addition, galactose
and mannose concentrations were 14.2% (p < 0.05) and
8.5% (p < 0.01) greater, respectively, in stems of genotype
708 compared to genotype 773 (Table 1). Klason lignin
concentration in the cell wall was 8.0% greater in stems
of 773 compared to stems of 708 (p < 0.05) (Table 1).
These genotypes consistently displayed differences in cell
wall cellulose and lignin content in stems when plants
were grown under different field environments (Figure 1)
and in the greenhouse (Table 1).
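The percentage differences quoted above can be checked against the least-square means in Table 1. The following is a minimal sketch, assuming each difference is expressed relative to the genotype with the lower value; the function names are illustrative and not part of the original analysis. Because Table 1 reports rounded means, the galactose figure recomputed this way (about 14.3%) differs slightly from the 14.2% given in the text.

```python
# Least-square means from Table 1 (g/kg cell wall): (genotype 708, genotype 773)
table1 = {
    "Klason lignin": (162, 175),
    "Glucose": (443, 421),
    "Galactose": (32, 28),
    "Mannose": (33.1, 30.5),
}

def percent_greater(higher, lower):
    """Relative excess of `higher` over `lower`, in percent."""
    return 100.0 * (higher - lower) / lower

def relative_differences(table):
    """For each component, report which genotype is higher and by how much."""
    out = {}
    for name, (g708, g773) in table.items():
        if g708 >= g773:
            out[name] = ("708", percent_greater(g708, g773))
        else:
            out[name] = ("773", percent_greater(g773, g708))
    return out

diffs = relative_differences(table1)
for name, (genotype, pct) in diffs.items():
    print(f"{name}: genotype {genotype} higher by {pct:.1f}%")
```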
RNA-Seq using the Illumina GA-II platform
For RNA-Seq analysis, we developed a total of four
cDNA libraries derived from elongating stem (ES) and
post-elongation stem (PES) internodes of alfalfa genotypes
708 and 773 (see Methods for details). In alfalfa stems,
genes associated with primary cell wall development are
preferentially expressed in ES internodes while genes asso-
ciated with secondary xylem development are enriched in
PES internodes [25]. For sequencing by synthesis using the
Illumina GA-II platform, cDNA libraries 708ES, 708PES
and 773ES were run on two lanes per library while the
773PES library was run on one lane. A total of
234,908,899 EST reads were generated by a single run of
76 cycles. After filtering low quality reads, a total of
198,861,304 reads (76-bp in size) were selected for further
analysis (see Methods for details). The Illumina reads
generated in this study are available at the NCBI SRA
browser (accession number GSE26757;
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26757).
De novo assembly of short RNA-Seq reads without a
known reference is a challenging task, especially for
alfalfa, an allogamous autotetraploid with complex polysomic
inheritance. In this study, we used the Velvet algorithm
[32] for de novo assembly of the 198,861,304
Illumina reads (76 bp) into a total of 132,153 unique
sequences with an average length of 284 bp (Additional
file 1). The Velvet algorithm has also been used successfully
for de novo transcriptome assembly in previous studies
[33,34]. The Velvet algorithm was originally
developed for de novo assembly of genome sequences
where the coverage is expected to be homogeneous
throughout the genome. However, the coverage of transcripts
is highly heterogeneous due to differences in gene
expression. Previous studies showed that de novo assembly
using the Velvet program with longer k-mers results
in a more contiguous transcript assembly but lower transcript
diversity compared to shorter k-mers [32,33].
Although several recent studies introduced new algorithms
and methodologies developed for de novo transcriptome
assembly [35-38], a consensus standard
protocol has not yet emerged for de novo transcriptome
assembly. In this study, we optimized our Velvet de novo
transcriptome assembly to favor transcript contiguity
with high specificity as opposed to increased transcript
diversity (see Methods for details). To compensate for the
limitation of the high k-mer that we selected for the Velvet
assembly in this study (lower diversity and probably
biased toward highly expressed genes), we generated
additional ESTs using the GS FLX Titanium platform.
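The trade-off between long and short k-mers can be illustrated with a toy example. The sketch below is not Velvet and does not reproduce its graph construction; it only shows that a 76-bp read yields 76 - k + 1 k-mers and that two reads can be joined in a k-mer graph only if they share at least one k-mer, so a longer k demands a longer exact overlap. The k values and the two short sequences are illustrative assumptions, not values taken from this study.

```python
def kmers(read, k):
    """Return all overlapping substrings of length k in `read`."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def shares_kmer(read_a, read_b, k):
    """True if the two reads have at least one k-mer in common,
    i.e. a k-mer graph built with this k could connect them."""
    return bool(set(kmers(read_a, k)) & set(kmers(read_b, k)))

READ_LENGTH = 76  # Illumina read length used in this study
for k in (21, 31, 51):  # illustrative k values only
    print(f"k={k}: {READ_LENGTH - k + 1} k-mers per read")

# Two toy reads overlapping by 5 bases ("GGTCA").
read_a = "ACGTACGGTCA"
read_b = "GGTCATTTGCA"
print(shares_kmer(read_a, read_b, 4))  # a 5-base overlap contains 4-mers
print(shares_kmer(read_a, read_b, 6))  # but no complete 6-mer
```

With a long k, weakly covered transcripts whose reads overlap only briefly remain unconnected, which is consistent with the lower transcript diversity described above.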
RNA-Seq using the GS FLX Titanium platform
We generated a total of 341,984 additional ESTs (average
length 243 bp, minimum length 40 bp, maximum length
792 bp) using the GS FLX Titanium platform
(http://www.454.com). The additional EST sequences were
generated from the cDNA libraries derived from ES (124,533
ESTs, average length 230 bp) and PES (217,451 ESTs,
average length 256 bp) internodes of the genotype 773.
The additional ESTs obtained using the GS FLX Tita-
nium platform increased the diversity of transcripts dis-
covered and hence provided broader coverage of the
alfalfa transcriptome than would have been achieved
based on the de novo assembly of the Illumina reads
alone. The additional ESTs are also available at the NCBI
SRA browser (accession number GSE26757;
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26757).
Alfalfa Gene Index 1.0 (MSGI 1.0)
We used the Gene Index Assembly protocol [39,40]
for reference transcriptome assembly in alfalfa. This
[Figure 1: scatter plot with regression lines; image not reproduced. x-axis: Environmental Index;
y-axes: Stem Cellulose Concentration (g/kg dry matter) and Stem Klason Lignin Concentration (g/kg dry matter).
Regression lines: Clone 708 stem cellulose, y = 305 + 1.2x, r² = 0.94; Clone 773 stem cellulose, y = 284 + 1.2x, r² = 0.94;
Clone 773 stem Klason lignin, y = 169 + 1.1x, r² = 0.80; Clone 708 stem Klason lignin, y = 146 + 1.0x, r² = 0.84.]
Figure 1 Regression analyses of cellulose and Klason lignin
concentrations in stems of two alfalfa genotypes. The stems of
genotype 708 were consistently higher in cellulose and lower in
Klason lignin compared to stems of genotype 773 across twelve
environmental indexes (field environments). The high r² values for
all regression lines suggest that genotypic differences in stem
cellulose and Klason lignin concentrations were environmentally
stable.
Table 1 Comparison of cell wall components in stems of
genotypes 708 and 773 on a cell wall basis
Component       Genotype 708   Genotype 773   SEM    p-value
                (g kg^-1 cell wall)
Klason lignin   162            175            2      p < 0.05
Glucose         443            421            2      p < 0.05
Xylose          137            149            3      NS
Arabinose       39             39             1      NS
Galactose       32             28             1      p < 0.05
Mannose         33.1           30.5           0.1    p < 0.01
Rhamnose        11.5           11.4           0.4    NS
Fucose          3.01           3.1            0.03   NS
Uronic acids    139            142            6      NS
Values are least square means based on an analysis of variance with three
biological replicates for each clone arranged in a randomized complete block
design (see Methods for details). SEM = Standard error of mean, NS = Non-significant
(p > 0.05).
protocol has been used for over a decade to build uni-
gene assemblies for numerous species of animals, plants
and microorganisms
(http://compbio.dfci.harvard.edu/tgi/plant.html).
However, no gene index is currently available
for alfalfa. In this study, the first alfalfa (Medicago
sativa) gene index (MSGI 1.0) was built by combining
the de novo assembled Illumina reads using the Velvet
program (132,153 sequences), the 341,984 ESTs
obtained using the GS FLX Titanium platform, and
12,371 Sanger ESTs for alfalfa available in the public
database (http://www.ncbi.nlm.nih.gov) following the
Gene Index Assembly protocol previously described
[39,40].
MSGI 1.0 contains a total of 124,025 unique
sequences including 22,729 tentative consensus
sequences (TCs), 22,315 singletons and 78,981 pseudo-
singletons (Additional file 2). Pseudo-singletons refer to
the de novo assembled Illumina sequences that were not
assembled into contigs during the Gene Index Assembly
process. The average length of the unique sequences in
MSGI 1.0 is 384 bp. Unique sequence lengths ranged
from 100 to 6,956 bp with more than 10,000 sequences
larger than 800 bp. The total base count of the
sequences in MSGI 1.0 is 47,628,953 bp. The newly
built alfalfa gene index increases the number of alfalfa
sequences publicly available by about 10-fold.
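The summary statistics for MSGI 1.0 are internally consistent, as the short check below shows. It uses only the counts quoted in the text; the variable names are illustrative.

```python
# Counts reported for MSGI 1.0 in the text
tentative_consensus = 22_729
singletons = 22_315
pseudo_singletons = 78_981
total_bases = 47_628_953
sanger_ests = 12_371  # alfalfa ESTs in the public database before this study

# The three sequence classes should add up to the reported total.
unique_sequences = tentative_consensus + singletons + pseudo_singletons

# Mean sequence length and the increase over the earlier public EST count.
mean_length = total_bases / unique_sequences
fold_increase = unique_sequences / sanger_ests

print(unique_sequences)
print(f"mean length: {mean_length:.0f} bp")
print(f"fold increase: {fold_increase:.1f}")
```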
Gene annotation and functional classification
We assigned putative functions for the unique
sequences in MSGI 1.0 by conducting BlastX searches
against the non-redundant (NR) protein database (e-value
cutoff of 1e-10) (Additional file 3). Putative functions
could be assigned for about 83% of the sequences.
We also assigned gene ontology (GO) functional classes
and MapMan functional classifications [41] to the
unique sequences in MSGI 1.0 (Additional file 3) (see
Methods for details). To examine whether bias occurs
among the functional classes represented in MSGI 1.0,
we compared the percentages of each GO functional
class and pathway in MSGI 1.0 with the percentages
found in the M. (Medicago) truncatula Gene Index
(MTGI 9.0), the M. truncatula coding sequences (Mt3.0
cds) and the Arabidopsis coding sequences (At cds)
(Figure 2). Although most of the sequences in MSGI 1.0
were derived from stem tissues, similar levels of representation
of most functional classes were found in
[Figure 2: four bar-chart panels; image not reproduced. Panels: GO Biological Process Class, Pathway,
GO Cellular Component Class and GO Molecular Function Class. y-axis: Percentage. Series: MSGI, MTGI, Mt3_cds and At_cds.]
Figure 2 Comparison of percentage distribution of gene ontology and pathway classifications using four reference databases. The
percentage distributions of gene ontology (GO) classes and pathways are shown for the following reference databases: (1) the Medicago sativa
Gene Index (MSGI 1.0) assembled in this study, (2) the Medicago truncatula Gene Index (MTGI 9.0), (3) the M. truncatula coding sequences (Mt3.0
cds), and (4) the Arabidopsis coding sequences (At cds).
MSGI 1.0 and the other databases (MTGI 9.0, Mt3.0
cds, and At cds). These results suggest that MSGI 1.0
can serve as a reference sequence database for genomic
analysis in alfalfa.
SSR detection
We detected simple sequence repeats (SSRs) among
sequences in MSGI 1.0 using the MISA program [42]
(see Methods for details). A total of 1,294 SSRs were
identified among 1,245 sequences which represents
about 1.7% of the total unique sequences in MSGI 1.0
(Additional file 4). The estimated frequency of SSRs
among the expressed sequences was one SSR per 37 kb.
SSR detection frequency is dependent on the SSR detec-
tion parameter [43]. The SSR frequency measured in
this study is significantly lower than that detected in
other species (one SSR per 11 kb) where the same SSR
detection parameter was used [40]. The significantly
reduced SSR detection frequency found in MSGI 1.0
sequences may be due to the reduced detection effi-
ciency of short length sequences (384 bp on average for
MSGI 1.0). Alternatively, the SSR frequency among
expressed sequences may be lower in alfalfa compared
to other species. SSRs with mono-, di-, tri-, tetra-,
penta- and hexanucleotide repeats composed about
5.4%, 30.4%, 47.2%, 10.6%, 3.9% and 2.5% of the SSRs in
MSGI 1.0, respectively. Using the default parameters of
the Primer3 program [44], we designed SSR primers
spanning a total of 664 SSRs (Additional file 4).
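As a rough illustration of how perfect SSRs can be located in a sequence, the sketch below scans for tandem repeats of 1 to 6 bp motifs with a regular expression. It is not the MISA program, and the minimum repeat counts in MIN_REPEATS are assumed values chosen for the example, since the detection parameters are given in Methods rather than here. The last two lines recompute the SSR density from the totals reported for MSGI 1.0.

```python
import re

# Assumed minimum number of repeats per motif length (hypothetical values).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def is_primitive(motif):
    """False if the motif is itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % d == 0 and motif[:d] * (n // d) == motif
                   for d in range(1, n))

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least min_rep - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)

example = "GGC" + "AT" * 7 + "CCGTC" + "GAA" * 5 + "T"
ssrs = find_ssrs(example)
print(ssrs)

# SSR density implied by the totals reported for MSGI 1.0.
ssr_density_kb = 47_628_953 / 1_294 / 1000
print(f"one SSR per {ssr_density_kb:.1f} kb")
```

The density works out to just under 37 kb per SSR, in line with the estimate quoted above.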
SNP detection
To identify SNPs between alfalfa genotypes 708 and 773,
Illumina EST reads from ES and PES internode libraries
were combined for each genotype. The combined ES
and PES reads for each genotype were independently
aligned to the MSGI 1.0 sequences using the Maq program
[45]. From the alignment output of each genotype,
we summarized the depth (frequency) of each nucleotide
(A, G, C, or T) at each base position in each reference
sequence. Next, to reduce the identification of false
positive SNPs, we filtered potential SNPs using a strin-
gent nucleotide depth cutoff of 10 [e.g., at least 10 ade-
nines (A) in one genotype vs. at least 10 guanines (G) in
the other genotype] for each genotype (see Methods for
details). Using this protocol, we identified 10,826 SNPs
between genotypes 708 and 773 in 7,282 sequences in
MSGI 1.0 (Additional file 5). About 74% of these
sequences contained a single SNP while about 2.3% con-
tained 5 or more SNPs.
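The depth-based filter can be sketched as follows. This is one possible reading of the rule described above, not the authors' implementation: a site is reported when each genotype has at least one nucleotide supported by 10 or more reads and the two sets of supported nucleotides do not overlap. The sequence identifiers and depth values are made up for illustration.

```python
from collections import Counter

MIN_DEPTH = 10  # nucleotide depth cutoff quoted in the text

def supported_bases(counts, min_depth=MIN_DEPTH):
    """Bases observed at least `min_depth` times at one position."""
    return {base for base, depth in counts.items() if depth >= min_depth}

def between_genotype_snps(depth_a, depth_b, min_depth=MIN_DEPTH):
    """depth_a and depth_b map (sequence_id, position) to a Counter of base depths.

    Assumed rule: both genotypes have at least one well-supported base at the
    site, and the well-supported bases of the two genotypes are disjoint.
    """
    snps = []
    for site in sorted(depth_a.keys() & depth_b.keys()):
        bases_a = supported_bases(depth_a[site], min_depth)
        bases_b = supported_bases(depth_b[site], min_depth)
        if bases_a and bases_b and bases_a.isdisjoint(bases_b):
            snps.append((site, sorted(bases_a), sorted(bases_b)))
    return snps

# Hypothetical per-site nucleotide depths for the two genotypes.
depth_708 = {
    ("seq1", 42): Counter(A=25, G=1),   # A well supported
    ("seq1", 77): Counter(C=12),
    ("seq2", 5): Counter(A=15, G=11),   # two alleles within the genotype
}
depth_773 = {
    ("seq1", 42): Counter(G=18),        # G well supported: candidate SNP
    ("seq1", 77): Counter(C=9, T=4),    # below the depth cutoff
    ("seq2", 5): Counter(G=20),         # G shared with 708: not reported
}
snps = between_genotype_snps(depth_708, depth_773)
print(snps)
```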
To validate the SNPs that were predicted using the
RNA-Seq data generated in this study, we randomly
selected 55 SNPs. Genomic DNAs purified from genotypes
708 and 773 were genotyped by MALDI-TOF
mass spectrometry using the iPLEX Gold spectrometry
system (http://www.sequenom.com). Out of 55 SNPs
tested, 47 (85%) were polymorphic between the two
genotypes (Additional file 6). In addition to genotypes
708 and 773, we also genotyped 51 additional alfalfa (M.
sativa) genotypes selected from different populations of
M. sativa ssp. sativa or M. sativa ssp. falcata. The 47
validated SNPs between 708 and 773 also showed poly-
morphism among the other Medicago genotypes tested
(Additional file 6). This suggests that the SNPs predicted
in this study can also be used for genotyping in other
alfalfa genotypes.
In a previous study that described single-feature
polymorphism (SFP) discovery in alfalfa using the Medicago
GeneChip as a cross-species platform [31], we proposed
candidate gene-based association mapping for selecting
alfalfa germplasm with modified cell wall composition in
stems. In this study, SNPs were also identified in genes
from various functional classes including numerous cell
wall-related genes (Figure 3A). For example, SNPs were
identified in 14 genes involved in cellulose biosynthesis
including 11 cellulose synthase and three COBRA genes
[46] (Figure 3A). In addition, SNPs were identified in 21
lignin pathway genes, 20 genes involved in cell wall precursor
pathways (Figure 3A) and in numerous regulatory
genes including various transcription factor families, sig-
nalling genes and hormone genes (Figure 3B).
To detect functional classes over- or under-repre-
sented among the SNP-harboring genes, we performed
Fisher's exact test with Bonferroni correction (z-value
cutoff = 1) as previously described [31] (Additional file
7). The functional classes over-represented among SNP-
harboring genes included photosynthesis, cell wall,
amino acid metabolism, stress response (biotic and abio-
tic), nodulin-like, protein synthesis and WRKY tran-
scription factor classes (Additional file 7). The SNPs
developed in this study can be used for either candidate
gene-based or whole genome scanning association mapping
studies to identify SNPs associated with cell wall
traits in alfalfa stems. With further development, the
SNPs identified in this study may prove to be useful in
molecular breeding programs focused on improving
alfalfa as a forage crop and biomass feedstock via mar-
ker-assisted selection.
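The over- and under-representation analysis rests on a 2x2 contingency table per functional class. The sketch below is a generic, standard-library implementation of a two-sided Fisher's exact test followed by a Bonferroni adjustment; it is not the MapMan-based procedure used in the study, it ignores the z-value cutoff mentioned above, and the example tables are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: SNP-harboring genes in the class    b: SNP-harboring genes outside it
    c: other genes in the class            d: other genes outside it
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x class members among the row1 genes.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

# Hypothetical tables for two functional classes.
raw = [fisher_exact_two_sided(8, 2, 1, 9), fisher_exact_two_sided(3, 1, 1, 3)]
adjusted = bonferroni(raw)
print(raw, adjusted)
```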
In this study, we also identified allelic variations
(SNPs) within genotypes. Using a minimum SNP depth
cutoff of 10, we detected 287,555 and 168,966 allelic
variations (SNPs) within genotypes 708 and 773, respec-
tively (Additional files 8 and 9). These SNPs within gen-
otype were detected in 55,320 and 33,406 sequences for
genotypes 708 and 773, respectively. Detection of allelic
variations (SNPs) within genotypes is as important
as detecting SNPs between genotypes for understanding
phenotypic differences (e.g. cell wall composition) and
for future applications such as marker-assisted selection.
[Figure 3: MapMan overview diagrams; image not reproduced. Panel A (metabolism) bin labels include HRGPs, AGPs, extensins,
pectin esterases, expansins/XETs, cellulose biosynthesis, hemicellulose biosynthesis, cell wall degradation, cell wall precursor synthesis,
lignin genes, laccases, callose, raffinose, trehalose, other minor CHO, FA synthesis, phospholipid synthesis, lipid degradation,
starch synthesis, starch degradation, Calvin cycle, glycolysis and exotics. Panel B: regulation.]
Figure 3 MapMan overview of cellular metabolism (A) and regulation (B) showing SNP-harboring genes and SNP frequencies.
Individual genes are represented by small squares. The SNP frequency for each gene is indicated by the intensity of the blue color on a 0 to 3
scale. Dark blue (scale intensity 3) indicates genes with three or more SNPs. A complete list of SNP-harboring genes, corresponding MapMan
functional categories and SNP frequencies are provided in Additional file 5.
Comparison of MSGI 1.0 and Mt3.0 cds as reference
sequences for digital transcript profiling
The alfalfa gene index (MSGI 1.0) developed in this
study provides a reference sequence database that can
be used for digital gene expression analysis in alfalfa.
However, another option for RNA-Seq analysis in alfalfa
is to use Mt3.0 cds as a reference sequence because M.
truncatula and alfalfa share significant coding sequence
homology [25]. Furthermore, sequences in Mt3.0 cds are
full-length sequences (predicted gene models) with better
coverage than sequences in MSGI 1.0 where the
majority are partial sequences. As an initial step to evaluate
the utility of MSGI 1.0 and Mt3.0 cds as reference
sequences for transcript profiling of alfalfa, the Illumina
EST reads generated in this study were mapped to
MSGI 1.0 and Mt3.0 cds sequences using the Bowtie
program [47] (see Methods for details). On average,
about 70% of the EST reads in each library (708 ES, 773
ES, 708 PES, and 773 PES) could be mapped to the
MSGI 1.0 sequences. In contrast, only about 30% of the
EST reads could be mapped to the Mt3.0 cds sequences
(data not shown). We measured the raw digital expression
counts for each gene by quantifying the number of
EST reads that were mapped to each reference
sequence. The raw digital gene expression counts were
normalized using the RPKM (reads per kilobase per million reads) method
[1,48] to correct the digital gene expression counts for
bias caused by reference sequence size and total EST
numbers per library (see Methods for details).
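RPKM normalization divides a raw read count by the reference sequence length in kilobases and by the library size in millions of reads. A minimal sketch follows; the function name and the example numbers are illustrative, and whether the library size counts all reads or only mapped reads is specified in Methods, so `total_reads` here simply stands for whichever total is used.

```python
def rpkm(read_count, seq_length_bp, total_reads):
    """Reads per kilobase of reference sequence per million reads in the library."""
    if seq_length_bp <= 0 or total_reads <= 0:
        raise ValueError("sequence length and library size must be positive")
    # read_count / (length / 1e3) / (total / 1e6) == read_count * 1e9 / (length * total)
    return read_count * 1e9 / (seq_length_bp * total_reads)

# Illustrative values: 500 reads on a 2,000-bp sequence in a 25-million-read library.
print(rpkm(500, 2000, 25_000_000))
# The same read count on a sequence half as long gives twice the RPKM.
print(rpkm(500, 1000, 25_000_000))
```

Because many MSGI 1.0 sequences are short partial transcripts, the length term matters: the same number of reads mapped to a 384-bp sequence yields a much higher RPKM than on a full-length gene model.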
Further evaluation of MSGI 1.0 and Mt3.0 cds as
reference sequence databases for alfalfa was conducted
by comparing RNA-Seq data with the previously generated
GeneChip data for the same stem tissues but in different
alfalfa genotypes [25] (see Methods for details).
The RNA-Seq data generated using MSGI 1.0 or Mt3.0
cds showed a linear relationship with GeneChip data
with similar Pearson correlation coefficients (R = 0.89
and R = 0.87, respectively) (Figure 4A and 4B). A total
of 1,254 genes were commonly selected between RNA-Seq
and GeneChip data when MSGI 1.0 was used as
reference sequences (Figure 4A). However, when Mt3.0
cds was used as reference sequences, the number of
genes commonly-selected between RNA-Seq and Gene-
Chip data decreased to 337 reflecting a significant
decrease in detection sensitivity (Figure 4B). This is not
surprising because, as described above, only about 30%
of the EST reads could be mapped to the Mt3.0 cds
while about 70% of the EST reads could be mapped to
the MSGI 1.0 (data not shown).
As a final evaluation of MSGI 1.0 and Mt3.0 cds as
reference sequences for digital gene expression analysis
in alfalfa, we compared the digital gene expression data
generated using MSGI 1.0 and Mt3.0 cds sequences
with real-time quantitative RT-PCR (qRT-PCR) data
obtained from 97 genes (63 randomly selected, 34 cell
wall genes) (Additional file 10) (see Methods for details).
Previous studies showed a linear relationship between
ΔΔCT values from qRT-PCR and the log gene expression
ratio obtained in microarray analysis [25,49,50]. We
plotted ΔΔCT values obtained from the qRT-PCR data
for randomly selected genes against Log2(708ES/773ES)
values from the RNA-Seq data with MSGI 1.0 or Mt3.0
cds as reference sequences. The results showed a linear
relationship between qRT-PCR data and the RNA-Seq
data using both reference sequences. However, using
MSGI 1.0 increased the Pearson correlation coefficient
(R) from 0.63 to 0.85 (Figure 4C). Next, we plotted
ΔΔCT values obtained from the qRT-PCR data for
selected cell wall genes against Log2(708PES/773PES)
values from the RNA-Seq data. Using MSGI 1.0 as the
reference sequence database also increased the Pearson
correlation coefficient (R) for selected cell wall genes
from 0.45 to 0.76 (Figure 4D). On the basis of these
results, we chose to use MSGI 1.0 as reference
sequences for digital gene expression analysis of stems
of alfalfa genotypes 708 and 773.
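The comparison between qRT-PCR and RNA-Seq comes down to a Pearson correlation between ΔΔCT values and log2 expression ratios. The sketch below shows that calculation with the standard library only. The pseudocount added before taking the ratio is an assumption made here to avoid division by zero and is not described in the text, and all numeric inputs are made-up values for illustration, not data from this study.

```python
from math import log2, sqrt

def log2_ratio(rpkm_a, rpkm_b, pseudocount=1.0):
    """log2 expression ratio; the pseudocount is an assumed safeguard for zero counts."""
    return log2((rpkm_a + pseudocount) / (rpkm_b + pseudocount))

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for four genes.
ddct = [1.8, 0.4, -0.9, -2.1]
rpkm_708 = [40, 14, 6, 2]
rpkm_773 = [10, 10, 12, 11]
ratios = [log2_ratio(a, b) for a, b in zip(rpkm_708, rpkm_773)]
r = pearson_r(ddct, ratios)
print(f"R = {r:.2f}")
```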
Transcript profiling of stems of alfalfa genotypes 708 and
773
For transcript profiling of stems of alfalfa genotypes 708
and 773, we analyzed the RPKM-normalized digital gene
expression counts for each sequence in MSGI 1.0 for
cDNA libraries derived from ES and PES internodes of
each genotype (Additional file 11). Among the 124,025
sequences in MSGI 1.0, about 94.7% were transcription-
ally active (RPKM > 0) in at least one library while
about 5.3% (6,629 sequences) were silent in all four
libraries examined (RPKM = 0 in all 4 libraries)
(Additional file 11).
Among the transcriptionally-active genes in each
library, we identified the top 500 most abundant transcripts
(Additional file 12). Fisher's exact test with
Bonferroni correction (z-value cutoff = 1) revealed that
genes belonging to photosynthesis, amino acid metabolism
and transport classes were significantly over-represented
among the most abundantly expressed transcripts
in all 4 libraries, which suggests roles as housekeeping
genes in alfalfa stems (Additional file 13). We also identi-
fied functional classes over-represented among the most
abundant genes expressed in a genotype- or tissue-speci-
fic manner suggesting their role in determining genotype
or tissue identity (Additional file 13). Interestingly, genes
involved in lignin biosynthesis were signif icantly over-
repres ented among the most abundant genes. The lignin
genes over-represented in one or more libraries include
CCoAOMT (caffeoyl-CoA O-methyltransferase), CCR1
(cinnamoyl-CoA reductase1) and COMT (caffeic acid O-
methyltransferase) genes (Additional file 13). On the
other hand, the transcription factor family class was significantly
under-represented among the most abundant
transcripts in three libraries (Additional file 13). Table 2
shows the top 10 most abundant protein-coding transcripts
identified in each alfalfa stem internode library.
Interestingly, a putative COMT gene (MSGI1_1270) was
among the top 10 most abundant protein-coding transcripts
and it was up-regulated in 773 (high lignin genotype)
in both ES and PES internodes compared to 708
(low lignin genotype). The promoters of these highly
expressed genes, including strong constitutive and tissue-specific
promoters, may be useful for transgenic studies
in alfalfa.
We also identified putative housekeeping genes (HKGs)
that showed little variation in expression but were
expressed at relatively high levels. To identify HKGs, we
first selected genes with an average RPKM-normalized
transcript count greater than 10. Next, we selected the
top 300 genes with the lowest coefficient of variation
(CV = standard deviation/mean) (Additional file 14)
[13]. These HKGs may be useful as reference genes in
qRT-PCR or other experiments to normalize gene
expression levels across different conditions [51].
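The two-step HKG selection described above (mean RPKM filter, then ranking by CV) can be sketched as follows. This is an illustration under stated assumptions: the function name, the input layout (a dictionary of per-library RPKM values) and the use of the sample standard deviation are not specified in the text, and the gene IDs are hypothetical.

```python
import statistics

def select_housekeeping(rpkm_by_gene, min_mean=10.0, top_n=300):
    """Return up to top_n gene IDs with the lowest CV among genes whose
    mean RPKM across libraries exceeds min_mean.

    rpkm_by_gene maps a gene ID to its list of RPKM values, one per library.
    """
    scored = []
    for gene, values in rpkm_by_gene.items():
        mean = statistics.mean(values)
        if mean > min_mean:
            # CV = standard deviation / mean; sample SD is an assumption here.
            cv = statistics.stdev(values) / mean
            scored.append((cv, gene))
    scored.sort()
    return [gene for _cv, gene in scored[:top_n]]

# Hypothetical RPKM values in four libraries (708_ES, 708_PES, 773_ES, 773_PES).
example = {
    "geneA": [100, 102, 98, 101],   # high and stable
    "geneB": [5, 6, 5, 6],          # stable but below the mean cutoff
    "geneC": [50, 500, 20, 300],    # high but variable
    "geneD": [40, 42, 38, 41],      # moderately high and stable
}
stable = select_housekeeping(example, top_n=2)
```

Filtering on mean expression first keeps weakly expressed genes, whose CV is dominated by counting noise, from entering the ranking.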
Identification of differentially expressed genes
We used the MA-plot-based method with random sampling
model in the DEGSeq program [52] to identify genes
differentially expressed between stems of alfalfa genotypes
708 and 773. A total of 3,838 and 4,428 genes
were differentially expressed between ES and PES tissues
of genotypes 708 and 773, respectively (p < 0.001, FDR
< 0.025, 2-fold difference) (Additional files 15 and 16).
[Figure 4 plots not reproduced. Panels A and B: Log2(PES/ES) RNA-Seq
(x-axis) versus Log2(PES/ES) GeneChip (y-axis); (A) MSGI1.0, R = 0.89
(1,254 genes); (B) Mt3.0 cds, R = 0.87 (337 genes). Panel C:
Log2(708ES/773ES) RNA-Seq (x-axis) versus ΔCt(773ES)-ΔCt(708ES)
qRT-PCR (y-axis); MSGI1.0, R = 0.85; Mt3.0 cds, R = 0.63. Panel D:
Log2(708PES/773PES) RNA-Seq (x-axis) versus ΔCt(773PES)-ΔCt(708PES)
qRT-PCR (y-axis); MSGI1.0, R = 0.76; Mt3.0 cds, R = 0.45.]
Figure 4 Comparison of MSGI 1.0 and Mt3.0 cds as reference sequences
for digital gene expression analysis. For a subset of genes
involved in stem development independent of genotypic variation,
Log2(PES/ES) values from the RNA-Seq data (x-axis) generated using (A) the
Medicago sativa Gene Index (MSGI1.0) or (B) the Medicago truncatula
coding sequences (Mt3.0 cds) as reference sequences were plotted against
Log2(PES/ES) values from the GeneChip data (y-axis) previously
generated [25]. For 63 randomly selected genes (C) and 34 selected cell wall
genes (D), Log ratio values from the RNA-Seq data (x-axis) generated
using MSGI1.0 (O) and Mt3.0 cds (Δ) as reference sequences were plotted
against ΔΔCT values obtained from the qRT-PCR data (y-axis).
Among the genes that were differentially expressed
between ES and PES internodes, 849 genes were
detected in internodes of both genotypes. In addition, a
total of 8,883 and 4,799 genes were differentially
expressed between genotypes 708 and 773 within ES
and PES internodes, respectively (p < 0.001, FDR <
0.025, 2-fold difference) (Additional files 17 and 18).
Of the genes that were differentially expressed between
the two genotypes, 2,422 were detected in both ES and
PES internodes. Among the 13,797 differentially
expressed genes identified in four pair-wise comparisons
of ES and PES internodes of the two genotypes, about
85% were ubiquitously expressed in all four libraries
(RPKM-normalized transcript count > 0 in all 4
libraries), about 5.5% were expressed in three libraries,
about 9.6% were expressed in two libraries, and 16
genes were expressed in only one library (Additional file
19). These results suggest that stem tissue internodes in
alfalfa may be characterized on the basis of differential
expression of ubiquitous genes or tissue/genotype-specific
expression of selected genes as shown in previous
studies with other species [12,13,40]. SNPs were
detected in 700 differentially expressed genes. Interestingly,
about 14% of these SNP-harboring differentially
expressed genes were cell wall-related genes.
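The counts reported for the four pair-wise comparisons can be cross-checked with inclusion-exclusion. The sketch below uses only numbers stated in the text; the variable names are illustrative, and the last quantity is an inference that holds only if the reported total of 13,797 is the union of all four gene lists.

```python
# Genes differentially expressed between ES and PES (tissue comparisons).
tissue_708, tissue_773, tissue_both = 3838, 4428, 849
# Genes differentially expressed between 708 and 773 (genotype comparisons).
genotype_es, genotype_pes, genotype_both = 8883, 4799, 2422
# Total reported across the four pair-wise comparisons.
total_reported = 13797

# Union of the two tissue comparisons and of the two genotype comparisons.
tissue_union = tissue_708 + tissue_773 - tissue_both            # 7417
genotype_union = genotype_es + genotype_pes - genotype_both     # 11260

# Genes found in at least one tissue comparison AND at least one genotype
# comparison, assuming total_reported is the union of all four lists.
shared = tissue_union + genotype_union - total_reported         # 4880
```

Under that assumption, roughly a third of the differentially expressed genes respond to both developmental stage and genotype.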
To illustrate the differential expression of genes
detected in the stem internodes of 708 and 773, we gen-
erated a heatmap of RPKM-normalized transcript counts
for the top 200 most differentially expressed genes in
each pair-wise comparison (Figure 5, Additional file 20).
Groups I and III in Figure 5 contain genes that were dif-
ferentially expressed in a tissue-specific manner which
suggests their role in alfalfa stem development. For
example, one expansin and four pectin esterase genes
included in group I were up-regulated in ES compared
to PES internodes in both genotypes. These genes are
involved in cell wall loosening and cell elongation
[53,54]. On the other hand, a putative alfalfa cellulose
synthase gene, IRREGULAR XYLEM 3 (IRX3), included
in group III (Figure 5) was up-regulated in PES internodes
compared to ES in both genotypes. Several previous
studies demonstrated xylem-specific expression of
IRX3 and its role in secondary cell wall development in
Arabidopsis [55-57]. Groups II and IV in Figure 5 contain
genes differentially expressed in a genotype-specific
manner, suggesting possible roles in the genotypic variation
between stems of 708 and 773. For example, two
extensin genes and a cellulose synthase gene (CESA4)
included in group II were up-regulated in genotype 708
compared to 773 in both ES and PES internodes. These
Table 2 Top 10 most abundant protein-coding transcripts identified in each alfalfa stem internode library
Values are RPKM-normalized expression counts in each library.

Unique_ID     708_ES      708_PES     773_ES      773_PES     Putative function
MSGI1_2417    6068 (1)    6754 (1)    3271 (2)    6034 (1)    Leucine-rich repeat family protein
MSGI1_8746    4812 (2)    4697 (3)    3850 (1)    3586 (2)    Chlorophyll a/b binding protein
MSGI1_523     4213 (3)    5428 (2)    2719 (6)    3552 (3)    Beta ketoacyl CoA synthase
MSGI1_18145   3859 (4)    979         2555 (8)    3400 (4)    Rubisco small chain
MSGI1_27309   748         2453 (7)    1317        3336 (5)    Metallothionein
MSGI1_11989   2315 (6)    2574 (5)    1171        2387 (6)    Uncharacterized protein
MSGI1_6529    265         328         2393 (9)    2350 (7)    Glycine rich protein
MSGI1_1166    1458        1486        901         2182 (8)    AAA ATPase
MSGI1_62398   160         225         1267        2168 (9)    Stress (ABA)-inducible protein
MSGI1_21335   2012 (8)    2465 (6)    1275        2155 (10)   Cytochrome P450-like
MSGI1_8707    2762 (5)    2632 (4)    1854        1833        Chlorophyll a/b binding protein
MSGI1_4749    1425        1682 (10)   1298        1693        Polyubiquitin
MSGI1_5229    2287 (7)    2140 (8)    1507        1561        Chlorophyll a/b binding protein
MSGI1_1270    744         969         3145 (3)    1470        Caffeic acid O-methyltransferase
MSGI1_1415    1723 (9)    1215        764         1468        Elongation factor 1-alpha
MSGI1_36219   1633        1777 (9)    723         1256        Uncharacterized protein
MSGI1_5153    1705 (10)   1531        1058        1090        Chlorophyll a/b binding protein
MSGI1_29285   324         513         2861 (4)    1035        Stress (ABA)-inducible protein
MSGI1_13276   86          77          2750 (5)    423         Cold acclimation responsive protein
MSGI1_7576    274         114         2357 (10)   284         Cold-acclimation-specific protein (CAS)
MSGI1_96533   11          5           2580 (7)    16          Cold acclimation-specific protein (CAS)

Values followed by a number in parentheses are among the top 10 most abundant
protein-coding transcripts in that library (highlighted in bold in the original
table). Numbers enclosed in parentheses represent rank based on transcript
frequency for the top 10 most abundant protein-coding transcripts in each library.
genes may be responsible for the higher cellulose con-
tent in stem internodes of genotype 708 compared to
773. Group V in Figure 5 contains genes differentially
expressed in both a genotype- and tissue-specific
manner.
Lignin content in alfalfa stems affects the quality of
alfalfa as a forage crop and biomass feedstock. Lignin is
indigestible and reduces cell wall digestibility in
ruminants [58-60]. In addition, the pre-treatment process
to remove lignin is one of the most costly steps of cellulosic
ethanol production [61-64]. Over multiple
environments, alfalfa genotype 773 consistently showed
higher cell wall lignin content in stems compared to
genotype 708 (Figure 1), suggesting differences in the
genetics of lignin biosynthesis. In an effort to identify
key genes responsible for differences in cell wall properties
in stems of genotypes 708 and 773, we identified lignin
(phenylpropanoid) pathway genes among the 13,797
differentially expressed genes detected (Additional file 21).
Next, we generated a heatmap of gene expression ratios
for each selected lignin pathway gene for each pair-wise
comparison (see Methods for details). The heatmaps generated were
inserted into the lignin biosynthetic pathway (Figure 6).
As expected, numerous lignin pathway genes were up-regulated
in PES compared to ES internodes (Figure 6,
Additional file 21). We also identified lignin genes differentially
expressed between the two alfalfa genotypes.
For example, several CAD and COMT genes were up-regulated
in 773 compared to 708, especially in ES internodes
(Figure 6, Additional file 21). These genes may
contribute to the difference in lignin content in cell walls of
stems of genotypes 708 and 773.
A previous study [25] and the current study both suggest
significant genotypic variation for gene expression
in alfalfa stem internodes. To identify genes involved in
general stem development (ES vs. PES internodes) independent
of genotypic variation in gene expression, we
selected a subset of alfalfa genes differentially expressed
between ES and PES internodes in both genotype 708
and genotype 773 (p < 0.001, FDR < 0.025, 2-fold difference).
A total of 594 genes were identified by further
selecting genes with similar differential expression patterns
in both genotypes [Log2(PES/ES) ≥ 1 or ≤ -1 in
both genotypes] (Additional file 22). Among these
genes, about 19% were cell wall-related genes. These
genes included 5 cellulose synthase genes (a putative
IRX3, two CesA8s, and two COBRAs) and six lignin
pathway genes (three 4CLs and three F5Hs) that were
up-regulated in PES compared to ES internodes in both
genotypes (Additional file 22). In Arabidopsis, IRX3,
CesA8 (IRX1) and COBRA genes are involved in cellulose
biosynthesis during secondary cell wall development
[46,55-57,65,66]. The gene families that were significantly
over-represented among genes up-regulated in
PES compared to ES internodes in both genotypes (Fisher's
exact test with Bonferroni correction with z-value
cutoff of 1) included arabinogalactan protein (AGP),
arginosuccinate synthase, metal handling, and transporter
(sucrose, amino acids, and phosphate) families
(Additional file 23). The gene families significantly over-represented
among genes up-regulated in ES compared
to PES internodes in both genotypes included invertase,
[Figure 5 heatmap not reproduced. Columns: 708 ES, 708 PES, 773 ES,
773 PES; gene groups I to V; color scale 0 to 45.]
Figure 5 Hierarchical clustering analysis of the top 200 most
differentially expressed genes selected from pair-wise
comparisons. Pair-wise comparisons of gene expression were made
between stem tissues (ES, PES) in alfalfa genotypes 708 and 773.
The RPKM-normalized expression counts for each gene in each
library are represented by intensity of the red color on a 0 to 45
scale. Dark red (scale intensity 45) indicates genes with
RPKM-normalized expression counts ≥ 45. See Methods for details. Groups
I and III, genes differentially expressed in a tissue-specific manner;
Groups II and IV, genes differentially expressed in a
genotype-specific manner; and Group V, genes differentially expressed in both
a genotype- and tissue-specific manner. A complete list of the
genes, RPKM-normalized expression counts, and corresponding
MapMan functional categories are provided in Additional file 20.
pectin esterase, simple phenol, gibberellin-responsive,
cold-responsive, lipid transfer protein (LTP), and GDSL-motif
lipase families (Additional file 23). Cell wall family
genes were over-represented among genes up-regulated
in both ES and PES internodes.
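The selection of stem-development genes described in this paragraph keeps only genes whose Log2(PES/ES) ratio is at least 1 or at most -1 in both genotypes. A minimal sketch of such a filter is given below. It assumes that "similar differential expression patterns" means the same direction of change in both genotypes; the function name, input dictionaries and gene IDs are hypothetical.

```python
def consistent_stem_genes(log2_708, log2_773, threshold=1.0):
    """Return sorted gene IDs with |Log2(PES/ES)| >= threshold and the
    same direction of change in genotypes 708 and 773.

    Each argument maps a gene ID to its Log2(PES/ES) ratio in one genotype.
    """
    selected = []
    for gene in sorted(set(log2_708) & set(log2_773)):
        a, b = log2_708[gene], log2_773[gene]
        up_in_both = a >= threshold and b >= threshold
        down_in_both = a <= -threshold and b <= -threshold
        if up_in_both or down_in_both:
            selected.append(gene)
    return selected

# Hypothetical Log2(PES/ES) ratios.
ratios_708 = {"g1": 2.3, "g2": -1.5, "g3": 1.2, "g4": 0.4}
ratios_773 = {"g1": 1.1, "g2": -3.0, "g3": -1.4, "g4": 2.0}
stem_genes = consistent_stem_genes(ratios_708, ratios_773)
```

In this toy input, g3 changes by more than 2-fold in both genotypes but in opposite directions, so it is excluded along with g4.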
Assimilated photosynthetic carbon is translocated primarily
as sucrose in higher plants [67]. Membrane-bound,
energy-dependent, H+-symporting sucrose transporters
(SUC or SUT proteins) play an essential role in
sucrose uptake in sink tissues and sucrose release in
source tissues [67]. In this study, members of the
sucrose transporter gene family were over-represented
among genes up-regulated in PES compared to ES internodes
in both genotypes (Additional file 23). Previous
studies showed that the expression of sucrose transporter
genes was developmentally regulated in plants
[68-73]. For example, sucrose transporter genes were
up-regulated during secondary cell wall synthesis in
developing cotton fibers [73]. In this study, we identified
five putative sucrose transporters (MsSUCs) that were
up-regulated in PES compared to ES internodes in both
genotypes (Additional file 22, Additional file 24). As
stem development progresses from ES to PES, sink
strength may also increase due to secondary cell wall
formation in secondary xylem. The up-regulation of
MsSUCs in PES internodes may be in response to
increased demand for sucrose and UDP-glucose to support
cellulose synthesis during secondary cell wall formation.
Consistent with this explanation is our finding
that three sucrose synthase (MsSuSy) genes were up-regulated
in PES compared to ES internodes in both
genotypes. Sucrose synthase provides the UDP-glucose
needed for cellulose synthesis [74,75]. In addition to
their roles in providing sucrose and UDP-glucose for
cellulose synthesis in secondary cell walls, MsSUCs and
MsSuSy genes, respectively, may play important roles in
modulating sugar sensing and signal transduction pathways
during stem development in alfalfa [76].
[Figure 6 pathway diagram not reproduced. Heatmaps are placed next to
the PAL, C4H, 4CL, HCT, C3H, CCoAOMT, CCR, F5H, COMT and CAD steps;
color scale from -3.0 to +3.0.]
Figure 6 Lignin pathway genes differentially expressed in stem tissues
of two alfalfa genotypes. Pair-wise comparisons were made
between stem tissues (ES, PES) of genotypes 708 and 773. Columns in
each heatmap from left to right: Log2(708ES/773ES), Log2(708PES/773PES),
Log2(708PES/708ES), and Log2(773PES/773ES). The rows in each heatmap
represent lignin gene sequences identified in MSGI 1.0. The Log2
expression ratio values were false color-coded using a scale of -3 to 3.
The intensity of blue and red indicates the degree of up- and
down-regulation of the corresponding lignin gene in the denominator in each
column mentioned above. The red and blue color saturates at -3 and 3,
respectively. See Methods for details. The heatmaps generated were
inserted next to the corresponding lignin gene in the lignin biosynthetic
pathway diagram downloaded from the KEGG pathway database
http://www.genome.jp/kegg/pathway/map/map00940.html. PAL, phenylalanine
ammonia-lyase; C4H, cinnamate-4-hydroxylase; 4CL, 4-coumarate-CoA ligase;
HCT, hydroxycinnamoyltransferase; C3H, p-coumarate 3-hydroxylase;
CCoAOMT, caffeoyl-CoA 3-O-methyltransferase; CCR1, cinnamoyl-CoA
reductase 1; F5H, ferulate 5-hydroxylase; COMT, caffeic acid
O-methyltransferase; CAD, cinnamyl-alcohol dehydrogenase.
In addition to the SUC transporter gene family, we
also found that the phosphate (Pi) transporter gene
family was over-represented among genes up-regulated
in PES compared to ES internodes in both genotypes
(Additional file 23). We identified six putative PHOSPHATE1
(PHO1) genes up-regulated in PES compared
to ES internodes in both genotypes (Additional file 22,
Additional file 24). In Arabidopsis root epidermal and
cortical cells, PHO1 is involved in Pi loading into the
xylem [77,78]. A recessive mutation in PHO1 in Arabidopsis
resulted in reduced Pi loading into xylem [77,78].
PHO1 is expressed predominantly in roots and up-regulated
under conditions of Pi starvation [78-80]. A recent
study in Arabidopsis showed that the expression of
PHO1 was modulated by WRKY6 and WRKY42 transcription
factors in response to low Pi [81]. Up-regulation
of PHO1 genes in PES may be needed to meet the
requirements of Pi uptake and redistribution during cellulose
synthesis in secondary cell walls. For example, the
fructose released by SuSy (sucrose + UDP -> UDP-glucose +
fructose) needs to be phosphorylated to be recycled by
sucrose phosphate synthase (SPS).
The plant hormone auxin is a key regulator of plant
growth and development [82]. In addition to its role in
cell wall loosening and cell elongation [82], auxin also
regulates vascular tissue differentiation and patterning in
plants [82-85], secondary xylem development in trees
[86,87], and fiber development in cotton [88]. Indole-3-acetic
acid (IAA), the major auxin species, is made in
the shoot apex and transported to the root apex [82].
Directional auxin transport is mainly controlled by the
coordinated action of auxin influx (AUX1) and efflux
(PIN) carrier complexes [82]. AUX1, an amino acid permease-like
membrane protein, was originally identified
after screening for auxin-resistant mutants [89]. In Arabidopsis,
AUX1 was preferentially expressed in xylem
compared to phloem and nonvascular tissues of the
root-hypocotyl [90]. Arabidopsis AUX1 mutants showed
a reduction in lateral root formation [91] but enhanced
root generation in shoot regeneration media [92]. In
addition, disruption of polar auxin transport in Arabidopsis
resulted in ectopic vascular differentiation in
leaves [93]. Polarized auxin transport is essential for
providing directional and positional signals for various
developmental processes such as apical dominance,
organ development, tropic growth, embryogenesis and
vascular development [82-85,94-98]. In this study, the
amino acid transporter gene families, which include
AUX1 genes, were over-represented among genes up-regulated
in PES compared to ES internodes in both
genotypes (Additional file 23). A total of 5 putative
AUX1 genes were up-regulated in PES (Additional file
22, Additional file 24). The up-regulation of AUX1 in
PES internodes of alfalfa and the resultant increase in
auxin uptake may play an important role in the formation
of secondary xylem. A recent study in trees suggested
that the radial auxin concentration gradient in
cell types of secondary xylem modulates the expression
of a small number of key genes that regulate secondary
xylem development [87].
In addition to transporter family genes that were differentially
expressed between ES and PES internodes of
both genotypes, we also identified transporter family
genes that were differentially expressed between genotypes.
For example, several sugar (glucose, hexose, and
sucrose) transporters and AUX1 genes were up-regulated
in 708 compared to 773 in both ES and PES internodes
(Additional file 24, Additional file 25). These
transporters may play a role in the higher cellulose and
sugar (galactose and mannose) content in stem internodes
of genotype 708 compared to 773 (Table 1). We
also identified numerous transporter families that were
up-regulated in both ES and PES internodes of 773
compared to 708. Among these up-regulated transporter
families were the multi-drug toxic efflux carrier (MATE)
and ATP-binding cassette (ABC) transporter families
(Additional file 24, Additional file 25). Recent studies
suggest that monolignols synthesized in the cytoplasm
are transported across the plasma membrane into the
cell wall matrix where they are polymerized into lignin
[99,100]. However, little is known about the transport
mechanism. Previous studies have suggested that monolignol
transport across the plasma membrane may
involve passive diffusion [101] or may be mediated by
membrane-bound transporters [102]. Genes in the
MATE transporter family may be good candidates for
monolignol transporters because they are involved in
transport of proanthocyanidin precursors across the
tonoplast in Arabidopsis and M. truncatula [103,104]. A
role for ABC transporters in monolignol transport
across the plasma membrane has been postulated
because of their known role in transporting various secondary
metabolites in plants [99,100,105,106]. Additional
research will be required to determine whether
the up-regulation of the MATE efflux carrier and ABC
transporter families in stems of 773 (high lignin) compared
to 708 (low lignin) (Additional file 24, Additional
file 25) contributes to the higher lignin content in cell
walls of 773 (Table 1). The up-regulated MATE efflux
carrier and ABC transport genes that we identified provide
a list of candidate genes that will be useful in future
research to evaluate the involvement of these gene
families in monolignol transport.
Conclusion
This study represents the first application of RNA-Seq
technology for genomic studies in alfalfa. Our results
demonstrate that RNA-Seq can be successfully used for
gene identification, polymorphism detection and transcript
profiling in alfalfa. Using RNA-Seq has several
advantages over other technologies, especially for non-model
species with few genomic resources such as
alfalfa. Unlike hybridization-based technologies such as
microarrays, RNA-Seq does not require pre-existing
sequence information and, as shown in this study, RNA-Seq
can integrate multiple tasks in a single pipeline, saving
time and money. The integrated approach used in
this study can be applied to other non-model species.
The newly built alfalfa gene index (MSGI 1.0), and the
SNPs, SSRs and candidate genes identified in this study
will be a valuable resource for advancing genetic/geno-
mic research in alfalfa and eventually for improving
alfalfa as a forage crop and cellulosic ethanol feedstock.
Methods
Plant materials and cell wall analysis
Alfalfa [Medicago sativa (L.) subsp. sativa] genotypes 708
and 773 were selected from a population (UMN 3097)
created by mixing seeds from six commercial alfalfa cultivars
(5312, Rushmore, Magnagraze, Wintergreen, Windstar
and WL 325HQ) as previously described [25]. The
alfalfa clonal lines 708 and 773 were propagated from
cuttings and grown in the greenhouse. The greenhouse
experiments consisted of three replicates arranged in a
randomized complete block design. For each replicate,
there were eight plants of each clone in individual pots.
For cell wall analysis, stem internode tissues were harvested
at full bloom and plant material for analysis was
composited within each replicate (2 blocks × 3 reps = 6
data points per genotype). Cell wall analysis was performed
in duplicate as previously described [25]. An analysis
of variance was done to test if the means (g kg^-1 cell
wall) for cell wall components of the two genotypes were
equal (Table 1). For RNA-Seq, ES and PES internodes
were harvested as previously described [25].
RNA extraction, cDNA library preparation and sequencing
Total RNA was purified from three replicates of elongating
and post-elongation stem internodes of genotypes
708 and 773 using the CTAB-based protocol previously
described [40]. Contaminating genomic DNA was
removed from each RNA sample using the DNA-free
kit following the manufacturer's recommendations
http://www.ambion.com. An equal amount of total RNA
was pooled from each replicate for each stem tissue
sample. RNA samples were quantified using Quant-iT
RiboGreen® RNA Reagent http://www.invitrogen.com
and the RNA integrity was checked with the RNA6000
Nano Assay using the Agilent 2100 Bioanalyzer (Agilent
Technologies, Palo Alto, CA). cDNA library preparation
and sequencing reactions were conducted in the
Biomedical Genomics Center, University of Minnesota.
Illumina library prep, clustering and sequencing reagents
were used throughout the process following the manufacturer's
recommendations http://www.illumina.com.
Briefly, mRNAs were purified using poly-T oligo-attached
magnetic beads and then fragmented. The first and the
second strand cDNAs were synthesized and end repaired.
Adaptors were ligated after adenylation at the 3' ends.
After gel purification, cDNA templates were enriched by
PCR. cDNA libraries were validated using a High Sensitivity
Chip on the Agilent 2100 Bioanalyzer (Agilent
Technologies, Palo Alto, CA). The cDNA library was
quantified using PicoGreen Assay and by qPCR. The
samples were clustered on a flow cell using the cBOT.
After clustering, the samples were loaded on the Illumina
GA-II machine. The samples were sequenced using a single
read with 76 cycles. Initial base calling and quality filtering
of the Illumina GA-II image data were performed
using the default parameters of the Illumina GA Pipeline
GERALD stage http://www.illumina.com. Additional filtering
for homopolymers and read size (< 75 bp) was performed
using custom-written code.
For RNA-Seq using the GS FLX Titanium platform
http://www.454.com, mRNA was reverse transcribed
with SuperScript III reverse transcriptase
http://www.invitrogen.com using dT15VN2 primer. cDNA was
synthesized using E. coli DNA Ligase, E. coli DNA polymerase
I and E. coli RNase H. cDNA was then fragmented
by sonication. The cDNA was then used for 454
sstDNA preparation in the GS20 DNA Library Preparation
step 2 http://www.454.com. The rest of the library
preparation and the 454 sequencing procedures were
performed following the manufacturer's recommendations
http://www.454.com. Standard post-run and bioinformatics
processing on the 454 platform to determine
reads that passed various quality filters were also performed
following the manufacturer's recommendations
http://www.454.com.
de novo transcriptome assembly
The Velvet algorithm [32] was used for de novo assembly
of the 198,861,304 Illumina reads (76 bp). During the de
novo assembly using the Velvet program, short EST reads
were first hashed based on a predefined hash length in
base pairs (k-mer length). Next, the contigs were built
based on a series of overlapping k-mers using de Bruijn
graphs [32]. In general, longer k-mers increase transcript
contiguity (longer transcript length) and specificity (fewer
spurious overlaps) but decrease diversity (smaller number
of contigs) compared to shorter k-mers [32]. To optimize
our Velvet assembly toward higher transcript contiguity
and specificity, we tested a series of k-mers (31, 37, 41,
47, 51, 57, 61, 63, 65) for de novo assembly of short EST
reads (Additional file 26). We used the contig N50
(a length-weighted median contig length) generated for
each k-mer as an indicator of
the transcript contiguity of de novo assembly. As k-mer
values increased from 31 to 61, N50 values increased to a
value of 289, reflecting increased efficiency of de novo
assembly. The N50 values declined significantly at k-mer
values above 61 (Additional file 26). On the basis of these
results, we used a k-mer value of 61 for de novo assembly
of alfalfa EST reads.
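The k-mer choice above depends on comparing N50 values across assemblies. The sketch below uses the conventional definition of N50 (the contig length at which contigs of that length or longer contain at least half of all assembled bases), which is an assumption about how the value was computed here. The function names and contig lengths are hypothetical, and Velvet itself is not invoked.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L contain at least half
    of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")

def best_kmer(lengths_by_kmer):
    """Return the k-mer value whose assembly has the largest N50."""
    return max(lengths_by_kmer, key=lambda k: n50(lengths_by_kmer[k]))

# Hypothetical contig lengths from assemblies at three k-mer values.
assemblies = {
    31: [100, 100, 100, 90, 80],
    61: [300, 250, 200, 120],
    65: [220, 150, 150, 100],
}
chosen = best_kmer(assemblies)
```

Because N50 weights contigs by their length, a handful of long contigs can raise it even when most contigs are short, so it is usually read alongside the number of contigs and the total assembled length.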
Alfalfa Gene Index assembly
The alfalfa gene index (MSGI 1.0) was built following
the Gene Index Assembly protocol previously described
[39,40]. The gene ontology (GO) functional classes and
pathways for each sequence in MSGI 1.0 were assigned
based on Arabidopsis GO SLIM and pathway annotation
ftp://ftp.arabidopsis.org/home/tair/Ontologies/. For GO
characterization, the unique sequences in MSGI 1.0
were compared with the Arabidopsis proteome using
the BlastX program with e-value cutoff of 1e-10. Top
protein matches from Arabidopsis sequences were
assigned to each of the MSGI 1.0 sequences. The MapMan
gene functional classification system [41] was
assigned to each sequence in MSGI 1.0 following the
method previously described [31]. The functional class
over-representation analysis was performed using PageMan
[107] as previously described [25,31].
Polymorphism detection
The MISA program [42] was used to detect simple
sequence repeats (SSRs) among sequences in MSGI 1.0.
The minimum number of nucleotide repeats specified
during SSR analysis was 20, 10, 7, 5, 5, and 5 for mono-,
di-, tri-, tetra-, penta-, and hexanucleotide repeats,
respectively. The maximum number of bases interrupting
2 SSRs in a compound microsatellite was set at 100
bp. The primers spanning each SSR were designed using
the default parameters of the Primer3 program [44].
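The repeat thresholds listed above can be illustrated with a small scanner for perfect SSRs. This is a sketch written for illustration, not the MISA implementation: it reports non-overlapping perfect repeats only, does not merge compound microsatellites, and the function names and test sequence are hypothetical.

```python
# Minimum repeat counts per motif length, as stated in the text.
MIN_REPEATS = {1: 20, 2: 10, 3: 7, 4: 5, 5: 5, 6: 5}

def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit
    (for example, 'AA' and 'ATAT' are not primitive)."""
    n = len(motif)
    return not any(n % k == 0 and motif[:k] * (n // k) == motif
                   for k in range(1, n))

def find_ssrs(sequence):
    """Return (start, motif, repeat_count) for perfect, non-overlapping
    repeats that meet MIN_REPEATS; start is 0-based."""
    seq = sequence.upper()
    hits = []
    i = 0
    while i < len(seq):
        hit = None
        for unit, min_rep in MIN_REPEATS.items():
            motif = seq[i:i + unit]
            if len(motif) < unit or set(motif) - set("ACGT"):
                continue
            if not is_primitive(motif):
                continue
            # Count consecutive copies of the motif starting at position i.
            count = 1
            while seq[i + count * unit:i + (count + 1) * unit] == motif:
                count += 1
            if count >= min_rep:
                hit = (i, motif, count)
                break
        if hit:
            hits.append(hit)
            i += len(hit[1]) * hit[2]   # skip past the reported repeat
        else:
            i += 1
    return hits

# Hypothetical test sequence with one mono-, one di- and one trinucleotide SSR.
demo = "GG" + "A" * 20 + "C" + "AG" * 10 + "G" + "CTT" * 7
ssrs = find_ssrs(demo)
```

Requiring primitive motifs prevents a mononucleotide run such as A(20) from also being reported as (AA)10.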
For SNP detection, the Illumina GA-II reads were
mapped to the sequences in MSGI 1.0 using the Maq
program [45]. Next, the coverage and nucleotide differences
were extracted using the pileup command of the
Maq program. The pileup output was further compiled
for genotypes 708 and 773 with a custom-written script
using filtering based on coverage and quality scores.
A custom-written script was used for additional sorting
and filtering of the pileup output based on a nucleotide
depth cutoff of 10 for each SNP.
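The custom filtering scripts are not described in detail, so the following is only a sketch of how a depth cutoff of 10 might be applied when comparing two genotypes. It assumes the pileup output has already been parsed into per-position base counts, and it reduces each genotype to its most frequent base, which is a simplification for a heterozygous species such as alfalfa. All names and data are hypothetical, and the Maq output format is not parsed here.

```python
def major_allele(base_counts):
    """Most frequent base at a position; ties are broken alphabetically."""
    return min(base_counts, key=lambda base: (-base_counts[base], base))

def candidate_snps(counts_708, counts_773, min_depth=10):
    """Return (position, allele_708, allele_773) where both genotypes have
    at least min_depth reads and their major alleles differ.

    Each argument maps a position to a dict of base -> read count.
    """
    snps = []
    for pos in sorted(set(counts_708) & set(counts_773)):
        a, b = counts_708[pos], counts_773[pos]
        if sum(a.values()) >= min_depth and sum(b.values()) >= min_depth:
            allele_a, allele_b = major_allele(a), major_allele(b)
            if allele_a != allele_b:
                snps.append((pos, allele_a, allele_b))
    return snps

# Hypothetical base counts at three positions of one reference sequence.
counts_708 = {5: {"A": 12}, 9: {"C": 4}, 14: {"G": 15}}
counts_773 = {5: {"G": 11, "A": 1}, 9: {"T": 30}, 14: {"G": 20}}
snps = candidate_snps(counts_708, counts_773)
```

Position 9 is dropped because genotype 708 has only 4 reads there, and position 14 is dropped because both genotypes carry the same base.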
Digital gene expression analysis
For digital gene expression analysis, the raw digital gene
expression counts were measured by quantifying the num-
ber of Illumina GA-II reads that were mapped to the refer-
ence sequences (MSGI 1.0 or Mt3.0 cds) using the bowtie
program [47]. The best-match option with a maximum of
3 nucleotide mismatches was used (-v 3 --best). The raw
digital gene expression counts were normalized using the
RPKM (reads/Kb/Million) method [1,48]. Custom written
scripts were used to summarize the bowtie output from
the raw digital expression counts and the RPKM-normal-
ized expression counts. To identify differentially expressed
genes, an expression profile matrix was built representing
the digital gene expression count for each gene in each
library, then imported into the DEGSeq program [52]. The
MA-plot-based method with random sampling model in DEGSeq
was used to identify differentially expressed genes in each
pair-wise comparison (p <
0.001, FDR < 0.025, 2-fold difference). Heatmaps based
on hierarchical cluster analysis [108] of RPKM-normalized
expression counts (Figure 5, Additional file 25) and
expression ratios (Figure 6) were generated using MultiEx-
periment Viewer http://www.tm4.org/mev/.
In a previous study, we generated GeneChip data for
ES and PES internodes of alfalfa genotypes 252 and
1283 [25] http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE13602.
To compare the digital gene expression
data generated using MSGI 1.0 and Mt3.0 cds sequences
with the previously generated GeneChip data [25], we
first compared two Medicago reference sequences
(MSGI 1.0 and Mt3.0 cds) with Medicago GeneChip
probe set consensus sequences using the Blastn program
(e-value cutoff of 1e-10). Top sequence matches from
the Medicago GeneChip probe sets were assigned to
each RNA-Seq reference sequence. Next, we selected
from GeneChip data and RNA-Seq data a subset of
genes involved in general stem development independent
of genotypic variation in gene expression (Log2(PES/ES)
≥ 1 or ≤ -1 in both genotypes). Genes that
were commonly selected between RNA-Seq and GeneChip
data were identified based on sequence homology.
Log2(PES/ES) values from the RNA-Seq data generated
using MSGI 1.0 and Mt3.0 cds as reference sequences
were compared with Log2(PES/ES) values from the GeneChip
data (Figure 4A, 4B).
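The selection of stem development genes can be sketched in Python as follows. The sketch reads the criterion as a log2(PES/ES) of at least 1, or at most -1, in both genotypes and in the same direction; the inclusive thresholds, the same-direction requirement, the handling of zero counts and all names are assumptions made for illustration.

```python
import math

def log2_pes_es(es, pes):
    """log2(PES/ES); None when either value is zero and the ratio is undefined."""
    if es <= 0 or pes <= 0:
        return None
    return math.log2(pes / es)

def select_stem_genes(expression, threshold=1.0):
    """expression maps gene -> {genotype: (ES value, PES value)}.

    Returns {gene: direction} for genes that change at least 2-fold
    in the same direction in every genotype.
    """
    selected = {}
    for gene, by_genotype in expression.items():
        ratios = [log2_pes_es(es, pes) for es, pes in by_genotype.values()]
        if any(r is None for r in ratios):
            continue  # assumption: skip genes with an undefined ratio
        if all(r >= threshold for r in ratios):
            selected[gene] = "up in PES"
        elif all(r <= -threshold for r in ratios):
            selected[gene] = "up in ES"
    return selected
```

A gene that is 4-fold higher in PES in one genotype but unchanged or reversed in the other is not selected, which is what makes the resulting list independent of genotype.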
To compare the digital gene expression data generated
using MSGI 1.0 and Mt3.0 cds sequences with the qRT-
PCR data, we first compared two Medicago reference
sequences (MSGI 1.0 and Mt3.0 cds) using the
Blastn program (e-value cutoff of 1e-10). Top sequence
matches from the Mt3.0 cds were assigned to each
MSGI 1.0 sequence. Primers for qRT-PCR were
designed based on the MSGI 1.0 sequences (Additional
file 10). Log ratio values from the RNA-Seq data generated
using MSGI 1.0 and Mt3.0 cds as reference sequences
were compared with ΔΔCT values obtained from the
qRT-PCR data (Figure 4C, 4D).
SNP genotyping
The SNP genotyping was conducted in the Biomedical
Genomics Center, University of Minnesota. Briefly, a
total of 55 SNPs predicted between genotypes 708 and
773 were randomly selected for validation by MALDI-TOF
mass spectrometry using the iPLEX Gold spectrometry
system http://www.sequenom.com. Genomic
DNAs were purified from young leaves of genotypes 708
and 773 using the DNeasy Plant Mini Kit http://www.qiagen.com.
The multiplex assays were designed using
MassARRAY Assay Design 3.0 software and primers
were obtained from IDT (Coralville, Iowa). Reactions
(PCR, shrimp alkaline phosphatase treatment followed
by extension) were performed according to the iPLEX Gold
method http://www.sequenom.com. MassARRAY Workstation
software (v. 3.3) was used to analyze the SNP
genotyping results.
Real-time quantitative RT-PCR (qRT-PCR)
A portion of the pooled total RNA used for the RNA-
Seq analysis was used to make cDNAs for qRT-PCR.
The first strand cDNA for each sample was made using
random hexamers and TaqMan Reverse Transcription
Reagents (Applied Biosystems, CA) following the manufacturer's
recommendations. Gene specific primers
based on MSGI 1.0 sequences were subsequently
designed using Primer Express (Applied Biosystems,
CA) (Additional file 10). Samples and standards were
run in triplicate on each plate and repeated on two
plates using SYBR-Green PCR Master Mix (Applied Biosystems,
CA) on a StepOnePlus Real-Time PCR System
(Applied Biosystems, CA) following the
manufacturer's recommendations. qRT-PCR was performed
in a 20 μl reaction containing 4 μl ddH2O, 10 μl
PCR mix, 1 μl forward primer (1 μM), 1 μl reverse
primer (1 μM), and 4 μl of template cDNA (5 ng/μl).
The PCR conditions were as follows: two minutes of
pre-incubation at 50 °C, 10 minutes of pre-denaturation
at 95 °C, 40 cycles of 15 seconds at 95 °C and one minute
at 60 °C, followed by steps for dissociation curve generation
(30 seconds at 95 °C, 60 seconds at 60 °C and 30
seconds at 95 °C). The StepOnePlus software (Applied
Biosystems, CA) was used for data collection and analysis.
Dissociation curves for each amplicon were carefully
examined to confirm lack of multiple amplicons at different
melting temperatures (Tms). Relative transcript
levels for each sample were obtained using the comparative
CT method [109] using the CT value of the
18S rRNA for each sample as a normalizer.
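The comparative CT calculation can be sketched in Python as below. The arithmetic is the standard form of the method (ΔCT is the target CT minus the 18S rRNA CT, ΔΔCT is the sample ΔCT minus the calibrator ΔCT, and relative expression is 2 to the power of -ΔΔCT), which assumes roughly 100% amplification efficiency. Which sample serves as the calibrator is not stated in the text, so the calibrator arguments and the function names are assumptions made for illustration.

```python
def delta_ct(ct_target, ct_reference):
    """Normalize the target gene CT to the reference gene (here 18S rRNA)."""
    return ct_target - ct_reference

def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_calibrator, ct_ref_calibrator):
    """Return (ddCT, fold change) of a sample relative to a calibrator sample.

    Assumes about 100% PCR efficiency, i.e. product doubles every cycle.
    """
    ddct = (delta_ct(ct_target_sample, ct_ref_sample)
            - delta_ct(ct_target_calibrator, ct_ref_calibrator))
    return ddct, 2.0 ** (-ddct)
```

With triplicate wells, the CT values passed in would typically be the means of the replicates. A sample with target CT 22 and 18S CT 10, compared with a calibrator at 24 and 10, gives a ΔΔCT of -2 and a 4-fold higher relative transcript level.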
Additional material
Additional file 1: de novo assembly of alfalfa Illumina GA-II EST
reads. A fasta file containing a total of 132,153 unique sequences
generated after de novo assembly of Illumina GA-II EST reads derived
from 4 cDNA libraries developed in this study. The Velvet program [32]
with k-mer 61 was used for de novo assembly.
Additional file 2: Alfalfa Gene Index 1.0 (MSGI 1.0). A fasta file
containing Alfalfa Gene Index 1.0 (MSGI 1.0) sequences. MSGI 1.0
contains a total of 124,025 unique sequences including 22,729 tentative
consensus sequences (TCs), 22,315 singletons and 78,981 pseudo-
singletons. The average length of the unique sequences in MSGI 1.0 is
384 bp (100 bp minimum and 6,956 bp maximum) with more than
10,000 sequences larger than 800 bp. The total base count of the
sequences in MSGI 1.0 is 47,628,953 bp. Unfortunately, the current pipeline
of the DFCI gene index database http://compbio.dfci.harvard.edu/tgi/
is not suited for short reads (personal communication with a DFCI Gene
Index staff member). The Gene Index Project team has indicated that it plans to
address this issue soon. When a gene index database is established for
alfalfa, MSGI 1.0 will be uploaded to the DFCI gene index database.
Additional file 3: Functional classification and annotation of
sequences in the Alfalfa Gene Index 1.0 (MSGI 1.0). A table listing
Gene ontology (GO), pathway, MapMan functional classes and gene
annotation for sequences in the Alfalfa Gene Index 1.0 (MSGI 1.0).
Additional file 4: Simple sequence repeats (SSRs) detected in MSGI
1.0. A table listing SSR-containing sequence IDs, SSR types and position,
and primers spanning each SSR for the sequences in the Alfalfa Gene
Index 1.0 (MSGI 1.0).
Additional file 5: Single nucleotide polymorphisms (SNPs) predicted
between alfalfa genotypes 708 and 773. A table listing SNPs
predicted between alfalfa genotypes 708 and 773 including SNP-
containing sequence ID, SNP type, SNP position and depth in each
genotype.
Additional file 6: Validation of SNPs predicted between alfalfa
genotypes 708 and 773 using RNA-Seq data. A table showing SNP
validation results. A total of 55 SNPs were randomly selected to
genotype genomic DNAs purified from the genotypes 708 and 773 by
MALDI-TOF mass spectrometry using the iPLEX Gold spectrometry
system http://www.sequenom.com. In addition to genotypes 708 and
773, we also genotyped 51 additional alfalfa (M. sativa) genotypes
selected from different populations of M. sativa ssp. sativa or M. sativa
ssp. falcata.
Additional file 7: Functional classes over- or under-represented
among SNP-harboring genes. A figure showing the functional class
over-representation analysis conducted for SNP-harboring genes.
Functional classes that are over- or under-represented among SNP-
harboring genes were identified using the PageMan over-representation
analysis module. The z-values for significant classes identified after
Fisher's exact test with Bonferroni correction (z-value cutoff of 1) were
false color coded using a scale of -5 to +5. The intensities of blue and red
indicate the degree of over- and under-representation of the
corresponding class, respectively.
Additional file 8: Allelic variations (SNPs) detected within genotype
708. A table listing a total of 287,555 allelic variations (SNPs) detected
within genotype 708 using minimum SNP depth cutoff of 10.
Additional file 9: Allelic variations (SNPs) detected within genotype
773. A table listing a total of 168,966 allelic variations (SNPs) detected
within genotype 773 using minimum SNP depth cutoff of 10.
Additional file 10: qRT-PCR validation of RNA-Seq data generated
by two reference sequences (MSGI 1.0 and Mt3.0 cds). A table
showing the source data used to generate Figure 4. The table contains
MSGI 1.0 and Mt3.0 cds IDs of genes used for qRT-PCR, qRT-PCR and
RNA-Seq data generated by two reference sequences (MSGI 1.0 and
Mt3.0 cds), and primers used for qRT-PCR.
Additional file 11: An expression profile matrix for each library
showing digital gene expression count of each gene in MSGI 1.0. A
table showing the digital gene expression counts of each gene in MSGI
1.0 for ES and PES internodes of alfalfa genotypes 708 and 773. The raw
expression counts generated by bowtie program were normalized using
the RPKM method [1,48].
Additional file 12: Top 500 most abundant transcripts in each
library. A table showing the RPKM-normalized digital gene expression
counts and MapMan functional classes for the top 500 most abundant
transcripts selected in each library.
Additional file 13: Functional classes over- or under-represented
among the top 500 most abundant transcripts in each library. A
figure showing the results from functional class over-representation
analysis for the top 500 most abundant transcripts in ES and PES
internodes of alfalfa genotypes 708 and 773. For details, see the
description for additional file 7.
Additional file 14: 300 housekeeping genes selected. A table listing
300 housekeeping genes (HKGs) with relatively high levels of expression.
To identify these HKGs, we first selected genes with an average RPKM-
normalized transcript count greater than 10. Next, we selected the top
300 genes from the list with the lowest coefficient of variation (CV =
standard deviation/mean). The RPKM-normalized expression counts,
MapMan functional class and description for each HKG selected are also
presented in the table.
Additional file 15: Genes differentially expressed between ES and
PES internodes of alfalfa genotype 708. A table listing 3,838 genes
differentially expressed between ES and PES internodes of alfalfa
genotype 708 in MSGI 1.0. We used a MA-plot-based method with
random sampling model in a DEGSeq program to select these genes (p-
value < 0.001, FDR < 0.025, 2-fold difference). RPKM-normalized
expression counts, log ratios, z-scores, p-values, and q-values for each
gene selected are also presented in the table.
Additional file 16: Genes differentially expressed between ES and
PES internodes of alfalfa genotype 773. A table listing 4,428 genes
differentially expressed between ES and PES internodes of alfalfa
genotype 773 in MSGI 1.0. For details, see the description for additional
file 11.
Additional file 17: Genes differentially expressed between alfalfa
genotypes 708 and 773 in ES internodes. A table listing 8,883 genes
differentially expressed between alfalfa genotypes 708 and 773 in ES
internodes in MSGI 1.0. For details, see the description for additional file
11.
Additional file 18: Genes differentially expressed between alfalfa
genotypes 708 and 773 in PES internodes. A table listing 4,799 genes
differentially expressed between alfalfa genotypes 708 and 773 in PES
internodes in MSGI 1.0. For details, see the description for additional file
11.
Additional file 19: Genes differentially expressed in ES and PES
internodes of alfalfa genotypes 708 and 773. A table listing 13,797
genes differentially expressed in ES and PES internodes of alfalfa
genotypes 708 and 773 in MSGI 1.0. Genes selected in additional files 15,
16, 17 and 18 were combined together to produce this table. The RPKM-
normalized expression counts, MapMan functional class and description
for each gene selected are also presented in the table.
Additional file 20: Top 200 most differentially expressed genes in
each pair-wise comparison. A table that lists 657 genes that were
generated after combining the top 200 most differentially expressed
genes selected in each pair-wise comparison of gene expression
between ES and PES internodes of genotypes 708 and 773. This table is
a data source for Figure 5. The RPKM-normalized expression counts,
MapMan functional class and description for each gene selected are also
presented in the table.
Additional file 21: Phenylpropanoid (lignin) pathway genes
differentially expressed in ES and PES internodes of alfalfa
genotypes 708 and 773. A table listing phenylpropanoid (lignin)
pathway genes differentially expressed in ES and PES internodes of alfalfa
genotypes 708 and 773 (p-value < 0.001, FDR < 0.025, 2-fold difference).
This table is a data source for Figure 6. The log ratios from each pair-
wise comparison, EC number, and enzyme ID for each gene selected are
also presented in the table.
Additional file 22: Candidate genes identified in 708 and 773 that
may be involved in general stem development independent of
genotypic variation in gene expression. A table listing 594 genes
potentially involved in general stem development independent of
genotypic variation in gene expression in alfalfa (Log2(PES/ES) ≥ 1 or ≤ -1
in both genotypes 708 and 773). The RPKM-normalized expression
counts, log ratios, MapMan functional class and description for each
gene selected are also presented in the table.
Additional file 23: Functional classes over- or under-represented
among genes involved in general stem development independent
of genotypic variation in alfalfa. A figure showing the functional class
over-representation analysis for genes involved in general stem
development independent of genotypic variation in alfalfa (Log2(PES/
ES) ≥ 1 or ≤ -1 in both genotypes 708 and 773). Up in PES and Up in ES
indicate genes up-regulated in PES and ES internodes in both genotypes,
respectively. For details, see the description for additional file 7.
Additional file 24: Putative transporter genes differentially
expressed in ES and PES internodes of alfalfa genotypes 708 and
773. A table listing 478 transporter genes in ES and PES internodes of
alfalfa genotypes 708 and 773 in MSGI 1.0. The RPKM-normalized
expression counts, log ratios from each pair-wise comparison, MapMan
functional class and description for each transporter gene selected are
also presented in the table.
Additional file 25: Hierarchical clustering analysis of selected
transporter genes differentially expressed between 708 and 773 in
both ES and PES internodes. A figure showing a heatmap for 42
transporter genes differentially expressed between 708 and 773 in both
ES and PES internodes (p < 0.001, FDR < 0.025, 2-fold difference). The
RPKM-normalized expression counts for each gene in each library are
represented by the intensity of the red color on a 0 to 22 scale. Dark red
(scale intensity 22) indicates genes with RPKM-normalized expression
counts ≥ 22. See Methods for details. A complete list of the transporter
genes selected, RPKM-normalized expression counts, and corresponding
MapMan functional categories are provided in Additional file 24.
Additional file 26: Optimization of de novo assembly of Illumina
GA-II EST reads with a series of k-mers using the Velvet program
[32]. A figure showing the median sequence length of the contigs (y-
axis) for a series of k-mers (31, 37, 41, 47, 51, 57, 61, 63, 65) tested using
the Velvet program. k-mer 61 produced the longest median sequence
length.
Abbreviations
ES: elongating stem; PES: post-elongation stem; SNP: single nucleotide
polymorphism; EST: expressed sequence tag; GO: gene ontology; cds: coding
sequence; SSR: simple sequence repeat; RPKM: reads/Kb/Million; qRT-PCR:
real-time quantitative RT-PCR; HKG: housekeeping gene; CV: coefficient of
variation; CesA: cellulose synthase; PAL: phenylalanine ammonia-lyase; C4H:
cinnamate-4-hydroxylase; 4CL: 4-coumarate-CoA ligase; HCT:
hydroxycinnamoyl transferase; C3H: p-coumarate 3-hydroxylase; CCoAOMT:
caffeoyl-CoA 3-O-methyltransferase; CCR1: cinnamoyl-CoA reductase 1; F5H:
ferulate 5-hydroxylase; COMT: caffeic acid O-methyltransferase; CAD:
cinnamyl-alcohol dehydrogenase; AGP: arabinogalactan protein; LTP: lipid
transfer protein; LHB1B1: Photosystem II light harvesting complex gene;
RBCS-1A: rubisco small subunit 1; SUC: sucrose transporter; SuSy: sucrose
synthase; PHO1: PHOSPHATE 1; IAA: Indole-3-acetic acid; AUX1: auxin influx
carrier; MATE: multi-drug toxic efflux carrier; ABC: ATP-binding cassette.
Acknowledgements
This work was carried out in part using computing resources at the University
of Minnesota Supercomputing Institute for Advanced Computational Research.
Funding for this research was provided by USDA-ARS CRIS Project 3640-
12210-001-00D. Mention of trade names or commercial products in this
publication is solely for the purpose of providing specific information and
does not imply recommendation or endorsement by the U.S. Department of
Agriculture. We thank Dr. David Garvin, Dr. Jamie O'Rourke, and Dr. Deborah
Samac for critical review of the manuscript.
Author details
1 USDA-Agricultural Research Service, Plant Science Research Unit, St. Paul,
MN, 55108, USA. 2 Supercomputing Institute for Advanced Computational
Research, University of Minnesota, Minneapolis, MN 55455, USA. 3 The J. Craig
Venter Institute, Rockville, MD 20892, USA. 4 Department of Agronomy and
Plant Genetics, University of Minnesota, St. Paul, MN 55108, USA. 5 Center for
Human Immunology, Autoimmunity and Inflammation, National Institute of
Health, Bethesda, MD 20892, USA.