December 15, 2008 11:45 WSPC - Proceedings Trim Size: 9.75in x 6.5in MoustafaChan˙CameraReadyBW
A PHYLOGENOMIC APPROACH FOR STUDYING PLASTID
ENDOSYMBIOSIS
AHMED MOUSTAFA1* CHEONG XIN CHAN2*
ahmed-moustafa@uiowa.edu cx-chan@uiowa.edu
MEGAN DANFORTH2 DAVID ZEAR2 HIBA AHMED2
megan-danforth@uiowa.edu drzear@gmail.com hiba-ahmed@uiowa.edu
NAGNATH JADHAV2 TREVOR SAVAGE2 DEBASHISH BHATTACHARYA1,2
n-jadhav@uiowa.edu trevor-savage@uiowa.edu debashi-bhattacharya@uiowa.edu
*These authors contributed equally to this work.
1Interdisciplinary Genetics Program, University of Iowa, Iowa City, IA 52242, U.S.A.
2Department of Biology and Roy J. Carver Center for Comparative Genomics,
University of Iowa, Iowa City, IA 52242, U.S.A.
Gene transfer is a major contributing factor to functional innovation in genomes. En-
dosymbiotic gene transfer (EGT) is a specific instance of lateral gene transfer (LGT) in
which genetic materials are acquired by the host genome from an endosymbiont that has
been engulfed and retained in the cytoplasm. Here we present a comprehensive approach
for detecting gene transfer within a phylogenetic framework. We applied the approach
to examine EGT of red algal genes into Thalassiosira pseudonana, a free-living diatom
for which a complete genome sequence has recently been determined. Out of 11,390 pre-
dicted protein-coding sequences from the genome of T. pseudonana, 124 (1.1%, clustered
into 80 gene families) are inferred to be of red algal origin (bootstrap support ≥ 75%).
Of these 80 gene families, 22 (27.5%) encode novel, unknown functions. We found 21.3%
of the gene families to putatively encode non-plastid-targeted proteins. Our results sug-
gest that EGT of red algal genes provides a relatively minor contribution to the nuclear
genome of the diatom, but the transferred genes have functions that extend beyond pho-
tosynthesis. This assertion awaits experimental validation. Whereas the current study is
focused within the context of secondary endosymbiosis, our approach can be applied to
large-scale detection of gene transfer in any system.
Keywords: phylogenomics; endosymbiotic gene transfer; lateral gene transfer; plastid;
chromalveolates.
1. Introduction
Lateral gene transfer (LGT) is a phenomenon in which genetic materials are trans-
mitted between non-lineal individuals (e.g., between two different strains or species).
This phenomenon is one of the major mechanisms for functional innovation in the
genomes of prokaryotes [1, 2] and eukaryotes [3, 4], as well as for the acquisition
of new virulence genes in pathogens [5]. Therefore, the elucidation of gene transfer
events will enhance our understanding of how genomes evolve. Here we present a
systematic approach for detecting LGT within the context of plastid endosymbiosis.
Genome Informatics 21: 165-176 (2008)
1.1. Plastid endosymbiosis and gene transfer
The origin and establishment of the photosynthetic organelle (plastid) in algae
and plants are important for understanding biotic evolution because these taxa
form the primary food source for all life on earth. The endosymbiosis hypothesis
postulates that the plastid originated from the ancient engulfment and retention
of a free-living cyanobacterium (the endosymbiont) by a heterotrophic, unicellular
protist. This ancestral photosynthetic eukaryote diversified into the red, green, and
glaucophyte algae [6, 7]. Subsequently, a secondary endosymbiosis occurred,
in which a red alga that had gained its photosynthetic capability from primary
endosymbiosis was itself engulfed by a non-photosynthetic protist, giving rise to
the progenitor of the eukaryote supergroup Chromalveolata [7, 8]. The process of
endosymbiosis and the origin of the plastid are detailed in [9–11] and in Figure 1 of [6].
The phenomenon of endosymbiosis led to the transfer of genetic material from the
endosymbiont to the host nuclear genome via endosymbiotic gene transfer (EGT),
which is a specific case of LGT.
Chromalveolata is one of the six major “supergroups” of eukaryotes. This lineage
consists of a taxonomically diverse group of species that are of high ecological and
economic importance, including diatoms, seaweeds, dinoflagellates, and the malaria
parasite Plasmodium. Our group has previously demonstrated EGT (and LGT) in
chromalveolate genomes [3, 12–14], but the extent of EGT from red algae into
chromalveolates, vis-à-vis secondary endosymbiosis, has not been studied in a rigorous
manner.
Among the chromalveolates, diatoms are unicellular eukaryotes and one of the
primary contributors to the marine food chain. The diatoms are estimated to gen-
erate ≈ 40% of the organic carbon produced annually in the sea [15]. These taxa
affect the flux of atmospheric carbon dioxide into the oceans, which in turn has
effects on global climate [16]. Recently, the genome of the free-living diatom Tha-
lassiosira pseudonana was sequenced to completion [17]. Using the available genomic
sequences, here we present a rigorous, phylogenomic pipeline to examine the extent
of EGT of red algal genes in T. pseudonana, and investigate if these transferred
genes are restricted to photosynthesis-related functions.
2. A phylogenomic approach for inferring phylogenies
With the increasing amount of available genome data, phylogenomics, the intersec-
tion of evolutionary and genomic approaches [18], has become a key instrument in
studying genomes on a gene-by-gene basis. This is done primarily by the automated
generation and inspection of phylogenetic trees. In many recent studies, phyloge-
nomics has been employed to answer various questions including, e.g., prediction of
biochemical gene functions [19], evolution of gene functions [20], detection of gene
transfer events [1, 3], and resolution of complex taxonomic relationships [13].
Fig. 1. A schematic diagram of the phylogenomic pipeline: functional components and data flow. Labeled components: (1) Identification of homologous genes: database (MySQL), FASTA export of query and target sequences, WU-BLAST (PERL), XML parsing (Java & PERL); (2) Multiple sequence alignment: alignment (e.g. MUSCLE), refinement & conversion (Java) to PHYLIP; (3) Phylogeny inference (e.g. RAxML); (4) Topological analysis of phylogeny: phylogeny sorting (PhyloSort) for patterns of interest.
Our phylogenomic pipeline consists of four basic steps as shown in Figure 1.
First, homologous genes for the target sequences are identified (step 1) using
WU-BLAST (http://blast.wustl.edu/) searches against a database containing sequences
collected from public resources, e.g. NCBI (http://www.ncbi.nlm.nih.gov/) and
JGI (http://www.jgi.doe.gov/). We used WU-BLAST because this program is more
time-efficient than the original BLAST algorithm [21]. Following this, multiple
sequence alignment (step 2) is performed for each homologous gene family prior
to phylogeny inference (step 3). We used MUSCLE [22] to align the sequences, and
both neighbor-joining (NJ) [23] and maximum likelihood (ML) [24] to reconstruct
the phylogenies, because these yield high accuracy in a reasonably short period
of time [22, 24]. However, other approaches for sequence alignment and phylogeny
inference can easily be incorporated into our pipeline. Finally, once the phylogenies
for all gene families are obtained, they can be searched for topological patterns of
interest (step 4). In the current study, we used PhyloSort [25] to sort and examine
monophyletic relationships between chromalveolates and other taxa of interest.
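The topological search in step 4 can be illustrated with a minimal sketch. This is not PhyloSort's code: it assumes, for simplicity, a rooted tree encoded as nested tuples `(children, bootstrap)` with leaves written as `"Lineage|taxon"` strings. The function names, the lineage labels, and the toy tree are all hypothetical.

```python
# Hypothetical sketch of a monophyly search; this is NOT PhyloSort's code.
# A tree is either a leaf string "Lineage|taxon" or a tuple (children, bootstrap).

def leaf_lineages(node):
    """Return the set of lineage labels found under a node."""
    if isinstance(node, str):
        return {node.split("|")[0]}
    children, _support = node
    found = set()
    for child in children:
        found |= leaf_lineages(child)
    return found


def has_supported_clade(node, required, allowed=frozenset(), min_support=75):
    """True if some internal node contains every lineage in `required`,
    nothing outside `required | allowed`, and has bootstrap >= min_support."""
    if isinstance(node, str):
        return False
    children, support = node
    lineages = leaf_lineages(node)
    if (required <= lineages
            and lineages <= (set(required) | set(allowed))
            and support >= min_support):
        return True
    return any(has_supported_clade(c, required, allowed, min_support)
               for c in children)


# Toy tree (made-up topology and support values, for illustration only).
toy_tree = (
    [
        (["Red|Cyanidioschyzon", "Chromalveolata|Thalassiosira"], 88),
        (["Animalia|a1", "Fungi|f1"], 60),
    ],
    100,
)

red_chrom = {"Red", "Chromalveolata"}
print(has_supported_clade(toy_tree, red_chrom, allowed={"Green", "Glaucophyta"}))
```

The `allowed` argument mirrors searches of the form "red algae and chromalveolates, with or without green and glaucophyte algae": lineages listed there may appear in the clade but are not required.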
2.1. Analysis of EGT in Thalassiosira pseudonana
We obtained all 11,390 predicted protein-coding sequences from the complete
Thalassiosira pseudonana genome from JGI (http://www.jgi.doe.gov/). We performed a
preliminary screening using BLAST (at e-value ≤ 0.001) for sequences that are
highly similar to and thus possibly share a common ancestry (i.e., homologous) with
the genes in red algae. Using 5,014 protein sequences from the complete genome
of the red alga Cyanidioschyzon merolae [26], we found 4,894 (43.0% of 11,390)
protein-coding sequences in T. pseudonana to have homologs in C. merolae.
These protein-coding sequences were used as input in our phylogenomic pipeline
that utilizes our local database, which consists of 2,555,575 sequences from 62 eu-
karyote genomes, inclusive of complete and partial expressed sequence tag (EST)
sequences spanning Plantae, chromalveolates, Rhizaria, excavates, animals, fungi,
and Amoebozoa, and 500 complete bacterial genomes. Initially, the phylogenetic
trees were constructed using NJ with a Poisson-distance correction and 100 repli-
cates for the bootstrap analysis. By searching for the monophyly of cyanobacteria
and chromalveolates, with or without Plantae, we identified and removed 1,907
chromalveolate genes with a potential cyanobacterial origin. This step was designed
to exclude genes that were introduced via EGT into the red algal nucleus as a re-
sult of primary endosymbiosis. For the remaining 2,987 trees, we searched for the
monophyly of red algae and chromalveolates, with or without green and glaucophyte
algae (≥ 75% bootstrap support). We identified 288 protein-coding sequences in T.
pseudonana with potential red algal origin through EGT (as a result of secondary
endosymbiosis).
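The Poisson correction applied to the NJ distances converts the observed proportion p of differing aligned sites into d = -ln(1 - p). The following is a minimal sketch of that formula, assuming two pre-aligned protein sequences of equal length. The function name is hypothetical, and skipping gapped columns is an assumption of this sketch, not necessarily how the software used in the study treats gaps.

```python
import math


def poisson_distance(seq_a, seq_b, gap="-"):
    """Poisson-corrected distance d = -ln(1 - p) between two aligned proteins.

    p is the fraction of compared columns that differ. Columns with a gap in
    either sequence are skipped (an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable columns")
    p = differing / compared
    if p >= 1.0:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)


# Toy alignment (invented sequences): 2 of 8 compared columns differ, p = 0.25.
print(round(poisson_distance("MKTAYIAK-Q", "MKSAYLAKR-"), 4))
```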
Following this, we inferred ML phylogenies for each of the 288 genes using
RAxML [24] (WAG model [27]; 100 bootstrap replicates). Using the same approach
for detecting secondary EGT (described above), we identified 124 genes in chroma-
lveolates with a putative red algal origin, and clustered these into 80 distinct fami-
lies. We manually annotated the functions of these gene families. Blast2GO [28] was
used to annotate each family based on significant matches (e-value ≤ 10⁻⁵) in the
Gene Ontology (GO) database (http://geneontology.org/), for the three GO classes:
molecular function, biological processes, and cellular components. The GO protein
target prediction was complemented with PSORT [29] and Predotar [30]. Plastid-
targeting localization was inferred when two out of the three prediction methods
yielded positive results.
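The two-out-of-three rule amounts to a majority vote. A minimal sketch follows, assuming that each method's output has already been reduced to a boolean "plastid-targeted" call. The dictionary keys and the function name are hypothetical, and no actual PSORT, Predotar, or Blast2GO interface is used.

```python
def plastid_targeted(calls, min_votes=2):
    """Majority vote over boolean plastid-targeting calls.

    `calls` maps a predictor name to True/False. Reducing each tool's native
    output to a boolean is assumed to have been done upstream.
    """
    return sum(bool(v) for v in calls.values()) >= min_votes


# Invented example: two of the three methods call the protein plastid-targeted.
example = {"GO": True, "PSORT": False, "Predotar": True}
print(plastid_targeted(example))
```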
Fig. 2. Distribution of monophyly between chromalveolates and different lineages, for Thalassiosira pseudonana genes that showed a potential algal ancestry. The Y-axis represents the percentage (%) of monophyletic relationships recovered; the X-axis represents the different lineages of prokaryotes (Archaea; Bacteria, including cyanobacteria), eukaryotes (Amoebozoa, Animalia, Excavata, Fungi, Plantae, Rhizaria), and viruses (Vira). The blue and red bars represent the distributions across the datasets inclusive and exclusive of Plantae genomes, respectively.
To examine the significance of the observed monophyly between chromalveolates
and Plantae, we repeated the phylogenomic analysis using a dataset that excluded
Plantae genomes (glaucophytes, red, and green algae), and compared the observed
monophyly between chromalveolates and the other lineages, with the existing results
(dataset inclusive of Plantae genomes). As shown in Figure 2, the distributions of
the observed monophyly between chromalveolates and non-Plantae are not signifi-
cantly different between the two instances, i.e., when Plantae genomes are included
or not (Kolmogorov-Smirnov test [31], p-value > 0.05). This finding suggests that
the observed monophyletic relationship between chromalveolates and Plantae is
non-random, and not biased by a secondary or tertiary association between chro-
malveolates and the other lineages. The strong association between chromalveolates
and Bacteria (33.6%) in the dataset that excluded Plantae genomes can be explained
by the presence of cyanobacterial genes, which have originated via primary EGT
(most of which are of plastid function). The (cyano)bacterial association with di-
atom genes can therefore be explained by endosymbiosis and not by other scenarios
that involve LGT from prokaryotes.
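The comparison of the two distributions can be sketched with a two-sample Kolmogorov-Smirnov statistic. The code below is a minimal pure-Python illustration, not the authors' analysis: the two input lists are placeholder percentages invented for this example (they are not read from Figure 2), the function names are hypothetical, and the large-sample critical value is only a rough guide for samples this small.

```python
import math


def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_x(t) - F_y(t)|."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # The supremum of the ECDF difference is attained at an observed value.
    for t in xs + ys:
        fx = sum(v <= t for v in xs) / n
        fy = sum(v <= t for v in ys) / m
        d = max(d, abs(fx - fy))
    return d


def ks_critical(n, m, c_alpha=1.358):
    """Large-sample critical value; c_alpha = 1.358 corresponds to alpha = 0.05."""
    return c_alpha * math.sqrt((n + m) / (n * m))


# Placeholder percentages for eight non-Plantae lineages (invented values).
toy_with = [0.4, 28.0, 1.2, 3.0, 2.1, 1.5, 0.9, 0.3]
toy_without = [0.5, 30.0, 1.4, 3.3, 2.4, 1.6, 1.0, 0.3]

d_stat = ks_two_sample(toy_with, toy_without)
threshold = ks_critical(len(toy_with), len(toy_without))
print(d_stat, threshold, d_stat > threshold)
```

When D stays below the critical value, the hypothesis that both samples come from the same distribution is not rejected at the chosen level, which is the situation described above (p-value > 0.05).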
3. EGT of red algal genes in Thalassiosira pseudonana
We observe 124 (1.1% of the total 11,390) protein-coding sequences from the genome
of T. pseudonana to have a red algal origin. The phylogenetic trees built with each
of these genes and their respective homologs show monophyly of the red algae and
chromalveolates with bootstrap support ≥ 75%. The genes are clustered into 80 pu-
tative families (Table 1). Among these gene families, 40 (50.0%) are well-annotated
with gene ontologies (complete annotation for ≥ 90% of the sequences in each fam-
ily), whereas 18 (22.5%) are partially annotated (complete annotation for < 90%
of the sequences in each family). The remaining 22 (27.5%) are either incompletely
annotated or have no significant match in the gene ontology database. We consider
these 22 gene families to encode novel, unknown functions in the diatom.
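The counts reported so far can be cross-checked with a few lines of arithmetic. This is a sketch of that bookkeeping only: the numbers are transcribed from the text, while the variable names and the half-up rounding convention are assumptions made here.

```python
from decimal import Decimal, ROUND_HALF_UP


def pct(part, whole):
    """Percentage rounded half-up to one decimal place (assumed convention)."""
    value = Decimal(part) * 100 / Decimal(whole)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


total_proteins = 11390     # predicted protein-coding sequences
with_red_homolog = 4894    # BLAST hits against C. merolae
cyanobacterial = 1907      # removed as putative cyanobacterial origin
nj_candidates = 288        # NJ trees supporting red algal origin
ml_supported = 124         # confirmed by ML phylogenies
families = 80              # gene families after clustering

remaining = with_red_homolog - cyanobacterial  # trees kept for the red algal search
print(remaining)
print(pct(with_red_homolog, total_proteins))
print(pct(ml_supported, total_proteins))

# Annotation status of the 80 families; the three classes must add up.
annotation = {"well annotated": 40, "partially annotated": 18, "unknown": 22}
assert sum(annotation.values()) == families
print({name: str(pct(n, families)) for name, n in annotation.items()})
```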
In most of these families, T. pseudonana is represented by a single-copy
sequence (58, 72.5%), whereas some families contain two (14, 17.5%) or three
(6, 7.5%) gene copies. In two families the gene is highly duplicated within the
genome of T. pseudonana: the ABC-1 domain protein (7 copies) and the
light-harvesting protein (13 copies). As shown in
the last column of Table 1, 23 (28.8%) of the 80 gene families putatively code for
proteins targeted to the plastid, 21 (26.3%) putatively code for proteins targeted
to multiple organelles with the majority going to the plastid, 19 (23.8%) of the
proteins are potentially targeted to multiple organelles with the minority being the
plastid, whereas the remainder (17, 21.3%) putatively code for proteins that are not
targeted to the plastid. In parallel with the gene ontology analysis, we do not observe an
N-terminal extension in the bacterial homologs of these 17 eukaryotic gene families,
suggesting that these genes are not targeted to membrane-bounded organelles. The
families in which the gene copy is highly duplicated in T. pseudonana are found
to be targeted to multiple organelles in the cell (including the mitochondrion and
nucleus) and are not restricted to the plastid.
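As a consistency check, the copy-number classes and the targeting categories can be tallied against the totals of 80 families and 124 genes. The sketch below uses only counts stated in this section; the dictionary layout and names are illustrative assumptions.

```python
# Counts transcribed from the text above; variable names are illustrative.
# copies per family in T. pseudonana -> number of families with that many copies
copy_classes = {1: 58, 2: 14, 3: 6, 7: 1, 13: 1}

n_families = sum(copy_classes.values())
n_genes = sum(copies * fams for copies, fams in copy_classes.items())

# Predicted localization of the encoded proteins, by family.
targeting = {
    "plastid": 23,
    "multiple organelles, mostly plastid": 21,
    "multiple organelles, minority plastid": 19,
    "not plastid": 17,
}

print(n_families, n_genes, sum(targeting.values()))  # 80 124 80
```

Both tallies agree with the reported totals: the copy-number classes account for all 124 genes in 80 families, and the four targeting categories cover all 80 families.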