BioMed Central
BMC Evolutionary Biology
Open Access
Research article
Whole genome duplications and expansion of the vertebrate GATA
transcription factor gene family
William Q Gillis, John St John, Bruce Bowerman and Stephan Q Schneider*
Address: Institute of Molecular Biology, University of Oregon, 1229 University of Oregon, Eugene, OR 97403, USA
Email: William Q Gillis - wgillis@uoregon.edu; John St John - jstjohn@cs.uoregon.edu; Bruce Bowerman - bbowerman@molbio.uoregon.edu;
Stephan Q Schneider* - schneider@molbio.uoregon.edu
* Corresponding author
Abstract
Background: GATA transcription factors influence many developmental processes, including the
specification of embryonic germ layers. The GATA gene family has significantly expanded in many
animal lineages: whereas diverse cnidarians have only one GATA transcription factor, six GATA
genes have been identified in many vertebrates, five in many insects, and eleven to thirteen in
Caenorhabditis nematodes. All bilaterian animal genomes have at least one member each of two
classes, GATA123 and GATA456.
Results: We have identified one GATA123 gene and one GATA456 gene from the genomic
sequence of two invertebrate deuterostomes, a cephalochordate (Branchiostoma floridae) and a
hemichordate (Saccoglossus kowalevskii). We also have confirmed the presence of six GATA genes
in all vertebrate genomes, as well as additional GATA genes in teleost fish. Analyses of conserved
sequence motifs and of changes to the exon-intron structure, and molecular phylogenetic analyses
of these deuterostome GATA genes support their origin from two ancestral deuterostome genes,
one GATA123 and one GATA456. Comparison of the conserved genomic organization across
vertebrates identified eighteen paralogous gene families linked to multiple vertebrate GATA genes
(GATA paralogons), providing the strongest evidence yet for expansion of vertebrate GATA gene
families via genome duplication events.
Conclusion: From our analysis, we infer the evolutionary birth order and relationships among
vertebrate GATA transcription factors, and define their expansion via multiple rounds of whole
genome duplication events. As the genomes of four independent invertebrate deuterostome
lineages contain single copy GATA123 and GATA456 genes, we infer that the 0R (pre-genome
duplication) invertebrate deuterostome ancestor also had two GATA genes, one of each class.
Synteny analyses identify duplications of paralogous chromosomal regions (paralogons), from single
ancestral vertebrate GATA123 and GATA456 chromosomes to four paralogons after the first
round of vertebrate genome duplication, to seven paralogons after the second round of vertebrate
genome duplication, and to fourteen paralogons after the fish-specific 3R genome duplication. The
evolutionary analysis of GATA gene origins and relationships may inform our understanding of vertebrate
GATA factor redundancies and specializations.
Published: 20 August 2009
BMC Evolutionary Biology 2009, 9:207 doi:10.1186/1471-2148-9-207
Received: 17 February 2009
Accepted: 20 August 2009
This article is available from: http://www.biomedcentral.com/1471-2148/9/207
© 2009 Gillis et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Most animal genomes include multiple GATA transcription
factor genes with widely conserved developmental
roles [1]. Within vertebrates, GATA transcription factors
are required for the proper specification of cardiac and
blood cell lineages, for the induction and differentiation
of endoderm and mesendoderm, and in cell movement
during gastrulation and neural projections. In Xenopus lae-
vis, overexpression of GATA4, 5, or 6 can induce endo-
derm formation [2]. Similarly, the nematode GATA456
ortholog end-1 is necessary and sufficient to generate E or
endodermal cell fate in C. elegans, and it also can induce
endoderm when ectopically overexpressed in Xenopus [3].
The GATA transcription factor family is a relatively small
and evolutionarily tractable gene family, with only six
members present in mammals, five in insects, and eleven
in the nematode C. elegans. This gene family has under-
gone significant expansion in bilaterians compared to
lower metazoans. For example, only a single GATA gene
has been found in each of the two cnidarian genomes currently
sequenced [4].
Previous studies have demonstrated that the six vertebrate
GATA factors comprise two classes of evolutionarily
related genes, a GATA-1, -2, -3 class and a GATA-4, -5, -6
class [5]. These two GATA factor groups can be identified
throughout bilaterian animals, suggesting that the last
common ancestor of protostome and deuterostome
genomes contained at least two GATA genes, with both a
GATA123 and a GATA456 ortholog. Our recent survey of
GATA genes from the whole-genome sequence of multi-
ple protostome genomes has identified at least four GATA
genes in every currently available protostome genome,
with gene duplications having occurred only within the
GATA456 class [6].
In contrast, two basal deuterostomes (invertebrate rela-
tives of chordates), the echinoderm Strongylocentrotus pur-
puratus and the urochordate Ciona intestinalis, encode just
two GATA transcription factor genes, similar in number to
the predicted ancestral bilaterian state [5,7]. However,
these GATA genes are highly divergent in sequence and
bear only faint resemblance to the two GATA classes typi-
cal of most animal genomes. Indeed, a recent phyloge-
netic study of this gene family [8] concluded that the
small GATA gene repertoire of two in S. purpuratus and C.
intestinalis, relative to the eleven nematode and six verte-
brate GATA genes, resulted from secondary and independ-
ent losses of GATA genes in these lineages. In addition to
the uncertainty about their GATA gene origins, both echi-
noderms and urochordates have undergone exceptional
shifts in their developmental modes relative to other deu-
terostome phyla. Thus it has remained difficult to ascer-
tain the number, structural features, and roles of the
ancestral deuterostome GATA gene complement.
Deuterostomes include several major groups of inverte-
brate and vertebrate animals (Figure 1) [9-17]. The first
major deuterostome division occurred between ambulac-
rarians (echinoderms and hemichordates) and chordates.
The chordates then split into three groups: cephalochor-
dates, urochordates, and vertebrates. Recent studies indi-
cate that urochordates are the closest outgroup to
vertebrates [18], although urochordates are extremely
diverged on both molecular and morphological levels
[19]. The first split of vertebrates occurred between jawless
and jawed vertebrates (agnathans and gnathostomes), followed
by the divergence of jawed vertebrates into cartilaginous
and bony fish (Chondrichthyes and Osteichthyes).
There are two major groups of extant bony fish, ray-finned
and lobe-finned, with the former having given rise to tel-
eost fish and the latter to tetrapods.
There is ample evidence for multiple rounds of whole
genome duplication in vertebrate lineages. Two genome
duplication events are thought to have occurred near the
base of the vertebrate lineages. The first genome duplication
event (1R) has been proposed to have occurred prior to the
divergence of jawed and jawless vertebrates, with a second
genome duplication event (2R) occurring only in the jawed
vertebrate lineage [20]. However, a more recent survey of
multiple lamprey and hagfish gene families concluded
that the ancestor of extant jawless vertebrates also under-
went two whole genome duplication events, suggesting
that two rounds of whole genome duplication occurred
very early in the vertebrate lineage [21]. Finally, an addi-
tional whole genome duplication event (3R) appears to
have occurred in ray-finned fish [22-24]. During each of
these genome duplication events, two paralogous chro-
mosomal regions (paralogons) would be created from
each pre-duplication chromosomal region, and each par-
alogon would initially contain a single paralog for each
pre-duplicate gene. Therefore, for each 0R (pre-duplicate)
deuterostome gene, there could be maximally two
genome-duplicated paralogs in 1R animal genomes, four
in 2R genomes, and eight in 3R genomes, though neutral
drift should quickly eliminate most of the duplicated paralogs
[25-27]. We refer to paralogs resulting from genome
duplication events as ohnologs, following the convention
suggested by K. Wolfe [28] in honour of Susumu Ohno,
who first proposed the occurrence of these genome dupli-
cation events during key transitions of vertebrate evolu-
tion [26,27]. Because vertebrate genomes contain six
GATA factor genes, compared to only two in two different
deuterostome invertebrate genomes, it has been suggested
that the GATA transcription factor gene family may have
expanded in vertebrates by the retention of ohnologous
genes [5,7].
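The upper bound described above can be written out as a short calculation. The following is a minimal sketch, not part of the original analysis; the function name max_ohnologs is hypothetical and chosen only for illustration. It assumes that each round of whole genome duplication at most doubles the number of copies descending from a single 0R gene.

```python
def max_ohnologs(rounds: int) -> int:
    """Upper bound on ohnologs descending from one pre-duplication (0R) gene.

    Assumes each whole genome duplication doubles every retained copy and
    ignores independent single-gene duplications (illustrative assumption).
    """
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return 2 ** rounds


# Per-gene ceilings for 0R, 1R, 2R and 3R genomes.
for label, rounds in [("0R", 0), ("1R", 1), ("2R", 2), ("3R", 3)]:
    print(label, max_ohnologs(rounds))
```

Because gene loss is expected to remove most duplicates, observed family sizes should normally fall at or below these ceilings rather than match them.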
To more conclusively address the ancestral deuterostome
condition, we have identified the GATA transcription fac-
tor complement within the whole genome sequence of
two additional and less derived invertebrate deuteros-
tomes, the hemichordate Saccoglossus kowalevskii and the
cephalochordate Branchiostoma floridae. These analyses
include nine diverse vertebrate genome sequences, and
address gene phylogeny using both gene sequence and
genomic context comparisons. Importantly, one well-conserved
GATA123 gene family member and one well-conserved
GATA456 family member were found within each
invertebrate deuterostome genome analyzed. Thus our
study provides the strongest evidence yet that the ancestral
deuterostome genome contained two distinct GATA
genes, one GATA123 homolog and one GATA456
homolog, from which every deuterostome GATA gene
including the vertebrate complement originated. We con-
clude that hemichordates and cephalochordates have
retained members of both GATA classes. These analyses
further indicate that all vertebrate GATA genes retain con-
served syntenic ohnologs, supporting the hypothesis that
the expansion of the vertebrate GATA family has resulted
almost exclusively from whole-genome duplication
events.
Figure 1. Relationship and divergence times of deuterostome and vertebrate species. This tree represents a survey of molecular
and paleontological analyses of phylogeny and divergence times. Divergence time estimates are given in millions of years
ago (MYA). The timing of genome duplication events from the first round (1R), second round (2R), and the teleost-specific
third round (3R) is represented by rounded rectangles. The dotted line for the connection of the agnathan lineages represents
the current uncertainty regarding their divergence relative to the second round of genome duplication.
[Figure 1 tree. Tip labels: sea anemone (Nematostella vectensis), polychaete (Platynereis dumerilii), acorn worm (Saccoglossus kowalevskii), sea urchin (Strongylocentrotus purpuratus), lancelet (Branchiostoma floridae), larvacean (Oikopleura dioica), lamprey (Petromyzon marinus), hagfish (Eptatretus burgeri), skate (Raja erinacea), frog (Xenopus tropicalis), chicken (Gallus gallus), mouse (Mus musculus), human (Homo sapiens), zebrafish (Danio rerio), medaka (Oryzias latipes), three spined stickleback (Gasterosteus aculeatus), fugu (Takifugu rubripes), green spotted pufferfish (Tetraodon nigroviridis). Clade labels: Metazoa, Cnidaria, Bilateria, Protostomia, Deuterostomia, Ambulacraria, Hemichordata, Echinodermata, Chordata, Cephalochordata, Urochordata, Vertebrata, Agnatha, Gnathostomata, Chondrichthyes, Osteichthyes, Teleostei, Ostariophysi, Acanthopterygii, Tetraodontiformes, Tetrapoda, Amphibia, Amniota, Mammalia. Genome duplication markers: 1R, 2R, 3R. Node age labels range from 55 to 690 MYA; their assignment to individual nodes is not recoverable from the extracted text.]
Results
Identification of hemichordate and cephalochordate
GATA sequences
While we recently concluded that the genome of the
ancestor to both deuterostomes and protostomes
encoded two GATA transcription factors [6,7], another
group [8] suggested that at least five GATA factors were
encoded by the genome of the last common ancestor of
fruit flies, nematodes, and vertebrates, with subsequent
losses occurring in some deuterostome lineages (see Background).
To further address this issue, we have identified
GATA factor gene sequences from the available genomes
of two additional deuterostome invertebrates, the cepha-
lochordate Branchiostoma floridae and the hemichordate
Saccoglossus kowalevskii.
In the cephalochordate B. floridae genome sequence, with
8.1× coverage, we could identify only two GATA factor
genes. tBLASTn analysis of the B. floridae trace archives
was conducted with local BLAST servers [29] to identify
~136 amino acid (AA) fragments from two distinct GATA
genes. An initial reciprocal BLAST search suggested that these fragments
encode distinct GATA1/2/3 and GATA4/5/6
orthologs, and this initial assignment was also supported
by the phylogenetic analyses below; therefore we refer to
these as BfloGATA123 and BfloGATA456. These fragments
encode a highly conserved dual zinc finger domain [5-7]
within three exons (Figure 2). Two genomic scaffolds containing
these fragments were identified in the pre-release JGI genome
assembly (http://genome.jgi-psf.org/Brafl1/Brafl1.home.html).
By conducting bl2seq sequence comparisons on larger regions of
these scaffolds, the less conserved 5' and 3' ends of each gene,
encoding the N-terminal and C-terminal regions of each
protein, were identified.
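The reciprocal BLAST criterion used for the initial class assignment can be illustrated with a small sketch. This is not the authors' pipeline: the function names, fragment identifiers, and bit scores below are hypothetical, and the sketch assumes that BLAST results have already been parsed into dictionaries mapping each query to its hits and their bit scores.

```python
def best_hit(hits):
    """Return the subject with the highest bit score, or None if there are no hits."""
    return max(hits, key=hits.get) if hits else None


def is_reciprocal_best_hit(query, forward, reverse):
    """True if the best forward hit of `query` has `query` as its own best hit."""
    subject = best_hit(forward.get(query, {}))
    if subject is None:
        return False
    return best_hit(reverse.get(subject, {})) == query


# Hypothetical bit scores, for illustration only (not data from this study).
forward = {
    "Bflo_fragment_1": {"HsapGATA2": 210.0, "HsapGATA5": 120.0},
    "Bflo_fragment_2": {"HsapGATA4": 190.0, "HsapGATA3": 110.0},
}
reverse = {
    "HsapGATA2": {"Bflo_fragment_1": 205.0, "Bflo_fragment_2": 100.0},
    "HsapGATA4": {"Bflo_fragment_2": 185.0, "Bflo_fragment_1": 95.0},
}

for fragment in forward:
    print(fragment, best_hit(forward[fragment]),
          is_reciprocal_best_hit(fragment, forward, reverse))
```

A reciprocal best hit is only a first-pass indicator of orthology, which is why the assignment is checked against phylogenetic analyses later in the text.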
In a BLASTn search of these two predicted BfloGATA genes
against sequenced EST libraries, 19 ESTs were identified
for BfloGATA123, defining the full-length mRNA for
this gene (1419 nucleotides, 478 AA). Confirmation of the transcription
of the BfloGATA123 gene was made by PCR
amplification of a predicted 772 nucleotide (nt) fragment
with gene specific primers from a gastrula/neurula cDNA
library.
For the BfloGATA456 ortholog, we were unable to iden-
tify any EST from the pre-release database. We therefore
defined a gene model for the 5' domain through the con-
served dual zinc finger domain based on sequence com-
parison to human and Platynereis GATA sequences. Gene
specific primers were designed to a predicted 5' start
codon, and the conserved dual zinc finger domain. Two
clones were isolated via PCR from a B. floridae cDNA library,
with 859 and 874 nucleotide inserts. Using the Splign
program [30], these fragments both aligned to the same
region of JGI:scaffold 160, and are presumably alternative
splice forms. These splice forms are identical with the
exception of alternative second exons, with the smaller
splice form incorporating a novel exon that eliminates the
first zinc finger domain.
In the genomic trace archive of the hemichordate Sac-
coglossus kowalevskii, with 7× coverage, we have used our
Gene Family Finder program (see Methods) to computationally
identify two orthologs. A reciprocal BLAST analysis
suggested these to be a single GATA123 ortholog and a
single GATA456 ortholog, which held true with the addi-
tional phylogenetic analyses below, and we have therefore
named these SkowGATA123 and SkowGATA456, respec-
tively. Through comparisons to the BfloGATA gene
sequences, four exons from each SkowGATA (Figure 2)
were identified. Within SkowGATA123, two exons encode
the conserved first and second zinc fingers, and two additional
exons lie 5' to the conserved zinc finger domain. No additional
3' exon sequences were identified, including the 3'
conserved domain exon described for other GATAs, but it
is possible that this sequence is divergent or not repre-
sented in the current trace archive. For SkowGATA456,
three exons that encode the first zinc finger, second zinc
finger, and a 3' conserved lysine-rich region from the con-
served dual-zinc domain were identified, as well as an
additional single large 5' exon.
In summary, we have found that two additional deuteros-
tome invertebrate genomes each have only two GATA
transcription factor genes, further supporting our previous
conclusion that the last common ancestor to all deuterostomes
had only two GATA factor genes.
Identification of additional vertebrate GATA factors
To further investigate the expansion of GATA factors in
vertebrates, we conducted exhaustive searches for GATA
factor members within nine vertebrate genomes, includ-
ing five teleost and four tetrapod species, again using in
silico searches of annotated proteins and whole genome
contigs, as well as genomic trace files, to identify the com-
plete GATA complement for each genome. Each of the
tetrapod genomes was found to contain six GATA factor
genes, consistent with previous studies [5,8]. However, we
also identified seven or eight GATA factor genes in each of
the teleost genomes examined (Figures 3, 4, and 5). The expansion
from two pre-genome duplication (0R) GATA factors,
to six GATA factors in the tetrapod (2R), and eight in
the teleost (3R), is consistent with GATA family growth
via genome duplication.
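The consistency of these counts with genome duplication can be checked with simple arithmetic. The sketch below is illustrative only; the function name implied_losses is hypothetical. It assumes that the family grew exclusively by whole genome duplication from two 0R genes, so that any shortfall from the doubling ceiling reflects gene loss.

```python
def implied_losses(ancestral_genes: int, rounds: int, observed: int) -> int:
    """Copies lost relative to the ceiling of ancestral_genes * 2**rounds.

    Assumes growth only by whole genome duplication (illustrative assumption);
    an observed count above the ceiling would contradict that assumption.
    """
    ceiling = ancestral_genes * 2 ** rounds
    if observed > ceiling:
        raise ValueError("observed count exceeds the genome duplication ceiling")
    return ceiling - observed


# Counts reported in the text: two 0R genes, six in tetrapods (2R),
# seven or eight in teleosts (3R).
print("tetrapod:", implied_losses(2, 2, 6))
print("teleost (7 genes):", implied_losses(2, 3, 7))
print("teleost (8 genes):", implied_losses(2, 3, 8))
```

Under this assumption, none of the reported family sizes exceeds its ceiling (8 for 2R genomes, 16 for 3R genomes), which is the sense in which the counts are consistent with genome duplication followed by loss.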
Identification of class specific motifs
We next examined the cephalochordate and hemichor-
date GATA genes to determine if they include GATA123-
and 456-class specific conserved coding sequence motifs,
Figure 2. Exon/intron structure and conserved motifs of deuterostome GATAs. Identified exons are shown as solid blocks
(boundaries confirmed by cDNA sequence) or as dotted lines (boundaries not confirmed by cDNA sequence). GATA123
orthologs for human (HsapGATA1,2,3), zebrafish (Drer1a,2a,3), hemichordate (SkowGATA123), echinoderm (SpurGATAc),
and cephalochordate (BfloGATA123) are located within the blue block (top), and GATA456 (HsapGATA4,5,6,
DrerGATA4,5,6, SkowGATA456, SpurGATAe, BfloGATA456) orthologs are located within the red block (bottom). The
zebrafish GATA genes, Drer1b and Drer2b, are nearly identical in structure and length to Drer1a and Drer2a, respectively,
and are not shown. The sole cnidarian GATA from Nematostella (NvecGATA) is shown centrally. Motifs are represented
within the exons as colored blocks as specified in insets. The dotted line for the first SpurGATAc exon indicates its possible
pseudo-exon status, and the open bars for SkowGATAs are due to uncertainty regarding the exact ends of the exons. Thick
black lines represent ancestral eumetazoan splice sites for GATA genes, blue and red lines represent ancestral deuterostome
splice sites for GATA123 and GATA456 genes, respectively, and light black lines represent novel exons.
identified in a previous study [7]. BfloGATA123 exhibits
one of the most complete and well-conserved sets of
ortholog-specific motifs from our data set (Table 1 and
Figure 2), containing all 7 previously identified motifs,
which exhibit 38–76% amino acid identity with at least
one other example of that motif. BfloGATA456 contains
all 4 motifs identified within human GATA456 orthologs,
and an additional N-terminal motif previously only iden-
tified in the Platynereis PdumGATA456, the sea urchin
SpurGATAe, and the sole anemone GATA NvecGATA.
Similar to BfloGATA456, SkowGATA456 includes 3 N-ter-
minal motifs identified within human GATA456
orthologs, and an additional N-terminal motif previously
identified only in the Platynereis PdumGATA456, the
sea urchin SpurGATAe, and the sole anemone GATA
NvecGATA. No SkowGATA123 conserved C-terminal
motifs were detected, but all five previously identified
motifs N-terminal to the zinc finger domains could be
identified.
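The amino acid identity values quoted for the motifs can be illustrated with a short calculation. The text does not state how identity was computed, so the following is a sketch under stated assumptions: the two motif sequences are already aligned to equal length, columns containing a gap in either sequence are excluded, and the example sequences are invented for illustration rather than taken from any GATA protein.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Columns with a gap ('-') in either sequence are skipped (an assumption;
    other conventions divide by the full alignment length).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        matches += x == y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Invented aligned motifs: 5 identical residues over 7 ungapped columns.
print(round(percent_identity("MKSPEG-L", "MRSPDGAL"), 1))
```

The choice of denominator matters for short motifs, so identity values are only comparable when the same convention is applied throughout.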
Conserved splice site boundaries within the two
deuterostome GATA classes
To further analyze deuterostome GATA gene family mem-
bers, we next examined conservation of exon/intron struc-
tures. The genome assemblies were compared to the
translated amino acid sequence to map splice sites and
exon/intron boundaries. When the B. floridae cephalo-
chordate and S. kowalevskii hemichordate GATA genes
were compared to their human, fish, and sea anemone
orthologs, we found all of the genes contain two internal
introns that divide the conserved dual-zinc finger domain
into first zinc finger, second zinc finger, and 3' lysine rich
encoding exons (Figure 2). The positional conservation of
these two introns correlates with the high conservation of
the dual zinc-finger domain, relative to the rest of the pro-
tein, in almost all animal GATA transcription factors [6].
Thus, an ancestral exon/intron structure of the core con-
served DNA binding domain has been retained in both
deuterostome GATA123 and GATA456 gene families.
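Comparing intron positions between orthologs requires expressing each splice site in protein coordinates. The sketch below shows one way this could be done from coding exon lengths; it is not the procedure used in this study, the function name is hypothetical, and the exon lengths are invented for illustration.

```python
def intron_positions(coding_exon_lengths):
    """Return (codon_index, phase) for each intron of a coding sequence.

    codon_index is the number of complete codons upstream of the intron.
    phase 0 means the intron lies between codons; phase 1 or 2 means it
    follows the first or second base of a codon.
    """
    positions = []
    cumulative = 0
    # The last exon is followed by no intron, so it is excluded.
    for length in coding_exon_lengths[:-1]:
        cumulative += length
        positions.append((cumulative // 3, cumulative % 3))
    return positions


# Invented coding exon lengths (in nucleotides) for a four-exon gene.
print(intron_positions([300, 146, 126, 200]))
```

Two introns would then be considered positionally conserved when they fall at the same aligned residue and in the same phase in a protein alignment of the orthologs.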
We also identified differences in the exon/intron bounda-
ries of GATA123 and GATA456 genes, 5' and 3' of the con-
served dual-zinc finger regions. These differences produce
distinctive exons that encode the class-specific N- and C-terminal
motifs [7]. The cephalochordate BfloGATA456,
the hemichordate SkowGATA456, and the echinoderm
GATA456 ortholog SpurGATAe, as well as the human
HsapGATA4, 5, and 6 genes and the zebrafish DrerGATA5
and 6 genes, all have a single exon 5' of the zinc finger
domain exons, with this one exon encoding all of the
identified GATA456 motifs. However, BfloGATA123,
SkowGATA123, and the human and zebrafish GATA 1, 2
and 3 genes all are encoded by two 5' exons, with two con-
served motifs located within the first exon and three
within the second. A comparison to the motifs shared
with NvecGATA found that this intron has been observed
only within bilaterian GATA123 factors [6]. We therefore
suggest that the single 5' exon in GATA456 genes may represent
the ancestral condition for all GATA genes, and that
an intron insertion subsequently occurred in the GATA123
lineage shortly after the duplication of the ancestral GATA gene.
In contrast to their 5' exons, the 3' structure of the
deuterostome GATA123 genes is more conserved than the 3'
structure of the GATA456 genes. The analyzed cnidarian,
echinoderm, cephalochordate, and vertebrate GATA123
orthologs each have only a single large 3' exon (though we
have not yet found any 3' exons in the hemichordate
SkowGATA123), and all of these orthologs contain two
identifiable conserved motifs in this region. In contrast,
the 3' ends of the GATA456 genes are more variable, and
no conserved motifs have been identified among these
genes. The 3' ends of the human and zebrafish GATA5 and
6 genes, and of the zebrafish GATA4 gene, are split into
three short exons, the first of which is conserved in the
cephalochordate BfloGATA456. The two additional 3'-most
exons appear to be diagnostic for vertebrate GATA456
genes. However, the echinoderm GATA456 ortholog has two
large 3' exons, while only a single short 3' exon can be
identified in the hemichordate or cephalochordate GATA456
orthologs, though these identifications may be incomplete
because of the lack of sequence conservation in these
regions. Thus, it appears that lineage-specific acquisition
of 3' exon/intron structure and sequence divergence has
occurred for GATA456 factors, whereas the GATA123 factors
appear to retain a 3' structure similar to that of the
sole anemone NvecGATA gene. Our analysis of the
class-specific features, including both conserved amino-acid
motifs and intron/exon boundaries, provides further support
for the view that all deuterostome GATA genes are members
of either the GATA123 or the GATA456 subfamily.
Prediction of an additional 5' exon in the S. purpuratus
GATA123 gene (GATAc)
Because two 5' exons are conserved in chordate and
hemichordate GATA123 genes, and because echinoderms are a
sister group to the hemichordates, we were surprised that
only one 5' exon was described in the previously
characterized echinoderm GATA123 ortholog, SpurGATAc.
Furthermore, motifs encoded by the 5'-most exon in other
deuterostome GATA123 genes are also present in protostome
GATA123 orthologs [7], suggesting either that the
S. purpuratus GATAc gene lost its first exon at some point,
or that the gene is currently incorrectly annotated. Using
tBLASTn searches of the S. purpuratus genomic sequence
with the 5' exon from BfloGATA123, but not with those
from other deuterostome GATA123 genes, we identified a
BMC Evolutionary Biology 2009, 9:207 http://www.biomedcentral.com/1471-2148/9/207
Figure 3
Phylogeny of deuterostome GATA123 and GATA456 subfamilies. Phylogenetic trees for GATA123 genes (a) and
GATA456 genes (b). Branch support is given as posterior probabilities from a Bayesian analysis (bold) and as values from the
approximate likelihood ratio test chi-square parameter (regular). Both trees are rooted using the Platynereis ortholog. Species
names are as follows: Bflo-Branchiostoma floridae (cephalochordate), Drer-Danio rerio (zebrafish), Ggal-Gallus gallus (chicken),
Gacu-Gasterosteus aculeatus (stickleback), Hsap-Homo sapiens (human), Olat-Oryzias latipes (medaka), Skow-Saccoglossus
kowalevskii (acorn worm), Spur-Strongylocentrotus purpuratus (sea urchin), Pdum-Platynereis dumerilii (annelid), Regl-Raja eglanteria
(skate), Trub-Takifugu rubripes (fugu), Tnig-Tetraodon nigroviridis (tetraodon), Xtro-Xenopus tropicalis (frog).
[Figure 3 image: panel (a), GATA123 tree; panel (b), GATA456 tree. Scale bars: 0.2. Labeled clades: Vertebrate GATA1, Teleost GATA1a, Teleost GATA1b, Vertebrate GATA2, Teleost GATA2a, Teleost GATA2b, Vertebrate GATA3, Vertebrate GATA4, Vertebrate GATA5, Vertebrate GATA6. Tree topology, tip labels, and branch support values omitted.]
207 base pair (bp) region with significant (p = 0.002)
similarity, approximately 18 kilobases (kb) upstream of the
current exon 1 in the SpurGATAc gene. This sequence
includes an open reading frame of 69 residues with an
N-terminal motif 64% identical in amino acid sequence to
the corresponding motif in BfloGATA123 (27% to
PdumGATA123, 50% to NvecGATA123, 56% to
MmusGATA2). However, this open reading frame begins
abruptly within this motif and does not include a 5' start
codon. This result appears to be consistent with tBLASTn
searches of the S. purpuratus genomic trace archive
directly, in which 22 of 23 traces identified containing this
ORF also have a stop codon at the same position [*QVD-
VYYHH.], suggesting that this stop codon is not a
sequencing error. Therefore, there could be an additional
short 5' exon that we have not found, or perhaps this exon
has degenerated and is now a pseudo-exon.
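The identity values quoted above are pairwise percent identities over aligned motifs. As a minimal sketch of how such a score can be computed, the function below counts matching residues over aligned columns, skipping any column that contains a gap. The exact scoring convention used in this study is not stated here, so the gap handling is an assumption, and the two sequences are invented placeholders, not GATA motifs. The last lines check that a 207 bp region corresponds to 69 codons.

```python
def percent_identity(a, b):
    """Percent of aligned, gap-free columns at which two sequences match.

    Both inputs must already be aligned (equal length, '-' for gaps).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)

# Hypothetical aligned fragments, invented for illustration only.
query = "MEVATDQ-RW"
subject = "MEVSTEQPRW"
print(round(percent_identity(query, subject), 1))

# Three nucleotides per codon: a 207 bp region spans 69 residues.
region_bp = 207
orf_residues = region_bp // 3
print(orf_residues)
```

Other conventions divide by the length of the shorter sequence or by the full alignment length, and these give different percentages for the same alignment.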
Molecular phylogenetic analysis of deuterostome GATA
genes
To better define the relationships between all deuteros-
tome GATA factors, we conducted a series of molecular
phylogenetic analyses. We first analyzed the complete set
of the collected GATA factors, using the conserved zinc
finger domains and aligning newly identified factors to a
previously defined alignment [6]. These analyses consist-
ently resolved GATA123 and GATA456 subfamilies (Addi-
tional File 1), but could not fully resolve relationships
within the GATA123 and GATA456 clades. Alignments of
GATA123 and GATA456 full length protein sequences
Figure 4
Syntenic genes with GATA123 locus from seven vertebrate sequences. Gene names are given from ENSEMBL, and locations
on chromosomes are given in megabases. The colored blocks represent syntenic genes that are part of paralogy groups syntenic
with other GATA loci, following the color scheme in Figure 6.
[Figure 4 image: four synteny panels, for the GATA1 (HsaX), GATA2 (Hsa3), GATA3 (Hsa10), and GATA1-ogm (Hsa1) regions. Each panel lists the genes neighboring the locus and their chromosomal or scaffold positions (mb) in human, mouse, chicken, zebrafish, fugu, stickleback, and medaka, with amphioxus (Brafl1 scaffold_27) shown for comparison. Per-gene entries omitted.]
were highly variable and generated results similar to those
obtained using the conserved zinc finger domain only
(data not shown).
Alignments that compared only GATA123 or only
GATA456 subfamily members resulted in greater conver-
gence of the gene tree to the species tree (Figure 3a/b). We
found that invertebrate GATA123 and GATA456 genes
formed separate clades outside of the vertebrate GATA123
and GATA456 clades, respectively, consistent with the 2R
origin of the additional vertebrate GATAs. Within the
individual vertebrate clades, there was a clear separation
of tetrapod and teleost genes, and only minor changes to
the species tree were observed within these groupings
(compare Figure 3 to Figure 1). Outside of vertebrates, the
cephalochordate GATA genes BfloGATA123 and
BfloGATA456 group with the ambulacrarian orthologs
(hemichordates and echinoderms), but this is likely due
to the high degree of conservation and low level of
divergence for both the hemichordate and cephalochordate
genes.
Our results also suggest distinct ancestral relationships
within each vertebrate GATA class. Within the GATA123
class (Figure 3a), a closer relationship was observed
between the GATA2 and GATA3 members, to the exclu-
sion of a more rapidly evolving GATA1 group. In contrast
to previous results [5], and consistent with other recent
results [8], a closer relationship between the GATA5 and 6
groups, to the exclusion of the GATA4 group, was
observed within the GATA456 class.
In conclusion, these molecular phylogenetic analyses sup-
port the presence of two classes of GATA factors through-
Figure 5
Syntenic genes with GATA456 locus from seven vertebrate sequences. Gene names are given from ENSEMBL, and locations
on chromosomes are given in megabases. The colored blocks represent syntenic genes that are part of paralogy groups syntenic
with other GATA loci, following the color scheme in Figure 7.
[Figure 5 image: three synteny panels, for the GATA4 (Hsa8), GATA5 (Hsa20), and GATA6 (Hsa18) regions. Each panel lists the genes neighboring the locus and their chromosomal or scaffold positions (mb) in human, mouse, chicken, zebrafish, fugu, stickleback, and medaka. Per-gene entries omitted.]
out deuterostomes. Deuterostome invertebrates possess
single GATA123 and GATA456 genes, and the deuteros-
tome GATA gene family has expanded in a manner con-
sistent with several rounds of whole genome duplication
at the base of the vertebrate lineages. See the Discussion
for further consideration of these results.
Syntenic conserved paralogs and the identification of
genome-duplicated GATA paralogons
Based on the above analysis, we hypothesize that (i) the
last common ancestor to all deuterostomes had one
GATA123 gene and one GATA456 gene within its
genome, and (ii) multiple rounds of whole genome dupli-
cation account for the expansion of vertebrate and teleost
GATA genes. If this hypothesis is correct, then we should
be able to detect duplicated GATA paralogons–conserved,
syntenic paralogs associated with the corresponding par-
alogous GATA loci–within the vertebrate evolutionary lin-
eage. To test this prediction, we characterized the adjacent
genomic regions for each vertebrate GATA locus, search-
ing for examples of tightly linked loci that have been
duplicated together as a result of whole chromosome
duplications. Although a superficial analysis of conserved
synteny has been published [8], which describes a 'segre-
gation' of vertebrate GATA genes on multiple chromo-
somes, we now describe deeper syntenies of orthologs
across species and paralogs within species, and use this to
completely describe the paralogons and their context dur-
ing genome duplication events.
In support of GATA gene family expansion via genome
duplication, we found numerous gene families with con-
served synteny across the GATA loci. We first described
genes syntenic with GATA123 and GATA456 loci across
each of the vertebrate species (Figures 4, 5, Additional
Files 2, 3 & 4; see Methods). These data were used to identify
gene families with paralogs syntenic in multiple GATA
loci in fish and/or tetrapod species (Figures 6b, 7b). These
Table 1: Conservation of GATA motifs
AA Percent Shared Identity
Motif:  123_N1  123_N2  123_N3  123_N4  123_N5  Dual-ZF Domain  123_C1  123_C2
Gene    Pd Nv   Pd Nv   Pd Nv   Pd Nv   Pd      Pd123 Pd456 Nv  Pd Nv   Pd Nv
BfGATA123
SkGATA123
SpGATAc
CiGATAb
HsGATA1
HsGATA2
HsGATA3
DrGATA2a
DrerGATA2b
DrGATA3
DrerGATA1a
DrerGATA1b
10 30
14
27
31
47
50
14
27
53
56
56
61
61
56
7
58
64
70
44
52
64
58
38
29
38
21
21
94
77
88
90
82
92
92
82
72
81
81
74
82
82
90
74
84
84
77
86
87
111727
0----
-
-
-
-
717
11
17
17
25
33
11
11107
-------------
11 18
17
17
11
24
29
38
20
17
35
50
31
40
38
27
48
52
56
32
40
15
33
34
41
25
26
16
58
65
55
55
64
64
58
50
38
58
39
36
37
29
34
25
21
29
41
33
23
5
16
30
33
17
12
25
9
6
14
5
4
-
8
35113016
-
-
-
-
-
-
-
- 17--
Motif:  456_N1  456_N2  456_N3  456_N4  Dual-ZF Domain
Gene    Pd Nv   Pd Nv   Pd Nv   Pd      Pd123 Pd456 Nv
BfGATA456
SkGATA456
SpGATAe
CiGATAa
HsGATA4
HsGATA5
HsGATA6
DrGATA4
DrGATA5
DrGATA6
20
13
18
16
15
15
60
56
46
21
43
35
50
44
50
46
20
23
10
10
18
16
12
13
10
13
41
41
17
13
40
32
34
34
20
31
37
32
25
14
21
12
22
16
16
20
8
0
87
82
80
57
83
77
76
90
88
84
56
90
84
84
82
81
80
58
79
76
74
17
0
-
-
-
-
-
-
13
13
17
17
30
17
Percent identity shared between motifs and conserved domains from cephalochordate (Bf), hemichordate (Sk), and echinoderm (Sp) GATAs
compared to polychaete (Pd) and sea anemone (Nv) GATAs. Scores are based on pairwise percent identity in individual alignments.
results allowed us to define the predicted GATA paralogons
within each vertebrate genome. Overall, thirteen
ohnologous gene families were identified as shared
between at least two of the four paralogous GATA1/2/3
regions (Figures 4, 6a). Likewise, five gene families
are shared between the paralogous GATA4/5/6 regions
(Figures 5, 7a). Thus, all vertebrate GATA genes are
located within extensive paralogons, providing strong
support for an origin of the vertebrate GATA gene
complement by whole genome duplication events from two
ancestral GATA loci, one GATA123 gene and one
GATA456 gene.
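The criterion used above, a gene family counted when its members neighbor at least two paralogous GATA regions, can be sketched in a few lines. The example below is an illustration only, not the pipeline used in this study. It takes a small subset of human gene names that appear in Figure 4, and the family labels and the function name `shared_families` are simplifying assumptions introduced here.

```python
from collections import defaultdict

# Illustrative subset of genes neighboring the four human GATA123-class
# regions (names as they appear in Figure 4). The family labels are
# simplified assumptions made for this sketch.
neighbors = {
    "GATA1": {"PFKFB1": "PFKFB", "CACNA1F": "CACNA1",
              "TIMM17B": "TIMM17", "SUV39H1": "SUV39H", "HDAC6": "HDAC"},
    "GATA2": {"PFKFB4": "PFKFB", "CACNA1D": "CACNA1",
              "SFMBT1": "SFMBT", "PRKCD": "PRKC"},
    "GATA3": {"PFKFB3": "PFKFB", "SUV39H2": "SUV39H",
              "SFMBT2": "SFMBT", "PRKCQ": "PRKC"},
    "GATA1-ogm": {"PFKFB2": "PFKFB", "CACNA1S": "CACNA1",
                  "TIMM17A": "TIMM17"},
}

def shared_families(neighbors, min_regions=2):
    """Map each family found near >= min_regions regions to those regions."""
    regions_by_family = defaultdict(set)
    for region, genes in neighbors.items():
        for family in genes.values():
            regions_by_family[family].add(region)
    return {family: sorted(regions)
            for family, regions in regions_by_family.items()
            if len(regions) >= min_regions}

result = shared_families(neighbors)
for family in sorted(result):
    print(family, result[family])
```

In this toy subset, six families are shared between at least two regions, while the HDAC entry is excluded because it occurs near only one region.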
By comparing the differential pattern of gene loss versus
gain between the GATA paralogons within and among
these vertebrate species, we infer the evolutionary birth
order of the GATA paralogons by determining the most
parsimonious pattern of ohnolog retention (Figures 4, 5,
6a, 7a). In this analysis, we describe clade-specific con-
served losses of duplicated paralogs, though it is also for-
mally possible that these 'losses' may represent the
translocation of a pre-duplication gene into or out of a
paralogon prior to a gene duplication event. Nevertheless,
all cases are phylogenetically informative.
For the GATA123 family (Figure 6a), we conclude that the
initial 1R duplication of the ancestral GATA123 paralogon
generated a GATA1/1-ogm paralogon (ogm: ohnolog gone
missing, see [31]) and a GATA2/3 paralogon, and was
followed by seven paralogous gene losses. The GATA1/1-ogm
paralogon lost four ohnologs, whereas the GATA2/3
paralogon lost three ohnologs (for lost ohnolog identities,
see legend of Figure 6). Furthermore, within the
GATA2/3 paralogon the ITIH gene apparently underwent
a local tandem duplication before the 2R duplication,
resulting in the ITIH1/2/3 genes and the ITIH4/5 genes.
In the 2R duplication, the GATA1/1-ogm paralogon
duplicated to generate two distinct paralogons, GATA1
and GATA1-ogm (so named because the second GATA1
ohnolog has been lost), while the GATA2/3 paralogon
gave rise to the distinct GATA2 and GATA3 paralogons.
After the 2R duplication of both the GATA1/1-ogm and
GATA2/3 paralogons, only seven paralogous gene losses
are required to explain the inferred composition of the
four resulting ancestral vertebrate GATA123 paralogons.
According to our scenario, the GATA1 paralogon lost one
ohnolog, the GATA1-ogm paralogon lost three ohnologs
(including a second GATA1), the GATA3 paralogon lost
three ohnologs, and the GATA2 paralogon lost none.
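The loss counts in this scenario can be checked with simple bookkeeping. The short sketch below only tallies the per-paralogon losses stated in the text; the variable names are illustrative.

```python
# Ohnolog losses per paralogon, as stated in the text for GATA123.
losses_after_1r = {"GATA1/1-ogm": 4, "GATA2/3": 3}
losses_after_2r = {"GATA1": 1, "GATA1-ogm": 3, "GATA3": 3, "GATA2": 0}

total_1r = sum(losses_after_1r.values())
total_2r = sum(losses_after_2r.values())

print(total_1r, total_2r)
```

Both totals come to seven, in agreement with the seven losses inferred after each round of duplication.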
For the GATA456 family, we propose that the initial 1R
duplication generated a GATA5/6 paralogon and a
GATA4/4-ogm paralogon (Figure 7). This 1R duplication
was followed by a severe reduction of the GATA4/4-ogm
paralogon, resulting in a minimum of three gene losses
within this paralogon. In contrast, no gene losses occurred
within the 1R GATA5/6 paralogon. We speculate that
the 2R duplication subsequently generated the GATA5
and GATA6 paralogons, and the relatively diminished
GATA4 and GATA4-ogm paralogons. Due to the extensive
loss of ohnologs in the latter two, we have been unable to
identify a paralogous region representing the GATA4-ogm
paralogon. However, our current analysis indicates that
one ohnolog is missing from the GATA5 paralogon, and
one ohnolog is missing from the GATA6 paralogon, with
each of these two ohnologous genes being retained within
the GATA4 paralogon. In contrast, three pairs of ohnologs
are shared between the GATA5 and GATA6 paralogons.
Discussion
Invertebrate deuterostome genomes encode single
GATA123 and GATA456 orthologs
To examine the evolution of GATA transcription factors in
deuterostomes, including vertebrates, we searched for and
identified single-copy GATA123 and GATA456 orthologs
in two basal deuterostomes, the cephalochordate Branchi-
ostoma floridae and the hemichordate Saccoglossus kowa-
levskii. Single-copy GATA123 and GATA456 orthologs
have also been identified in two other basal deuteros-
tomes, the echinoderm Strongylocentrotus purpuratus and
the urochordate Ciona intestinalis. However, the B. floridae
and S. kowalevskii genes are more conserved in sequence
than the previously described invertebrate deuterostome
GATA genes. This conservation includes near-complete
sets of GATA123 and GATA456 class-specific
sequence motifs [7], and conserved intron/exon boundaries
in the gene regions that encode these motifs. Our findings
confirm previous phylogenetic inferences that the
genome of the last common ancestor of all deuterostomes,
like the bilaterian ancestor, encoded one GATA123
and one GATA456 transcription factor, with subsequent
duplications giving rise to the multiple family members
present in vertebrate genomes.
Reconstructing the ancestral exon/intron structure and
evolution of the GATA gene
By comparing the exon/intron structures of deuterostome
GATA genes, we can infer the structure of the ancestral
deuterostome (Ud) GATA orthologs, as well as the ances-
tral eumetazoan (Em) ortholog. All three of these genes
contained a conserved dual-zinc finger domain encoded
in three central exons, which encode the first zinc finger,
second zinc finger, and a lysine-rich region. However, the
3' and 5' regions appear to vary among these genes. We
infer that both the EmGATA and the UdGATA456 genes
included a single exon 5' to this conserved domain, while
the UdGATA123 gene contained two 5' exons. As conserved
sequence motifs can be identified within the two 5'
exons of the GATA123 genes, and also within the single
5' exon of the sole Nematostella GATA gene, we infer that
Figure 6. Evolution of GATA1/1b/2/3 chromosomal regions. (a) Evolutionary scenario leading to the expansion of the chordate
GATA123 paralogon into the four GATA1, 2, 3, and 1b paralogons during two rounds of genome duplication. The reconstructed
GATA paralogon(s) for the vertebrate ancestor are shown after the 1R genome duplication (light grey box) and after the 2R
genome duplication (medium grey box). Paralogs in the 0R vertebrate genome can be strongly inferred when
present in both the GATA1/1-ogm paralogon and the GATA2/3 paralogon (represented by diamond), or when synteny is also
conserved in the cephalochordate genome (downward-pointing triangle); otherwise it is not clear if these genes were translocated
independently into the 1R paralogons. (b) Changes to the paralogons from the inferred 2R state of the last common bony
fish/tetrapod ancestor (medium grey box) to the extant amniote or teleost state (dark grey box). Three red bars across the
chromosome indicate that a larger genomic distance separates syntenic regions on the same chromosome. Paralogous gene
families include the protein kinase C (PRKCQ, D), SCM-like (SFMBT1,2), 6-phosphofructo-2-kinases (PFKFB1, PFKFB2,
PFKFB3, PFKFB4), ITI heavy chains (ITIH1, ITIH2, ITIH3, ITIH4, ITIH5, ITIH5L), calcium channel subunits (CACNA1F,
CACNA1D, CACNA1S), mitochondrial translocase subunit (TIMM17A, TIMM17B), PTC-kinases (PTCK1, PTCK3), ETS
domain containing (ELK1, ELK2), SEC61 transport proteins (SEC61A1, SEC61A2), opsins (Rho, OPN1MW1, OPN1MW2,
OPN1LW), TMC/TEX transmembrane proteins (TEX28, Z68193.2, AC092402.4, TMCC1,2), CAM-kinases (CAMK1,
CAMK1D, CAMK1G, PNCK), and coiled-helix-coiled-helix genes (CHCHD3, CHCHD6).
[Figure 6 graphic. Stage labels: Vertebrate-0R, Vertebrate 1R, Vertebrate 2R; Vertebrate Ancestor, Amniote Ancestor, Teleost Ancestor. Gene labels: GATA1/2/3, GATA1/1-ogm, GATA2/3, GATA1, GATA1-ogm, GATA1-ogm-b, GATA1a, GATA1b, GATA2, GATA2a, GATA2b, GATA3, GATA3-ogm. Linked gene families: PRKCQ/D, SFMBT1/2, PFKB1/2/3/4, ITIH1/2/3/4/5/5L, CACNA1D/F/S, TIMM17A/B, PTCK1/3, ELK1/4, SEC61A1/2, Rho/Opsins, TMC/TeX, CAMK1D/G/PNCK, CHCHD3/6. Events are marked Loss, Gain, or ?.]
the 5' region of the UdGATA123 gene gained an addi-
tional intron.
Although the 5' exon-intron structure of UdGATA456 is
more similar to the NvecGATA exon/intron structure, the
3' end of GATA456 orthologs exhibits more variable fea-
tures. Both the NvecGATA and the GATA123 genes termi-
nate with the third exon of the conserved domain, while
GATA456 orthologs contain a truncated third conserved
domain exon as well as one or more novel 3' exon(s). This
third conserved-domain exon encodes ~27 conserved
amino acids of a lysine-rich region. In the GATA123 and
NvecGATA genes, this exon also encodes a less-conserved
terminal end with two C-terminal motifs. In comparison,
this lysine-rich exon is shorter in GATA456 orthologs than
in the GATA123 or NvecGATA genes, and lacks C-terminal
sequence motifs. Furthermore, GATA456 genes contain
novel exon(s) 3' to the conserved domain, and we have
been unable to identify conserved motifs from this addi-
tional region, suggesting that the 3' region has undergone
significant evolutionary change in GATA456 paralogs.
All vertebrate GATA456 genes have lost the ancestral N-
terminal motif N1, which is present within deuterostome
invertebrate GATA456s, and within the protostome anne-
lid Platynereis dumerilii GATA456 ortholog. A BLAST
search of the human N-terminal region against the NR
protein database fails to find this motif in any vertebrate
GATA transcription factor, suggesting that this motif may
have been lost early in vertebrate evolution.
Greater sequence conservation of GATA123 orthologs
Comparison of different deuterostome GATA genes to the
sole cnidarian GATA (NvecGATA) gene also suggests that
GATA123 genes are more slowly evolving than their
GATA456 counterparts. This can be seen in the
higher percent identity shared between the conserved
domains of the deuterostome GATA123 and NvecGATA
genes (Table 1), the high affinity of the BfloGATA123 with
the NvecGATA, and the total number of common motifs
we can identify. Perhaps GATA123 genes are more con-
strained due to their retention of a deep ancestral func-
tion, while the GATA456 class might be more diverged
due to the selection or incorporation of bilaterian- or
phylum-specific roles. This view is consistent with previous
comparisons of the GATA gene complement in multiple
protostome genomes. Almost all protostomes possess a
single copy, more slowly evolving GATA123 gene,
whereas the GATA456 genes expanded in many proto-
stomes by sequential tandem duplications and subse-
quent modifications to their gene structure [6]. However,
the expression patterns currently described for deuteros-
tome and cnidarian GATA factors are not consistent with
retention of a deep ancestral function within the
GATA123 class. Whereas NvecGATA mRNA is largely
restricted to the endoderm in the cnidarian Nematostella
[4], with only a small ectodermal expression domain, the
vertebrate GATA-1, -2, and -3 are expressed and function
mostly within ectodermal tissues and blood, but not in
the endoderm [1]. However, GATA gene expression has
not been examined in many cnidarian species, and thus
any inference of an ancestral GATA function deeper in
animal phylogeny than bilaterians is still premature.
Expansion of vertebrate GATA transcription factor genes
during two rounds of whole genome duplications
Our previous work and this analysis suggest that the last
common ancestor to both protostomes and deuteros-
tomes had single GATA123 and GATA456 genes. But
these two GATA classes have undergone distinct expan-
sions using different mechanisms during the subsequent
evolution of different animal phyla. In protostomes, only
the GATA456 class appears to have undergone expansion,
at least in part by tandem duplications within individual
chromosomes. By contrast, in vertebrates, both the
GATA123 and the GATA456 family have expanded
through the retention of duplicated GATA genes that orig-
inated during two rounds of whole genome duplication
[32]. Our molecular phylogenetic analysis, and our anal-
ysis of conserved syntenic paralogs, both support expan-
sion by whole genome duplication and furthermore
suggest a specific evolutionary order for these duplication
events (compare scenarios in Figure 6a and 7a to the
clades defined in Figure 2).
Our molecular phylogenetic analysis of GATA123 genes
(Figure 3a) reveals a closer relationship between GATA2
and GATA3 orthologs, to the exclusion of a more rapidly
evolving GATA1 group. It is not surprising that the GATA2
and GATA3 genes show more affinity to each other, as
GATA1 appears to be a fast-evolving ortholog relative to
other vertebrate GATAs. Nevertheless, these relationships
are further supported by the retention of more syntenic
paralogs between the GATA2 and GATA3 loci than
between the GATA1 and either the GATA2 or GATA3 loci.
However, the conservation of syntenic paralogs between
GATA1 and either GATA2 or GATA3 strongly supports a
common evolutionary origin of all three from an ancestral
GATA123 paralogon. We therefore conclude that GATA2/3
and GATA1 intermediates were generated after the 1R
vertebrate genome duplication, and that one GATA1
ohnolog (chromosomally-duplicated paralog) was lost
after the 2R duplication.
In contrast to previous results [5] but consistent with
other more recent results [8], our molecular phylogenetic
analysis suggests a closer relationship between the GATA5
and 6 groups, to the exclusion of the GATA4 group. How-
ever, despite the differing outcomes in previous molecular
phylogenetic analyses, our synteny analysis is consistent
with our phylogenetic analysis, and further supports a
closer relationship between the GATA5 and 6 groups. We
conclude that the 1R genome duplication produced
GATA4 and GATA5/6 intermediates, and that an additional
GATA4 ohnolog was lost after the 2R genome duplication.
Additional gene duplications in teleosts
We also find evidence for additional teleost-specific
ohnologs in the GATA123 lineage, but not in the
GATA456 lineage, corresponding to an additional round
of genome duplication at the base of teleost fish. This evidence
stems from overlaying the species phylogeny
on the gene phylogeny (compare Figure 1 to Figure 3) to
find additional 3R duplicates and, more conclusively, from
the comparisons of duplicated paralogons. In all, four
additional teleost paralogs have been identified, and two
of these paralogs clearly result from larger chromosomal
duplications.
The topology of our molecular phylogenetic analysis (Fig-
ure 3a) indicates a teleost-wide duplication of the GATA2
gene into separate GATA2a and GATA2b genes, most
likely originating from the 3R teleost-specific genome
duplication event. Although the placement of zebrafish
GATA2b within the tree is slightly anomalous, possibly because
its long branch reflects a derived sequence, the presence
of conserved syntenic genes between the tetrapod GATA2
Figure 7. Evolution of GATA4/5/6 chromosomal regions. The evolutionary scenario describes losses and gains of paralogous
genes near GATA456 during two rounds of genome duplication. (a) The duplications of the 0R chordate GATA456 paralogon
are shown for the three GATA4, 5, and 6 paralogons (the GATA4b paralogon could not be identified). The reconstructed
GATA paralogon(s) for the vertebrate ancestor are shown after the 1R genome duplication (light grey box) and after the 2R genome
duplication (medium grey box). Paralogs in the 0R vertebrate genome can be strongly inferred when present in both the
GATA4/4-ogm paralogon and the GATA2/3 paralogon (represented by diamond); otherwise it is not clear if these genes were
translocated independently into the 1R paralogons. (b) Progression from the inferred 2R state of the last common vertebrate
ancestor (medium grey box) to the extant amniote or teleost state (dark grey box). Paralogous gene families include the
oxysterol binding like proteins (OSBPL1A/OSBPL2), laminins (LAMA3/LAMA4), Cdk5 and Abl enzyme substrates (CABLES1/
CABLES2), abhydrolase domain containing proteins (ABHD1/ABHD3), and sox transcription factors (SOX7/SOX18).
[Figure 7 graphic. Stage labels: Vertebrate-0R, Vertebrate 1R, Vertebrate 2R; Vertebrate Ancestor, Amniote Ancestor, Teleost Ancestor. Gene labels: GATA4/5/6, GATA5/6, GATA4, GATA4-ogm, GATA5, GATA5-ogm, GATA6, GATA6-ogm. Linked gene families: OSBPL1A/2, LAMA3/5, CABLES1/2, SOX7/18, ABHD1/3. Events are marked Loss or ?.]
paralogon and the teleost GATA2a and GATA2b paralogons
strongly suggests that these genes arose via a chromosomal
duplication.
The zebrafish genome also contains a second GATA1
gene that appears to originate from the 3R duplication.
Although this gene is found only in zebrafish, our phylogenetic
analysis suggests that it may have resulted
from the 3R duplication and was secondarily lost early in
the ancestor of all other teleosts. This view is consistent
with zebrafish being the most basal member of the fish
species represented in this analysis. It is also
supported by the presence of two identifiable GATA1
paralogons within each teleost fish genome (Figure 4,
Figure 6b, Additional File 4), although the second GATA1
gene is missing from these additional paralogons in all
teleosts except zebrafish.
Although we see no evidence for additional teleost
GATA456 ohnologs from a 3R round of genome duplica-
tion, the Tetraodon (green spotted pufferfish) genome
does contain two GATA5 paralogs. However, the topology
of our molecular phylogenetic analysis suggests a more
recent origin via a Tetraodon-specific gene duplication, as
opposed to a retained genome duplicate.
Our analysis of conserved synteny demonstrates the pres-
ence of additional duplicated paralogons, even when a
second GATA paralog is not identified (see Additional File
4 for the complete discussion). Therefore, our data based on
the comparative analysis of GATA paralogons in vertebrates
strongly support a third genome duplication event
(3R) at the base of teleost fish.
It is notable that so many (6/8) of these GATA transcription
factors were retained after the first two rounds of
genome duplication at the base of the vertebrate branch.
In comparison, a recent analysis from the cephalochor-
date genome could identify retention of genome dupli-
cated paralogs in only about one quarter of all human
gene families, with a much smaller fraction containing
multiple ohnologs [33]. Furthermore, only 2/6 ohnologs
(GATA1a/1b, zebrafish GATA2a/2b) were retained after
an additional teleost-specific whole genome duplication
event. Apparently, the integration and preservation of
GATA transcription factors into the gene regulatory net-
works was a more probable outcome after the two early
rounds of whole genome duplication (1R and 2R), and
less likely after the third round (3R). In addition, after
these early genome duplications at the base of the verte-
brate lineage, the GATA gene family has remained static in
most vertebrate species. After the 2R duplication event, all
of the examined tetrapods maintained exactly six GATA
transcription factor genes. After the teleost-specific 3R
genome duplication, only a single gain of a GATA5 duplicate
in Tetraodon and a loss of the GATA2b ohnolog in the
ancestor of the acanthopterygian fish have occurred.
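
The retention fractions quoted in this paragraph (6/8 after 1R and 2R, 2/6 after 3R) can be reproduced with a short arithmetic sketch. It assumes that each whole genome duplication doubles every gene, so that two ancestral genes yield eight potential copies after two rounds, and that the 2/6 figure counts how many of the six 2R genes kept both 3R copies. The function names are hypothetical and are not part of the original analysis.

```python
from fractions import Fraction


def potential_copies(ancestral_genes: int, wgd_rounds: int) -> int:
    """Copies expected if every duplicate survived each whole genome duplication."""
    return ancestral_genes * 2 ** wgd_rounds


def retention(retained: int, potential: int) -> Fraction:
    """Fraction of potential copies (or duplicate pairs) actually retained."""
    return Fraction(retained, potential)


# 1R + 2R: two ancestral genes (GATA123 and GATA456), two rounds of duplication.
potential_2r = potential_copies(ancestral_genes=2, wgd_rounds=2)
retained_2r = retention(retained=6, potential=potential_2r)

# 3R (teleosts): six 2R genes could each have kept a duplicate; two did (assumed reading of 2/6).
retained_3r = retention(retained=2, potential=6)

print(f"Potential copies after 2R: {potential_2r}")
print(f"Retention after 1R and 2R: {retained_2r} ({float(retained_2r):.0%})")
print(f"Retention after 3R: {retained_3r} ({float(retained_3r):.0%})")
```

Under these assumptions, retention after the first two rounds (three quarters) is more than twice the retention after the teleost-specific round (one third), which is the contrast drawn in the text.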
The presence of two distinct GATA factor classes in basal
deuterostomes, and their subsequent expansions in verte-
brates, informs the understanding of studies indicating
functional redundancy within each GATA class. For example,
interfering with the function of all three GATA456
orthologs in Xenopus laevis embryos results in a much
more severe endoderm defect than does an inhibition of
the function of only one or two of them. Similarly, reduc-
ing the function of only one or two GATA456 paralogs
only partially blocks cardiac mesoderm induction in both
zebrafish and Xenopus [2,34,35]. The overlapping expres-
sion domains in the CNS for the GATA2 and 3 [36] and in
hematopoietic lineages for GATA123 orthologs [1] may
suggest that these GATA factors also have redundant func-
tions. Similarly, the GATA123 and GATA456 gene fami-
lies in nematodes are both highly redundant in their
requirements. The use of C. elegans GATA456 gene dupli-
cates at multiple nodes and levels in an endodermal gene
regulatory network provides the best understood
model so far for retention, cooption, and integration of
gene duplicates within a gene network over evolutionary
time [37].
While these gene expansions and functional redundancies
can complicate studies of GATA functions, both the hemichordate
and the cephalochordate possess only single copies
of each GATA factor class. Both of these basal
deuterostomes also exhibit many widely conserved mor-
phological features that are thought to resemble the
ancestral states for both deuterostomes and chordates,
respectively (reviewed in [38-41]). Thus these basal deu-
terostomes are appealing model organisms in which to
investigate conserved functions for the GATA123 and
GATA456 classes of developmental regulatory transcrip-
tion factors.
Conclusion
The above molecular phylogenetic analyses, as well as
comparisons of conserved intron/exon structure and
sequence motifs, demonstrate that the last common
ancestor to all deuterostomes had only two GATA factor
genes, one GATA123 and one GATA456 gene, within its
genome. These analyses confirm that the GATA family of
transcription factors has expanded via whole genome
duplications in vertebrates. During the 1R and 2R genome
duplication, this family expanded to three GATA123 and
three GATA456 genes that are conserved across verte-
brates. The 1R genome duplication gave rise to GATA1
and GATA2/3 paralogs, as well as GATA4 and GATA5/6
paralogs, while single GATA ohnologs were lost from the
GATA1 and GATA4 lineages after the 2R event. In addi-
tion, the teleost 3R genome duplication has resulted in 1