
Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments

Downloaded By: [USYB - Systematic Biology] At: 02:07 18 July 2007
Syst. Biol. 56(4):564–577, 2007
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150701472164
GERARD TALAVERA AND JOSE CASTRESANA
Department of Physiology and Molecular Biodiversity, Institute of Molecular Biology of Barcelona, CSIC, Jordi Girona 18, 08034 Barcelona, Spain;
E-mail: jcvagr@ibmb.csic.es (J.C.)
Abstract.—Alignment quality may have as much impact on phylogenetic reconstruction as the phylogenetic methods used.
Not only the alignment algorithm, but also the method used to deal with the most problematic alignment regions, may
have a critical effect on the final tree. Although some authors remove such problematic regions, either manually or using
automatic methods, in order to improve phylogenetic performance, others prefer to keep such regions to avoid losing any
information. Our aim in the present work was to examine whether phylogenetic reconstruction improves after alignment
cleaning or not. Using simulated protein alignments with gaps, we tested the relative performance in diverse phylogenetic
analyses of the whole alignments versus the alignments with problematic regions removed with our previously developed
Gblocks program. We also tested the performance of more or less stringent conditions in the selection of blocks. Alignments
constructed with different alignment methods (ClustalW, Mafft, and Probcons) were used to estimate phylogenetic trees by
maximum likelihood, neighbor joining, and parsimony. We show that, in most alignment conditions, and for alignments
that are not too short, removal of blocks leads to better trees. That is, despite losing some information, there is an increase
in the actual phylogenetic signal. Overall, the best trees are obtained by maximum-likelihood reconstruction of alignments
cleaned by Gblocks. In general, a relaxed selection of blocks is better for short alignments, whereas a stringent selection is more
adequate for longer ones. Finally, we show that cleaned alignments produce better topologies although, paradoxically, with
lower bootstrap. This indicates that divergent and problematic alignment regions may lead, when present, to apparently
better supported although, in fact, more biased topologies. [Bootstrap support; Gblocks; phylogeny; sequence alignment.]
Methods for the simultaneous generation of multiple
alignments and phylogenetic trees are actively being pur-
sued (Fleissner et al., 2005; Lunter et al., 2005; Redelings
and Suchard, 2005; Wheeler, 2001), but, at present, com-
mon practice of phylogenetic analysis requires, as a first
step, the generation of a multiple alignment of the se-
quences to be analyzed. It has been repeatedly shown
that the quality of the alignment may have an enor-
mous impact on the final phylogenetic tree (Kjer, 1995;
Morrison and Ellis, 1997; Ogden and Rosenberg, 2006;
Smythe et al., 2006; Xia et al., 2003). This is particularly
true when sequences compared are very divergent and
of different length, which makes necessary the introduc-
tion of gaps in the alignments.
Due to the computational requirements of optimal
algorithms for multiple sequence alignments, different
heuristic strategies have been proposed. The most widely
used approach has been the progressive method of align-
ment (Feng and Doolittle, 1987) that, together with en-
hancements related to the introduction of gap penalties,
was implemented in ClustalW (Thompson et al., 1994).
In progressive methods, an initial dendrogram gener-
ated from the pairwise comparisons of the sequences is
used to recursively build the multiple alignment, using
dynamic programming (Needleman and Wunsch, 1970)
in the last step. Dynamic programming is an exact algo-
rithm that assures the best possible alignments for given
gap penalties but, due to heavy computational require-
ments, it is only used for pairs of sequences or pairs of
clades of the dendrogram and not for the whole multi-
ple alignment. Several other heuristic multiple alignment
methods have been recently introduced. They include
T-Coffee (Notredame et al., 2000), Mafft (Katoh et al.,
2005; Katoh et al., 2002), Muscle (Edgar, 2004), Probcons
(Do et al., 2005), and Kalign (Lassmann and Sonnham-
mer, 2005), among others. All of them are based on the
progressive method but include several iterative refine-
ments to construct the final multiple alignment. The
latter methods have been shown to outperform purely
progressive methods in terms of alignment accuracy and,
some of them, even in computational time. However, it
has not been shown whether the greater alignment accu-
racy of more sophisticated methods leads to a significant
improvement in phylogenetic reconstruction.
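
The dynamic programming step mentioned above (Needleman and Wunsch, 1970) can be illustrated with a minimal sketch for a single pair of sequences. The scoring values (match, mismatch, and a linear gap penalty) and the function name are illustrative assumptions; they are not the substitution matrices or gap penalties used by ClustalW or by the other programs discussed here.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Illustrative scoring only: identity match/mismatch and a linear gap cost.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev[j - 1] + substitution  # align a[i-1] with b[j-1]
            up = prev[j] + gap                 # align a[i-1] with a gap
            left = curr[j - 1] + gap           # align b[j-1] with a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATTACA"))
    print(needleman_wunsch_score("AC", "AGC"))
```

The table of scores grows with the product of the two sequence lengths, which is manageable for a pair of sequences or profiles but not for an exact alignment of many sequences at once.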
Proteins have some regions that, due to their func-
tional or structural importance, are very well con-
served, whereas other regions evolve faster both in terms
of nucleotide substitutions and insertions or deletions
(Henikoff and Henikoff, 1994; Herrmann et al., 1996;
Pesole et al., 1992). That is, evolutionary rate heterogeneity
affects whole regions in addition to single positions.
This type of regional rate heterogeneity is very challeng-
ing for phylogenetic reconstruction, not only in terms of
homoplasy due to saturation (Yang, 1998), but also in
terms of errors in homology during alignment.
Dealing with regions of problematic alignment is a
matter of active debate in phylogenetics. Although some
authors consider that it is best to remove such regions
before the tree analysis (Castresana, 2000; Grundy and
Naylor, 1999; Löytynoja and Milinkovitch, 2001; Rodrigo
et al., 1994; Swofford et al., 1996), others think that there
is an important loss of information upon removal of any
fragment of the sequences already obtained (Aagesen,
2004; Lee, 2001) and that this practice should only be
used as the last resort (Gatesy et al., 1993). A third,
intermediate option is the recoding of such regions using
different strategies (Geiger, 2002; Lutzoni et al., 2000;
Young and Healy, 2003), which allows the use of at least
part of the information. Although these coded charac-
ters are most commonly analyzed with parsimony, it is
also possible to use them as independent partitions in
Bayesian or likelihood frameworks.
In the present work we test, by using simulated pro-
tein alignments with gaps, which are the best alignment
strategies for optimal phylogenetic reconstruction. Two
preliminary considerations are necessary here. First, sim-
ulations of sequences may not cover all the complexity
of evolution but have the advantage over real sequences
that we know the tree from which they have been gener-
ated. There are some alignment sets curated from struc-
tural information that can be used to test alignment
accuracy (Thompson et al., 2005), but the phylogenetic
tree is unknown in these sets, thus making problem-
atic their use for proving phylogenetic accuracy. Second,
we have been working with simulated sequences that
try to reflect the evolutionary patterns of proteins, and
thus many of the conclusions extracted from our work
cannot be directly extrapolated to other markers such
as rRNA, which show very different evolutionary con-
straints (Gutell et al., 1994; Kjer, 1995; Xia et al., 2003).
In our analysis we used different alignment strategies
of the simulated sequences to test if they make any dif-
ference in the final phylogenetic tree. We have selected
ClustalW as the currently most used progressive align-
ment method (Thompson et al., 1994) and Mafft (Katoh
et al., 2005) and Probcons (Do et al., 2005) as examples of
more recently developed methods that have been shown
to obtain very high scores in terms of alignment accuracy
(Blackshields et al., 2006; Nuin et al., 2006). Simultane-
ously with the performance of the alignment programs,
we tested whether removing blocks of problematic align-
ment actually leads to more accurate trees. We used for
this purpose our previously developed Gblocks program
(Castresana, 2000), which selects blocks following a re-
producible set of conditions. Briefly, selected blocks must
be free from large segments of contiguous nonconserved
positions, and flanking positions must be highly con-
served to ensure alignment accuracy. Several parameters
can be modified to make the selection of blocks more
or less stringent. Phylogenetic trees made by maximum
likelihood (ML), neighbor joining (NJ), and parsimony
of the reconstructed alignments show that, in almost all
conditions tested, and at least for alignments that are
not too short, the elimination of problematic regions by
Gblocks leads to significantly better phylogenetic trees.
MATERIALS AND METHODS
We simulated protein sequences by means of Rose
(Stoye et al., 1998). This program allows the simula-
tion of different substitution rates in different positions
with a predetermined spatial pattern. This is a very im-
portant feature for testing the behavior of a program
like Gblocks, which selects from alignments blocks of
contiguous conserved positions with few nonconserved
positions inside. This is the reason why a program that
simulates among-site rate heterogeneity, but not regional
heterogeneity, would not be valid to test the behavior
of Gblocks. Thus, an important preliminary step in our
simulations was the selection from real proteins of spa-
tial patterns of site rates in order to use these parameters
with Rose.
Selection of Evolutionary Rate Patterns
We extracted patterns of rate heterogeneity from
real protein alignments using the program TreePuzzle
(Strimmer and von Haeseler, 1996) with a model of
among-site rate heterogeneity that assumed a Gamma
distribution of rates. This distribution was approximated
with 16 rate categories, which is the maximum number
allowed in TreePuzzle. In particular, we took, from each
position, the category and associated relative rate that
contributed the most to the likelihood. Positions with
rates >1 receive more mutations than the average and positions
with rates <1 receive fewer mutations. This list of
relative rates (whose average should be 1) was given to
Rose to simulate different positions with different rates,
creating conserved and divergent regions with lengths
and boundaries that approximated those of a real pro-
tein. Proteins for extracting rate patterns were NAD2 and
NAD4 (subunits 2 and 4 of the mitochondrial NADH de-
hydrogenase) from several metazoans (Castresana et al.,
1998b), and COG0285 from the COG database, which in-
cludes mainly bacterial sequences (Tatusov et al., 2003).
The three selected profiles produced similar conclusions
regarding the best block selection strategy, and we used
the NAD2 pattern to perform most of the tests. This
pattern contained 361 positions but, after the introduc-
tion of further gaps by the simulation algorithm, the
final simulated alignments reached approximately 400
positions. In order to simulate alignments of different
length, independent simulations obtained with this pat-
tern were concatenated 1, 2, 3, 4, and 8 times to generate
final alignments of, approximately, 400, 800, 1200, 1600,
and 3200 positions, respectively. The PAM evolutionary
model (Dayhoff et al., 1978) was used to simulate the
evolution of amino acids.
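
The per-position selection of the rate category that contributes most to the likelihood can be sketched as follows. This is only a sketch under stated assumptions: the function name, the input layout (one list of category contributions per position), and the toy values are hypothetical, and the code does not parse TreePuzzle output.

```python
def most_likely_rates(site_category_weights, category_rates):
    """For each position, return the relative rate of the category with the
    largest contribution to the likelihood at that position."""
    rates = []
    for weights in site_category_weights:
        best = max(range(len(weights)), key=weights.__getitem__)
        rates.append(category_rates[best])
    return rates


if __name__ == "__main__":
    # Toy example with 3 categories (the study used 16) and 3 positions.
    category_rates = [0.1, 1.0, 2.5]
    site_category_weights = [
        [0.7, 0.2, 0.1],  # conserved position
        [0.1, 0.3, 0.6],  # divergent position
        [0.2, 0.5, 0.3],  # average position
    ]
    profile = most_likely_rates(site_category_weights, category_rates)
    print(profile, sum(profile) / len(profile))
```

The resulting list plays the role of the spatial rate profile: runs of low values correspond to conserved regions and runs of high values to divergent regions.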
Selection of Phylogenetic Trees
Simulations with Rose were performed along phylo-
genetic trees of 16 tips with three different topologies,
a purely asymmetric tree (Fig. 1a), an intermediate tree
(Fig. 1b), and a symmetric tree (Fig. 1c). These known
trees or “real trees” were manually constructed. The av-
erage and maximum length from the root to the tips
was, for the asymmetric tree, 0.89 and 1.30 substitu-
tions/position, respectively. The other trees had very
similar values. The branch lengths of each of the three trees in
Figure 1 were multiplied by factors of 0.5, 1, and 2,
so that we used 9 phylogenetic trees in total.
These trees had several short internal branches that made
them difficult to resolve; thus, they are trees where the
alignment strategy as well as the phylogenetic algorithm
used were differentially effective. Simpler trees in terms
of longer internodes were easily and equally reproduced
by all methods and were not used here. Similarly, trees
with a total smaller divergence tended to produce con-
served alignments where the alignment method was not
an issue and were also not used here. Finally, these trees did
FIGURE 1. Asymmetric (a), intermediate (b), and symmetric (c) trees used in the simulations. The scale bar, in substitutions/position,
corresponds to the trees with a divergence ×1.
not contain many closely related sequences, since we
wanted to specifically measure differences in reproduc-
ing the overall shape of the tree and not differences in
recovering the relationships among close sequences.
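
The combination of three topologies with three divergence factors can be sketched by rescaling the branch lengths of Newick strings. The four-taxon strings below are hypothetical placeholders, not the 16-tip trees of Figure 1, and the helper name is an assumption made for illustration.

```python
import re

# Matches a branch length that follows a colon in a Newick string.
BRANCH_LENGTH = re.compile(r":([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def scale_newick(newick, factor):
    """Multiply every branch length in a Newick string by factor."""
    def rescale(match):
        return ":" + format(float(match.group(1)) * factor, "g")
    return BRANCH_LENGTH.sub(rescale, newick)


if __name__ == "__main__":
    # Hypothetical toy topologies standing in for the trees of Figure 1.
    topologies = {
        "asymmetric": "(((A:0.1,B:0.1):0.05,C:0.2):0.05,D:0.3);",
        "symmetric": "((A:0.1,B:0.1):0.05,(C:0.1,D:0.1):0.05);",
    }
    factors = [0.5, 1, 2]
    trees = {(name, f): scale_newick(nwk, f)
             for name, nwk in topologies.items() for f in factors}
    for key, tree in trees.items():
        print(key, tree)
```

With three topologies instead of the two toy ones, the same dictionary would hold the nine trees used in the simulations.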
Gaps Introduced during the Simulations
The Rose program does not have any specific model
for the introduction of gaps along the alignment. Rather,
gaps are introduced with equal probability in all posi-
tions with a relative rate 1 (Stoye et al., 1998), which
is a limitation of this program. To try to overcome this
limitation, we used two different gap strategies within
Rose. First, we used a single gap threshold for the whole
alignment. After several trials, we considered a thresh-
old of 0.0007 as a reasonable one for the divergence
levels we analyzed, as deduced from visual inspection
of the alignment (that is, eyeing that blocks of diver-
gence and conservation were not so different from the
real proteins used to construct the rate profiles). Even so,
this threshold tended to produce too many gaps in con-
served regions (not shown). In addition, we also gener-
ated alignments with two different gap thresholds, 0.001
and 0.0001, which we associated, respectively, with divergent
and conserved regions of the profiles. To do
so, we divided the rate profiles into blocks of homogeneous
divergence (that is, each block was either mostly
conserved or mostly divergent, which resulted in around
10 to 20 blocks for the different profiles). Then, we did
the simulations for each block separately, and with its
own gap threshold (high for divergent blocks and low for
more conserved blocks). Finally, the different simulated
blocks were concatenated. The phylogenetic results were
similar with both gap strategies, but we mostly worked
with simulations that had the two different gap thresh-
olds, which we considered more realistic. In all cases we
chose a vector of indels of the form [0.5, 0.4, 0.3, 0.2,
0.1], which reflects the relative frequency of indels with
lengths from 1 to 5 amino acids, respectively.
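
Read as relative frequencies, the indel vector does not sum to 1, so it is convenient to normalize it before interpreting it. The sketch below only illustrates that reading of the vector; it is not a description of how Rose handles the vector internally, and the function names and the fixed random seed are assumptions made for the example.

```python
import random

# Relative frequencies of indels of length 1 to 5, as given in the text.
INDEL_WEIGHTS = [0.5, 0.4, 0.3, 0.2, 0.1]


def indel_length_probabilities(weights=INDEL_WEIGHTS):
    """Normalize relative frequencies so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def mean_indel_length(weights=INDEL_WEIGHTS):
    """Expected indel length implied by the relative frequencies."""
    probs = indel_length_probabilities(weights)
    return sum(length * p for length, p in enumerate(probs, start=1))


def sample_indel_lengths(n, weights=INDEL_WEIGHTS, seed=0):
    """Draw n indel lengths according to the relative frequencies."""
    rng = random.Random(seed)
    lengths = range(1, len(weights) + 1)
    return rng.choices(lengths, weights=weights, k=n)


if __name__ == "__main__":
    print(indel_length_probabilities())
    print(mean_indel_length())
    print(sample_indel_lengths(10))
```

Under this reading, one third of the indels have length 1 and the expected indel length is 3.5/1.5, that is, about 2.3 amino acids.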
Realignments of Simulated Sequences
Alignments generated by Rose were cleaned from gaps
and new alignments were reconstructed using ClustalW
version 1.83 (Thompson et al., 1994), Mafft version 5.531
(Katoh et al., 2002, 2005), and Probcons version 1.1 (Do
et al., 2005). Default parameters were used in ClustalW
and Probcons. All defaults were also used in Mafft ex-
cept that a neighbor joining instead of a UPGMA tree was
used as guide tree (option –nj). Alignments were cleaned
from problematic alignment blocks using Gblocks 0.91
(Castresana, 2000), for which two different parameter
sets were used. In one of them, which we call here strin-
gent selection, and which is the default one in Gblocks
0.91, “Minimum Number of Sequences for a Conserved
Position” was 9, “Minimum Number of Sequences for a
Flank Position” was 13, “Maximum Number of Contigu-
ous Nonconserved Positions” was 8, “Minimum Length
of a Block” was 10, and “Allowed Gap Positions” was
“None”. In the second set, which we call relaxed selec-
tion, we changed “Minimum Number of Sequences for
a Flank Position” to 9, “Maximum Number of Contigu-
ous Nonconserved Positions” to 10, “Minimum Length
of a Block” to 5, and “Allowed Gap Positions” to “With
Half”. The latter option allows the selection of positions
with gaps when they are present in less than half of the
sequences.
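
The difference between the two settings of "Allowed Gap Positions" can be illustrated with a simplified column filter. This sketch covers only the gap rule described above; it does not implement the conservation, flank, or block-length conditions, and it is not the Gblocks code. The function name and the toy alignment are hypothetical.

```python
def columns_passing_gap_rule(alignment, allowed_gaps="none"):
    """Return the indices of columns that satisfy the gap rule.

    allowed_gaps="none": columns with any gap are rejected.
    allowed_gaps="half": columns are kept when gaps are present in
    less than half of the sequences.
    """
    n_seqs = len(alignment)
    kept = []
    for col in range(len(alignment[0])):
        n_gaps = sum(1 for seq in alignment if seq[col] == "-")
        if allowed_gaps == "none":
            passes = n_gaps == 0
        elif allowed_gaps == "half":
            passes = n_gaps < n_seqs / 2
        else:
            raise ValueError("allowed_gaps must be 'none' or 'half'")
        if passes:
            kept.append(col)
    return kept


if __name__ == "__main__":
    toy_alignment = ["AC-G", "A--G", "ACTG", "AC-G"]
    print(columns_passing_gap_rule(toy_alignment, "none"))
    print(columns_passing_gap_rule(toy_alignment, "half"))
```

In the toy alignment, the second column has a gap in one of four sequences, so it is rejected under the stringent rule but retained under the relaxed one.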
Original simulated alignments and Mafft realignments
for 30 example simulations (the first five simulations gen-
erated with the symmetric and asymmetric trees) are pro-
vided as supplementary information (available online at
http://systematicbiology.org).
Phylogenetic Reconstruction
Phylogenetic trees from the complete and the two dif-
ferent Gblocks alignments were estimated by ML, NJ,
and parsimony. For ML trees we used the Phyml pro-
gram version 2.4.4 (Guindon and Gascuel, 2003), with
the Jones-Taylor-Thornton model of protein evolution
(Jones et al., 1992) and four rate categories in the Gamma
distribution. The Gamma distribution parameter and
the proportion of invariable sites were estimated by the
program. For NJ trees we used Protdist of the Phylip
package version 3.63 (Felsenstein, 1989) with the Jones-
Taylor-Thornton model to calculate pairwise protein dis-
tances, and Neighbor of the same package to calculate the
NJ tree. For parsimony we used Protpars of the Phylip
package (Felsenstein, 1989) with 50 random initializa-
tions to ensure a thorough tree search. If no parsimony
tree was obtained, which occurred in less than 1% of the
simulations, the corresponding simulation was totally
excluded from the analysis. When several equally parsi-
monious trees were found, only the first one was used.
We did not estimate Bayesian trees because of the enormous
computational time required to run a sufficient number
of generations for all the simulations performed.
For each alignment length, alignment strategy, and
phylogenetic method, 300 simulations were run in a grid
of 24 processors. The symmetric difference or Robinson-
Foulds (Robinson and Foulds, 1981) topological distance
from the calculated tree to the real tree was obtained us-
ing Vanilla 1.2 (Drummond and Strimmer, 2001), and the
average over all simulations was calculated. This program reports
half the number of total discordant clades between
two trees. For bootstrap analyses, 100 bootstraps were
calculated. Due to heavy computational requirements of
the bootstrap analyses, the number of simulations was
reduced to 150. We checked that a higher number of boot-
straps and simulations did not improve the accuracy of
the bootstrap results. Bootstrap values were separately
calculated for right and wrong partitions of the tree with
the help of Bioperl functions (Stajich et al., 2002). Statisti-
cal differences among Robinson-Foulds distances in dif-
ferent alignment conditions were detected by the Tukey-
Kramer test with an alpha level of 0.05 using the JMP
package version 5.1 (SAS Institute, Cary, NC).
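
The topological distance used here, half the number of discordant clades between two trees, can be sketched with trees written as nested tuples of tip names. This is an illustration under stated assumptions, not the Vanilla implementation: the tree representation, the function names, and the six-tip example trees are hypothetical, and both trees are assumed to share the same set of tips.

```python
def leaf_set(tree):
    """Return the set of tip names below a node (a tip is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Return the nontrivial bipartitions of the unrooted tree.

    Each split is stored as the side that does not contain a fixed
    reference tip, so that both trees use the same orientation.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if reference in below else below
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def half_rf_distance(tree1, tree2):
    """Half the symmetric difference between the two sets of splits."""
    return len(nontrivial_splits(tree1) ^ nontrivial_splits(tree2)) / 2


if __name__ == "__main__":
    real_tree = ((("A", "B"), ("C", "D")), ("E", "F"))
    estimated = ((("A", "C"), ("B", "D")), ("E", "F"))
    print(half_rf_distance(real_tree, real_tree))
    print(half_rf_distance(real_tree, estimated))
```

For fully resolved trees of 16 tips, each tree has 13 nontrivial splits, so the halved distance ranges from 0 (identical topologies) to 13 (no split in common).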
RESULTS AND DISCUSSION
General Alignment Strategy: Complete versus
Gblocks Alignments
The differences in alignments produced by different
methods can be appreciated in Figure 2. A fragment
of the alignment of simulated sequences (Fig. 2a) was
stripped of gaps and realigned by ClustalW (Fig. 2b),
Mafft (Fig. 2c), and Probcons (Fig. 2d). As has been
noted before (Higgins et al., 2005), ClustalW tends to
produce more compact alignments. That is, ClustalW
generates many divergent regions that are almost de-
void of gaps, resulting in a relatively simple alignment
(Higgins et al., 2005). This can be clearly appreciated in
the most problematic region in the center of this align-
ment (Fig. 2b). Although Mafft also tends to make align-
ments more compact than the real ones (Fig. 2c), the
deviation from the real situation is not as large as with
ClustalW, at least with default gap penalties.

TABLE 1. Average number of positions of the complete alignments and the average percentage of positions selected by Gblocks with relaxed
and stringent conditions. Simulation of sequences was done following the asymmetric tree and the heterogeneity pattern of the NAD2 protein
concatenated two times.

            ClustalW                   Mafft                      Probcons
            Total   % Gblocks          Total   % Gblocks          Total   % Gblocks
Divergence  length  relaxed stringent  length  relaxed stringent  length  relaxed stringent
×0.5        826.6   79.4    54.3       852.5   74.2    51.6       871.8   70.3    50.9
×1          862.4   64.2    42.0       903.7   59.0    39.8       966.4   51.8    37.6
×2          901.8   46.4    30.2       961.7   42.9    28.4       1117.9  34.7    24.5

Probcons produces the least compact alignments of the three programs
tested (Fig. 2d). For example, simulations from
asymmetric trees with divergence ×1, which had an av-
erage original length of 1097 positions, were compacted
to an average of 966 positions by Probcons, to 904 posi-
tions by Mafft and to 862 positions by ClustalW (Table 1).
Similar relative degrees of compression were obtained in
other types of simulations.
Gblocks removes problematic regions of a multiple
alignment according to a number of rules. First, blocks
selected for inclusion must be free from a large number
of contiguous nonconserved positions, must be flanked
by highly conserved positions, and must have a mini-
mum length, as controlled by the corresponding param-
eters (see Materials and Methods). In addition, positions
with gaps can be removed either always or only when
more than half of the sequences contain gaps (Castre-
sana, 2000). The latter parameter has a large influence
on the total number of selected positions. We have used
Gblocks in simulated realigned sequences with two dif-
ferent conditions. The condition that we call stringent
does not allow any gap position. The relaxed condition
allows gap positions if they are present in less than half
of the sequences, and it is also less restrictive in the other
parameters (see Materials and Methods). The effect of
the two different parameter sets of Gblocks selection can
be appreciated in Figure 2, for ClustalW (Fig. 2b), Mafft
(Fig. 2c), and Probcons alignments (Fig. 2d). In all three cases,
the relaxed parameters (grey blocks) allow the selection
of more positions than the stringent parameters (white
blocks). Table 1 shows the average number of positions of
the complete alignments and the percentage of positions
left after treatment with Gblocks with the two different
parameter sets. Values in this table are for the asymmetric
tree, but similar values were found for other trees.
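
To give a sense of scale, the percentages in Table 1 can be converted into approximate numbers of retained positions. The sketch below simply multiplies the average total length by the average percentage, which only approximates the average number of positions kept per simulation; the dictionary layout and the function name are assumptions made for the example.

```python
# Values copied from Table 1: (total length, % relaxed, % stringent).
TABLE1 = {
    "ClustalW": {0.5: (826.6, 79.4, 54.3), 1: (862.4, 64.2, 42.0), 2: (901.8, 46.4, 30.2)},
    "Mafft": {0.5: (852.5, 74.2, 51.6), 1: (903.7, 59.0, 39.8), 2: (961.7, 42.9, 28.4)},
    "Probcons": {0.5: (871.8, 70.3, 50.9), 1: (966.4, 51.8, 37.6), 2: (1117.9, 34.7, 24.5)},
}


def approx_positions_kept(program, divergence):
    """Approximate positions kept under (relaxed, stringent) selection."""
    total, relaxed_pct, stringent_pct = TABLE1[program][divergence]
    return total * relaxed_pct / 100, total * stringent_pct / 100


if __name__ == "__main__":
    for program in TABLE1:
        for divergence in (0.5, 1, 2):
            relaxed, stringent = approx_positions_kept(program, divergence)
            print(f"{program} x{divergence}: {relaxed:.0f} relaxed, {stringent:.0f} stringent")
```

For example, at divergence ×1 the ClustalW alignments retain roughly 554 positions under the relaxed selection and roughly 362 under the stringent one.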
In order to infer which type of alignment algorithm
(ClustalW, Mafft, or Probcons) and which treatment of
the resulting alignment (no treatment or Gblocks treat-
ment with stringent or relaxed conditions) was best for
phylogenetic analysis, we calculated phylogenetic trees
from all these alignments, and measured the topologi-
cal distance with respect to the real tree. Figure 3 shows,
for the simulations with the asymmetric tree, the aver-
age topological distances to the real tree from the trees
generated with ClustalW alignments, with and with-
out the use of Gblocks. In addition, the distance to the
tree obtained from the Gblocks complementary align-
ment (that is, the alignment resulting after concatena-
tion of all the blocks rejected by Gblocks) is also shown.
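
The notion of a complementary alignment can be made concrete with a short sketch that splits an alignment into the selected columns and the concatenation of the rejected ones. The function name, the toy alignment, and the list of selected columns are hypothetical; in practice the selection would come from Gblocks.

```python
def split_alignment(alignment, selected_columns):
    """Return (selected, complementary) alignments.

    The complementary alignment is the concatenation, in their original
    order, of all columns that are not in selected_columns.
    """
    selected = set(selected_columns)
    n_cols = len(alignment[0])
    kept = ["".join(seq[i] for i in range(n_cols) if i in selected)
            for seq in alignment]
    rejected = ["".join(seq[i] for i in range(n_cols) if i not in selected)
                for seq in alignment]
    return kept, rejected


if __name__ == "__main__":
    toy_alignment = ["ACDEF", "ACD-F"]
    kept, complementary = split_alignment(toy_alignment, [0, 1, 4])
    print(kept)
    print(complementary)
```

Every column ends up in exactly one of the two alignments, so a tree estimated from the complementary alignment reflects only the regions that were discarded.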
[Figure 2. Fragment of the alignment of simulated sequences (a) and its realignment by ClustalW (b), Mafft (c), and Probcons (d).]
SDCTRSGKVQQYFTAQYMSQ--GKICSLIPDCLKVKFTSCLD YKSFNVSAAACGDP-----GTCYAARAWFF
QFKLS---V--GLDGNAGSAYEQASPANE GYKSERFTVQWK-YKARDRATIQHHWSVKVYRRRTT
DDCTREGRVEQYFSANYRSS--GILYSLILVCLQVKFTACINFKSFSCSPASCGTP-----SLCYADKNWF YQF--K---L--SVEGNGGSNFPQVDPANDGYKTD RFTVQWV-YKARDRASIKHHWSVDTYREGSC
d)
FIGURE 2. Fragment of a simulated alignment (a) and the realignment of the same sequences (after gap removal) by ClustalW (b), Mafft
(c), and Probcons (d). The simulation corresponds to an asymmetric tree with divergence ×1. The blocks below each alignment represent the
fragments selected by Gblocks with relaxed conditions (grey blocks) and with stringent conditions (white blocks). Positions of the alignments
where more than 50% of the sequences are identical are shown with black boxes.
Figure 4 shows, for each tree and for two representative lengths (800 and 3200 amino acids, standing for single-gene and concatenated-gene phylogenies, respectively), the best alignment strategies after statistically comparing the average topological distances by means of the Tukey-Kramer test. An overview of these two figures shows that, when the alignments are cleaned by Gblocks with either of the two parameter sets used (dotted lines in Figure 3), the topological distance to the real tree decreases with respect to the complete alignment (solid, red line) in almost all divergences and alignment lengths tested, and with the three tree reconstruction methods used: ML, NJ, and parsimony. The improvement in topological accuracy upon Gblocks treatment is more noticeable for the highest divergences (×2). This is expected since there are more problematic blocks in these alignments, as shown by the lower percentage of positions selected by Gblocks (Table 1). In addition, the improvement from Gblocks treatment is particularly large for NJ and parsimony. These two methods produce quite poor topologies when using the complete alignments but, upon using Gblocks, particularly with the most stringent conditions (green line, squared symbols), there is a substantial gain in topological accuracy. ML produces the overall best trees (see also below) although, in the lowest divergence (×0.5), there is almost no difference in topological quality between the Gblocks and the complete alignments. In fact, for short genes (400 to 800 amino acids) the complete alignment gives rise to better trees than the Gblocks alignments, although there is no statistical difference between the complete alignment and the Gblocks alignment with relaxed parameters (Fig. 4).
Downloaded By: [USYB - Systematic Biology] At: 02:07 18 July 2007
2007 TALAVERA AND CASTRESANA—IMPROVEMENT OF PHYLOGENIES AFTER REMOVING BLOCKS 569
FIGURE 3. Average Robinson-Foulds distances to the real tree from the tree calculated with ClustalW complete alignments (solid, red line with crossed symbols), the same alignments after treatment with Gblocks relaxed (dotted, blue line with diamonds) and stringent (dotted, green line with squared symbols) conditions, and the complementary alignments of the Gblocks relaxed alignment (solid, orange line with triangles). The asymmetric tree with three different divergence levels was used for the simulations with different alignment lengths. Trees were reconstructed by ML, NJ, and parsimony.
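The Robinson-Foulds distance used here as the measure of topological accuracy can be illustrated with a short Python sketch. It assumes unrooted trees written as nested tuples of leaf names (a hypothetical representation chosen for brevity, not the format used in this study) and counts the bipartitions present in one tree but not in the other. The function names are illustrative only, and some programs report a halved or normalized version of this count.

```python
from itertools import chain


def leaf_set(tree):
    """Return the frozenset of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset(chain.from_iterable(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits of a tree, treating it as unrooted."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed leaf used to normalise each split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        other = all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if len(clade) > 1 and len(other) > 1:
            # Store the side that does not contain the anchor leaf, so the
            # same split is recorded identically whatever the rooting.
            splits.add(other if anchor in clade else clade)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


real_tree = ((("A", "B"), "C"), ("D", "E"))
estimated = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(real_tree, estimated))
```

A value of 0 means that the two trees share all their internal bipartitions, so lower average distances to the real tree indicate better reconstructions.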
570 SYSTEMATIC BIOLOGY VOL. 56
FIGURE 4. ClustalW alignment strategies that give rise to the statistically best topologies. When two or more strategies do not show statistical
differences in Robinson-Foulds distances, all equivalent strategies are represented. The complete alignment is represented by a black block, and
the relaxed and stringent Gblocks strategies by grey and white blocks, respectively.
The example above thus shows that the removal of divergent and problematic regions of an alignment is, in principle, beneficial for phylogenetic analyses of relatively divergent sequences. It is true, as previously argued (Aagesen, 2004; Lee, 2001), that there is some phylogenetic information in the blocks removed by methods like Gblocks. This can be appreciated in Figure 3, which shows the topological distances to the real trees from the trees obtained with the blocks excluded by Gblocks (complementary alignment; solid, orange line). These distances, although very large, become considerably smaller for long alignments, indicating that trees obtained from the complementary regions are not random; that is, there is some phylogenetic information in the regions rejected by Gblocks. However, what seems to matter is not the total phylogenetic signal but the signal-to-noise ratio. Despite the relatively simple simulations performed, regions excluded by Gblocks seem to add more noise than signal, thus lowering the quality of the trees from the complete alignments with respect to the Gblocks-cleaned alignments.
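To make the idea of discarding low-signal columns concrete, the following Python sketch applies a deliberately simplified column filter. It is not the Gblocks algorithm, which applies further criteria (for example on flanking positions and on runs of nonconserved positions). The thresholds `min_conserved`, `max_gap_fraction`, and `min_block_length` are hypothetical placeholders, not the relaxed or stringent settings used in this work.

```python
from collections import Counter


def select_blocks(alignment, min_conserved=0.5, max_gap_fraction=0.0,
                  min_block_length=5):
    """Return half-open (start, end) column ranges kept by a simple filter.

    A column passes when its gap fraction is low enough and its most common
    residue is frequent enough; runs of passing columns shorter than
    min_block_length are discarded.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    keep = []
    for col in range(n_cols):
        column = [seq[col] for seq in alignment]
        gaps = column.count("-")
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        keep.append(gaps / n_seqs <= max_gap_fraction
                    and top / n_seqs >= min_conserved)

    blocks = []
    start = None
    # The trailing False closes a block that reaches the last column.
    for col, ok in enumerate(keep + [False]):
        if ok and start is None:
            start = col
        elif not ok and start is not None:
            if col - start >= min_block_length:
                blocks.append((start, col))
            start = None
    return blocks


def extract(alignment, blocks):
    """Concatenate the selected column ranges for every sequence."""
    return ["".join(seq[s:e] for s, e in blocks) for seq in alignment]


toy = ["ACDEFGHIKL", "ACDEF-HIKL", "ACDEFGHIKM", "ACDEFGHIKL"]
strict = select_blocks(toy, max_gap_fraction=0.0, min_block_length=5)
loose = select_blocks(toy, max_gap_fraction=0.5, min_block_length=5)
print(strict, loose)
```

In this toy case the stricter setting drops the gapped column and the short block after it, whereas the looser setting keeps the whole alignment. This mirrors the trade-off between total information and signal-to-noise ratio discussed above.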
Similar conclusions about the beneficial effect of Gblocks can be drawn from Mafft alignments of the same asymmetric trees (Figs. 5 and 6). In this case, Gblocks is not an advantage over the complete alignment in the two most conserved alignments (×0.5 and ×1) when using the ML method although, again, Gblocks relaxed and the complete alignments are not statistically different. The picture for Probcons (Fig. 1 of the online Appendix, available at http://systematicbiology.org) is similar to that for Mafft. Figure 2 of the online Appendix shows a comparison of the three alignment programs with default gap costs, using the trees produced after Gblocks cleaning with relaxed conditions. Under the conditions of these simulations, ClustalW is slightly worse, regarding the trees produced, than the other two programs. The performances of Mafft and Probcons are very similar, and only for NJ and parsimony do Probcons alignments work slightly better. Probcons, however, is highly demanding in computational time. Thus, for the rest of the tests we only compared the performances of ClustalW and Mafft.
FIGURE 5. Average Robinson-Foulds distances to the real tree from the tree calculated with Mafft complete alignments (solid, red line with
crossed symbols), the same alignments after treatment with Gblocks relaxed (dotted, blue line with diamonds) and stringent (dotted, green line
with squared symbols) conditions, and the complementary alignments of the Gblocks relaxed alignment (solid, orange line with triangles). The
asymmetric tree with three different divergence levels was used for the simulations with different alignment lengths. Trees were reconstructed
by ML, NJ, and parsimony.
The results for the symmetric and intermediate trees of both alignment algorithms are shown in the corresponding columns of Figures 4 and 6 for the ClustalW and Mafft methods, respectively (and in Figures 3 to 6 in the online Appendix for all alignment lengths). Two results are noteworthy from these analyses. First, differences in phylogenetic performance between different alignments derived from symmetric trees are quantitatively smaller, in agreement with previous work (Ogden and Rosenberg, 2006). See, for example, the similarity of the three graphs of ML trees of ClustalW alignments (Fig. 3 in the online Appendix). Second, in these trees there are two conditions where the Gblocks alignments produce ML trees that are statistically worse than the complete alignments: the symmetric and intermediate trees of divergence ×1 with Mafft alignments of 800 amino acids (Fig. 6). These are the only two conditions where we observed this. However, we do not think that this justifies not using Gblocks in these types of trees, even if we could know the shape of the tree in advance. In real alignments, evolution must be much more complex than what we simulated. For example, we did not simulate biased amino acid compositions (Castresana et al., 1998a) or different models of evolution in different parts of trees (Philippe and Laurent, 1998), all of which will have stronger biasing effects in nonconserved blocks. Because the difference in topological accuracy between the Gblocks and the complete alignments is very small in these two conditions, it is very likely that the addition of any of these effects in the simulations would have made the Gblocks relaxed alignments perform at least as well as the complete alignments.
FIGURE 6. Mafft alignment strategies that give rise to the statistically best topologies. When two or more strategies do not show statistical differences in Robinson-Foulds distances, all equivalent strategies are represented. The complete alignment is represented by a black block, and the relaxed and stringent Gblocks strategies by grey and white blocks, respectively.
All simulations shown so far were performed following the pattern of rate variation of the NAD2 protein. To test the influence of different rate patterns, we used in the simulations profiles derived from two other proteins (NAD4 and COG0285). From the Mafft alignments of these simulations we calculated the corresponding ML trees (Fig. 7 in the online Appendix). Different patterns (and thus different percentages of block selection) gave rise to different performances of the complete and the Gblocks alignments, but the results were similar in relative terms. We also tested the performance of a different gap model, in which gaps were introduced homogeneously along the alignment, instead of using two different gap thresholds in different regions of the alignments (see Materials and Methods). The results were again similar with the simpler gap strategy, as shown for the ML reconstruction of the asymmetric trees (Fig. 8 of the online Appendix).
Phylogenetic Methods Used
The data shown above indicate that ML is the phylogenetic method that best extracts reliable information from problematic alignment regions, since trees derived from complete alignments are relatively good. This contrasts with the trees obtained by NJ and parsimony, which are quite poor from the complete alignments, indicating that these methods benefit greatly from the use of Gblocks. ML is also the method that produces the overall best trees, in agreement with previous simulation analyses (see references in Felsenstein, 2004). To show this, Figure 7 presents the superimposed graphs for the most divergent asymmetric tree as an example. The better performance of ML in all alignment conditions is clearly appreciated in this graph.
FIGURE 7. Average Robinson-Foulds distances to the real tree from the tree calculated with Mafft complete (solid line, solid symbols) and ClustalW complete alignments (solid line, empty symbols). The tree distances obtained with the same alignments after treatment with Gblocks with relaxed conditions (dotted lines) are also shown. Trees were reconstructed by ML (circles), NJ (squares), and parsimony (triangles). The most divergent asymmetric tree was used for the simulations.
Short versus Long Alignments
Alignment length turned out to be a very important factor to take into account when deciding on the best alignment cleaning strategy. Figures 3 and 5 show that, in general, for shorter alignments the best Gblocks condition is the relaxed one, whereas for longer alignments the stringent condition tends to work better. This can also be appreciated by comparing the slopes of the graphs corresponding to the complete alignments, and those of the Gblocks alignments with relaxed and stringent conditions. The downward slope (towards better trees) is less pronounced for the complete alignments and more pronounced for Gblocks with stringent conditions. This means that for single genes (400 to 800 amino acids) the gain in signal-to-noise ratio after elimination of problematic blocks may not compensate for the total loss of information. However, for longer alignments, for example, those used in phylogenomic studies where several genes are concatenated (Delsuc et al., 2005; Jeffroy et al., 2006), there is enough total information so that selecting the best pieces with Gblocks using the stringent conditions allows one to get closer to the real tree. This basic tendency is observed under all simulation conditions we tested.
Bootstrap Support in Trees Obtained from Gblocks Alignments
Previous performance tests of Gblocks with real data showed that Gblocks alignments obtained less support in ML analysis, because the number of trees not significantly different from the ML tree was smaller in the complete alignment than in the Gblocks alignment (Castresana, 2000). Later, in numerous studies in our group and in other groups, the same effect was observed using bootstrap values of NJ trees, which were lower in the Gblocks alignments. Our simulations reproduced the same behavior again. In NJ trees obtained from 100 bootstrap samples, the average bootstrap support of all partitions was higher for the complete alignments, and lower for Gblocks alignments (Fig. 8). However, the same simulations (see topological distances of NJ trees in Figures 3 and 5) showed that the best trees were obtained with Gblocks conditions and the worst topologies with the complete alignments, thus following the opposite direction, regarding quality, to the bootstrap values, at least for the maximum divergence. A similar trend was found for NJ trees of simulations with symmetric trees (Fig. 9 of the online Appendix) and for bootstrapped ML trees (Fig. 10 of the online Appendix). One may think that the bootstraps of Gblocks trees are lower due to the shorter length of the Gblocks alignments, but it is still very paradoxical that the best topology is associated with a lower bootstrap.
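The average bootstrap support discussed here can be sketched in Python as follows. The example resamples alignment columns with replacement and then averages, over the partitions of a reference tree, the percentage of replicate trees that recover each partition. Tree building is deliberately left out: the replicate partitions are assumed to come from whatever reconstruction method is applied to each resampled alignment, and all function names are illustrative rather than taken from the software used in this study.

```python
import random


def resample_columns(alignment, rng):
    """Draw one bootstrap pseudo-replicate by sampling columns with replacement."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def average_support(reference_splits, replicate_split_sets):
    """Mean percentage of replicates that recover each reference partition."""
    if not reference_splits or not replicate_split_sets:
        raise ValueError("need at least one partition and one replicate")
    n_reps = len(replicate_split_sets)
    supports = [
        100.0 * sum(split in rep for rep in replicate_split_sets) / n_reps
        for split in reference_splits
    ]
    return sum(supports) / len(supports)


# Toy alignment and one pseudo-replicate of the same length.
rng = random.Random(0)
alignment = ["ACDEFGHIKL", "ACDEYGHIKL", "ACDQFGHLKL"]
replicate = resample_columns(alignment, rng)

# Partitions are written as frozensets of leaf names on one side of a split.
reference = [frozenset("AB"), frozenset("ABC")]
replicates = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB")},
    {frozenset("AC")},
    {frozenset("AB"), frozenset("ABD")},
]
print(average_support(reference, replicates))
```

Because every pseudo-replicate reuses the same aligned columns, any systematic alignment decision is resampled along with the true signal. This is one way to see how support can rise without the topology improving.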
The explanation for this contradictory behavior of Gblocks may be that divergent and problematic alignment regions are biased towards an erroneous topology (Lake, 1991). This could happen if the initial guide tree used in the progressive alignment methods drives the alignment very strongly in the divergent and most gappy regions, where alignment programs may easily create similarity at the expense of homology (Higgins et al., 2005). In addition, when alignment software is faced with an ambiguous alignment decision, the algorithmic solution makes consistent but arbitrary decisions that bias the support indices. That is, these repeated alignment decisions will increase the bootstrap support, and this bias will be stronger in the most divergent regions, where there is more uncertainty. Three results are consistent with this possibility. Firstly, we have observed in our simulations that the initial guide dendrogram used by ClustalW is indeed very different from the real tree, as measured by the Robinson-Foulds distance between both trees (Fig. 9). If all divergent regions tend to easily reproduce this initial dendrogram, we would expect the guide tree to be more similar to the tree obtained from the regions excluded by Gblocks than to the tree obtained from the Gblocks alignment. Figure 9 shows that this is the case, particularly in the most divergent simulations. Secondly, we see that the effect of increased bootstrap support in the complete alignment with respect to the Gblocks alignments is higher in ClustalW, which depends strongly on the initial dendrogram, than in Mafft (Fig. 8). For example, in simulations of 400 amino acids and at ×2 divergence, there is an increase from 60% to 76% bootstrap support in ClustalW when comparing the Gblocks stringent and complete alignments, and only from 60% to 70% in Mafft. In the latter method, the successive iterations of the alignment algorithm may make the final alignment more independent from the initial crude dendrogram, which would explain why trees generated from these alignments are slightly less biased. And thirdly, when we calculated separately the bootstraps of right and wrong partitions for each tree we observed, apart from lower values for wrong partitions, a slightly higher bias in them (Fig. 11 of the online Appendix). The bias is also present in the right partitions, probably because some of the recurrent software decisions in the divergent regions are actually correct. Thus, the bias coming from divergent regions seems to increase the bootstrap of all partitions, although the effect is slightly larger in the wrong ones. All this indicates that bootstrap support cannot be used as a measure of reliability of the tree topology when divergent regions are present in the alignment.
FIGURE 8. Average bootstrap values of NJ trees obtained from ClustalW (a) and Mafft (b) alignments simulated from the asymmetric tree with three different divergence levels. Complete (solid, red line), Gblocks relaxed (dotted, blue line with diamonds), and Gblocks stringent (dotted, green line with squared symbols) alignments are shown.
FIGURE 9. Average Robinson-Foulds distances from the ClustalW guide tree to the real tree (red line with crossed symbols), from the guide tree to the NJ tree of the Gblocks alignment with relaxed conditions (green line with squared symbols), and from the guide tree to the NJ tree of the complementary positions of the same Gblocks alignment (blue line with diamonds). The asymmetric tree with three different divergence levels was used for the simulations.
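The separate treatment of right and wrong partitions can be sketched in a few lines of Python. The example assumes that bootstrap support has already been computed for each partition of an estimated tree and that the partitions of the real (simulated) tree are known. The dictionary layout and the function name are hypothetical, and the numbers are invented for illustration rather than taken from the simulations.

```python
def support_by_correctness(split_support, true_splits):
    """Return (mean support of right partitions, mean support of wrong ones).

    split_support maps each partition of the estimated tree to its bootstrap
    percentage; a partition is 'right' when it also occurs in the real tree.
    A mean is None when the corresponding group is empty.
    """
    right = [s for split, s in split_support.items() if split in true_splits]
    wrong = [s for split, s in split_support.items() if split not in true_splits]

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(right), mean(wrong)


# Invented supports for one tree from a complete alignment and one tree
# from a cleaned alignment of the same simulated data set.
true_splits = {frozenset("AB"), frozenset("CD")}
complete = {frozenset("AB"): 90, frozenset("CD"): 80, frozenset("AC"): 40}
cleaned = {frozenset("AB"): 84, frozenset("CD"): 76, frozenset("AC"): 30}

print(support_by_correctness(complete, true_splits))
print(support_by_correctness(cleaned, true_splits))
```

Comparing the two pairs of means shows how much support is gained by right and by wrong partitions when the divergent regions are kept, which is the quantity referred to above as the bias.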
CONCLUSIONS
We have shown, under the conditions of these simulations, that the information contained in divergent and ambiguously aligned regions of multiple alignments is, in general, not beneficial for phylogenetic reconstruction. Thus, using Gblocks or a similar method for removing problematic blocks seems to be justified for phylogenetic analysis, particularly for divergent alignments. In this work, we have used simulations of moderately divergent and very heterogeneous proteins, which are typically used in deep phylogenies (i.e., bacterial groups, eukaryotic lineages, metazoan phyla). However, we do not know how removal of blocks would affect more conserved and less heterogeneous alignments. We have also not tested how a finer tuning of parameters of alignment programs and Gblocks may improve the phylogenies. Although we have only used protein alignments, the same conclusions are expected to apply to protein-coding DNA alignments of similar divergence. On the other hand, although we predict that ambiguously aligned regions in any data set are best excluded when they provide more noise than signal, rRNA alignments as well as alignments from noncoding DNA have very different features from coding alignments, and our simulations were not specifically designed to explore the properties of these kinds of sequences. However, our purpose in this work is not to give strict rules about the best alignment strategy and associated parameters. Rather, our simulations are mainly informative about general tendencies. Thus, in the following we summarize important tendencies observed in our simulations and give some general rules regarding the best alignment strategy that can be applied to real situations of protein alignments.
NJ and parsimony seem to be unable to extract useful phylogenetic information from the problematic alignment regions, because the complete alignments are always much worse than the Gblocks-treated alignments, so using Gblocks seems particularly advisable for these methods. Most probably, these two methods are not able to take into account the multiple substitutions that occur in these excessively saturated blocks. On the other hand, ML, less affected by saturation, is able to extract some information from these blocks, since in some conditions the complete alignments are similar to or even better than the Gblocks alignments. However, the misidentified homology that may occur in these regions affects all phylogenetic methods, which may explain why using Gblocks is more beneficial at high divergences for all methods.
Regarding the use of stringent or relaxed conditions for Gblocks, two important rules can be extracted from our analysis. First, for ML trees relaxed conditions of Gblocks seem to give rise to better trees, whereas for NJ and parsimony stringent conditions are better. Second, alignment length is a crucial parameter to be taken into account. For short alignments, such as in studies of single short genes, the removal of blocks by Gblocks may leave too few positions, so in these cases it may be better to use very relaxed conditions of Gblocks. In the shortest alignments, which have very little information, use of Gblocks may even be detrimental. At any rate, one should be aware that with this type of short alignment it is only possible to obtain a very approximate topology, possibly quite distant from the real tree. For phylogenomic studies, where there is enough information from the concatenation of several genes (Jeffroy et al., 2006), the use of Gblocks with stringent conditions tends to give rise to the best phylogenetic trees.
ACKNOWLEDGMENTS
This work was supported financially by a research grant in bioinformatics from the Fundación BBVA (Spain), and grant number BIO2002-04426-C02-02 from the Plan Nacional de Investigación Científica, Desarrollo e Innovación Tecnológica (I+D+I) of the MEC, cofinanced with FEDER funds. We thank V. Soria-Carrasco for useful technical assistance, and three anonymous reviewers, K. Kjer, and R.D.M. Page for critical comments that helped improve the manuscript.
REFERENCES
Aagesen, L. 2004. The information content of an ambiguously alignable region, a case study of the trnL intron from the Rhamnaceae. Organ. Divers. Evol. 4:35–49.
Blackshields, G., I. M. Wallace, M. Larkin, and D. G. Higgins. 2006. Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol. 6:321–339.
Castresana, J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540–552.
Castresana, J., G. Feldmaier-Fuchs, and S. Pääbo. 1998a. Codon reassignment and amino acid composition in hemichordate mitochondria. Proc. Natl. Acad. Sci. USA 95:3703–3707.
Castresana, J., G. Feldmaier-Fuchs, S. Yokobori, N. Satoh, and S. Pääbo. 1998b. The mitochondrial genome of the hemichordate Balanoglossus carnosus and the evolution of deuterostome mitochondria. Genetics 150:1115–1123.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. Pages 345–352 in Atlas of protein sequence structure (M. O. Dayhoff, ed.) National Biomedical Research Foundation, Washington, D.C.
Delsuc, F., H. Brinkmann, and H. Philippe. 2005. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6:361–375.
Do, C. B., M. S. Mahabhashyam, M. Brudno, and S. Batzoglou. 2005. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 15:330–340.
Drummond, A., and K. Strimmer. 2001. PAL: An object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics 17:662–663.
Edgar, R. C. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.
Felsenstein, J. 1989. PHYLIP—Phylogeny inference package (version 3.4). Cladistics 5:164–166.
Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts.
Feng, D. F., and R. F. Doolittle. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351–360.
Fleissner, R., D. Metzler, and A. von Haeseler. 2005. Simultaneous statistical multiple alignment and phylogeny reconstruction. Syst. Biol. 54:548–561.
Gatesy, J., R. DeSalle, and W. Wheeler. 1993. Alignment-ambiguous nucleotide sites and the exclusion of systematic data. Mol. Phylogenet. Evol. 2:152–157.
Geiger, D. L. 2002. Stretch coding and block coding: Two new strategies to represent questionably aligned DNA sequences. J. Mol. Evol. 54:191–199.
Grundy, W. N., and G. J. Naylor. 1999. Phylogenetic inference from conserved sites alignments. J. Exp. Zool. 285:128–139.
Guindon, S., and O. Gascuel. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696–704.
Gutell, R. R., N. Larsen, and C. R. Woese. 1994. Lessons from an evolving rRNA: 16S and 23S rRNA structures from a comparative perspective. Microbiol. Rev. 58:10–26.
Henikoff, S., and J. G. Henikoff. 1994. Protein family classification based on searching a database of blocks. Genomics 19:97–107.
Herrmann, G., A. Schon, R. Brack-Werner, and T. Werner. 1996. CONRAD: A method for identification of variable and conserved regions within proteins by scale-space filtering. Comput. Appl. Biosci. 12:197–203.
Higgins, D. G., G. Blackshields, and I. M. Wallace. 2005. Mind the gaps: Progress in progressive alignment. Proc. Natl. Acad. Sci. USA 102:10411–10412.
Jeffroy, O., H. Brinkmann, F. Delsuc, and H. Philippe. 2006. Phylogenomics: The beginning of incongruence? Trends Genet. 22:225–231.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275–282.
Katoh, K., K. Kuma, H. Toh, and T. Miyata. 2005. MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33:511–518.
Katoh, K., K. Misawa, K. Kuma, and T. Miyata. 2002. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059–3066.
Kjer, K. M. 1995. Use of rRNA secondary structure in phylogenetic studies to identify homologous positions: an example of alignment and data presentation from the frogs. Mol. Phylogenet. Evol. 4:314–330.
Lake, J. A. 1991. The order of sequence alignment can bias the selection of tree topology. Mol. Biol. Evol. 8:378–385.
Lassmann, T., and E. L. Sonnhammer. 2005. Kalign—An accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6:298.
Lee, M. S. 2001. Unalignable sequences and molecular evolution. Trends Ecol. Evol. 16:681–685.
Löytynoja, A., and M. C. Milinkovitch. 2001. SOAP, cleaning multiple alignments from unstable blocks. Bioinformatics 17:573–574.
Lunter, G., I. Miklos, A. Drummond, J. L. Jensen, and J. Hein. 2005. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 6:83.
Lutzoni, F., P. Wagner, V. Reeb, and S. Zoller. 2000. Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology. Syst. Biol. 49:628–651.
Morrison, D. A., and J. T. Ellis. 1997. Effects of nucleotide sequence alignment on phylogeny estimation: A case study of 18S rDNAs of apicomplexa. Mol. Biol. Evol. 14:428–441.
Needleman, S. B., and C. D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443–453.
Notredame, C., D. G. Higgins, and J. Heringa. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302:205–217.
Nuin, P. A., Z. Wang, and E. R. Tillier. 2006. The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics 7:471.
Ogden, T. H., and M. S. Rosenberg. 2006. Multiple sequence alignment accuracy and phylogenetic inference. Syst. Biol. 55:314–328.
Pesole, G., M. Attimonelli, G. Preparata, and C. Saccone. 1992. A statistical method for detecting regions with different evolutionary dynamics in multialigned sequences. Mol. Phylogenet. Evol. 1:91–96.
Philippe, H., and J. Laurent. 1998. How good are deep phylogenetic trees? Curr. Opin. Genet. Dev. 8:616–623.
Redelings, B. D., and M. A. Suchard. 2005. Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 54:401–418.
Robinson, D. F., and L. R. Foulds. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131–147.
Rodrigo, A. G., P. R. Bergquist, and P. L. Bergquist. 1994. Inadequate support for an evolutionary link between the Metazoa and the Fungi. Syst. Biol. 43:578–584.
Smythe, A. B., M. J. Sanderson, and S. A. Nadler. 2006. Nematode small subunit phylogeny correlates with alignment parameters. Syst. Biol. 55:972–992.
Stajich, J. E., et al. 2002. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12:1611–1618.
Stoye, J., D. Evers, and F. Meyer. 1998. Rose: Generating sequence families. Bioinformatics 14:157–163.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: A quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964–969.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pages 407–514 in Molecular systematics (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer Associates, Sunderland, Massachusetts.
Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin, and D. A. Natale. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL
W: Improving the sensitivity of progressive multiple sequence
alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Thompson, J. D., P. Koehl, R. Ripp, and O. Poch. 2005. BAliBASE 3.0:
Latest developments of the multiple sequence alignment benchmark.
Proteins 61:127–136.
Wheeler, W. 2001. Homology and the optimization of DNA sequence
data. Cladistics 17:S3–S11.
Xia, X., Z. Xie, and K. M. Kjer. 2003. 18S ribosomal RNA and tetrapod
phylogeny. Syst. Biol. 52:283–295.
Yang, Z. 1998. On the best evolutionary rate for phylogenetic analysis.
Syst. Biol. 47:125–133.
Young, N. D., and J. Healy. 2003. GapCoder automates the use
of indel characters in phylogenetic analysis. BMC Bioinformatics
4:6.
First submitted 7 February 2007; reviews returned 6 March 2007;
final acceptance 24 March 2007
Associate Editor: Karl Kjer
Editors: Rod Page and Jack Sullivan
... Two species of Phyllodocidae, Eulalia viridis and Notophyllum foliosum, were included as outgroups (Ravara et al. 2019). Alignments were performed using MAFFT v7.505 (Katoh and Standley 2013) with the default parameters, then trimmed using Gblocks v0.91b (Gblocks parameters: minimum length of a block = 5; allowed gap positions = with half) (Talavera and Castresana 2007). A concatenation of (Zhang et al. 2020), with missing genes filled with "-". ...
Natsushima is a genus of deep-sea Chrysopetalidae (Annelida) characterized by numerous bifurcate chaetae. It is poorly known, with three species living in the mantle cavity of bivalves in chemosynthetic habitats. Here we describe Natsushima nanhaiensis n. sp. based on an integrative morphological and molecular phylogenetic analysis of specimens collected from the Haima cold seep in the South China Sea. Morphologically, the new species can be distinguished from its congeneric species by the shape and number of the neuropodial hooks and bifurcate chaetae, the shape of the parapodia, and the long dorsal cirri. Sequence comparison and phylogenetic analysis based on the mitochondrial COI and 16S rRNA gene sequences supported the placement of Natsushima nanhaiensis n. sp. in Natsushima and its status as a distinct species. We also present a key to species of Natsushima and discuss their biogeography.
... The contigs were matched to the probes using the PHYLUCE's program "phyluce_assembly_match_contigs_to_probes" with min-coverage and min-identity set to 70 and 80, respectively (see Tables S3 and S4 for more details on contigs numbers and length). The UCE loci were then aligned using MAFFT 7 (Katoh and Standley 2013) and trimmed using GBLOCKS to eliminate poorly or ambiguously aligned, as well as divergent positions (Talavera and Castresana 2007) using relaxed parameters (b1 = 0.6, b2 = 0.6, b3 = 8, b4 = 5), both implemented in PHYLUCE. For the final matrix, only those loci with at least 80% of the taxa present were kept for phylogenetic analyses (923 UCEs-22% of missing data). ...
Aim Insect brood parasites (i.e., cleptoparasites), like cuckoo bees, typically attack hosts within specific lineages, but seem to be less constrained by the biogeographical movements of their hosts compared to obligate parasites. Cuckoo bees depend on stable host populations, being particularly sensitive to environmental changes and thus valuable bioindicators of the bee community health. We here test the congruence between the biogeographical history of cuckoo oil bees and their oil bee hosts. Location The Americas. Taxon Bees (Hymenoptera, Apidae). Methods Using phylogenomic and Sanger sequence data, we present new time‐calibrated phylogenies for cuckoo oil bees in the ericrocidine line and their oil bee hosts, Centris and Epicharis. We estimate their ancestral ranges using six historical biogeographical models on a set of 100 trees, randomly sampled from the posterior distribution of phylogenies in each group, thus accounting for uncertainties in divergence time estimates and model selection. Results The origin of the hosts stem in the Cretaceous precedes the origin of their cleptoparasite's stem in the Palaeocene. Cleptoparasite and host crown origins were synchronous in the Eocene, and both took place in tropical South America. While the pair Rhathymini‐ Epicharis remained mostly associated within this region, Centris and their cleptoparasites expanded their distribution to other parts of Neotropical and Nearctic regions in independent range expansions events. In all cases, host range shifts preceded the cleptoparasite shifts. Main Conclusion The biogeographical history of cleptoparasitic oil bees and oil‐collecting hosts is generally congruent in time and space. Events of range expansion mainly occurred in the more species‐rich lineages of cleptoparasites. Range shifts in cleptoparasites followed the distribution of their hosts and coincided with the distribution of oil‐producing plants visited by the host bees. 
Our results broaden our understanding of the complex biogeography of interacting partners and on how changes in host distributions may impact cleptoparasitic bees.
... The multiple alignment was conducted by MUSCLE in the MEGA11 software (Edgar, 2004; Tamura et al., 2021). To eliminate poorly aligned positions and divergent regions, GBlocks version 0.91.1 was used (Castresana, 2000; Talavera & Castresana, 2007). Phylogenetic analysis was performed with the Bayesian inference (BI) and the Maximum Likelihood (ML) algorithms using MrBayes v. 3.2.7 and MEGA11 software (Tamura et al., 2011), respectively. ...
Sequences of the ITS1–5.8S–ITS2 rDNA of Gyrodactylus alviga Dmitrieva & Gerasev, 2000 from Merlangius merlangus L. (Gadiformes: Gadidae) in the Black Sea were obtained for the first time. Gyrodactylus alviga is 0.2% distinct from G. pterygialis Bychowsky & Polyansky, 1953 parasitising the gadid fish Pollachius virens L. in the Norwegian Sea and Gyrodactylus sp. from Microgadus tomcod Walbaum of the same fish family in the Northwest Atlantic, based on the genetic variability of the ITS region. The most species-specific ITS1 region was identical in both species. The differences in the ITS2 secondary structure and compensatory base changes in its hairpins between G. alviga and G. pterygialis were not observed. Morphometric comparison of G. alviga and G. pterygialis also showed no significant differences. On this basis, G. alviga is synonymised with G. pterygialis and a redescription of the latter is presented, including G. alviga new syn. Findings of this species in the White and Bering Seas, and possibly off the northeastern coast of North America, require confirmation based on both morphological and molecular data. The results of this study show that G. pterygialis has a wider distribution than previously known. The good concordance of the secondary structure of the first ITS2 hairpin with the phylogenetic reconstruction of Gyrodactylus species based on the whole ITS region was revealed, which is of interest for further studies on the phylogenetic systematics of Gyrodactylus.
... The resultant files were subjected to additional manual corrections using MEGA v.11.0 (Tamura et al. 2021). Subsequent analyses were performed in PhyloSuite, where ambiguous sites and gaps were removed using Gblocks (Talavera and Castresana 2007). The sequences were then concatenated into a single alignment and converted into Nexus format files. ...
In this study, we describe Lilium huanglongense, a newly-discovered lily species identified following extensive surveys in an undeveloped area of the Huanglong National Nature Reserve in Sichuan, China. This region, located in the Hengduan Mountains of south-western China, is recognised as one of the world’s prominent biodiversity hotspots, providing diverse habitats for a wide range of plant species. Morphologically, L. huanglongense resembles Lilium fargesii Franch., which is distributed in central China, as well as other tepal-recurved members of the section Lophophora (Bureau & Franch.) F. T. Wang & Ts. Tang. This section comprises dwarf lilies predominantly found in the alpine scrub of the Hengduan Mountains, extending westwards into the Himalayas. Molecular phylogenetic analyses using both nuclear ITS and chloroplast genomes confirm the independent status of the new species and its placement within the section Lophophora. The identification of this new species helps to fill the distribution gap between broad-leaved forest and alpine scrub species within the section, thereby enhancing our understanding of the diversity and distribution history of Lophophora.
... The genomes were inserted into curated multiple sequence alignments (MSAs) for each COG family using Muscle 48 . The curated alignments were trimmed using GBLOCKS 49 to remove the poorly aligned sections. Subsequently, the MSAs were concatenated and used for the tree construction. ...
Biogas production through the anaerobic digestion (AD) of organic waste plays a crucial role in promoting sustainability and closing the carbon cycle. Over the past decade, this has driven global research on biogas-producing microbiomes, leading to significant advances in our understanding of microbial diversity and metabolic pathways within AD plants. However, substantial knowledge gaps persist, particularly in understanding the specific microbial communities involved in biogas production in countries such as South Korea. The present dataset addresses one of these gaps by providing comprehensive information on the metagenomes of five full-scale mesophilic biogas reactors in South Korea. From 110 GB of raw DNA sequences, 401 metagenome-assembled genomes (MAGs) were created, which include 42,301 annotated genes. Of these, 187 MAGs (46.7%) were classified as high-quality based on Minimum Information about Metagenome-Assembled Genome (MIMAG) standards. The data presented here contribute to a broader understanding of biogas-specific microbial communities and offers a significant resource for future studies and advancements in sustainable biogas production.
Delectopecten is a small genus of the family Pectinidae (Bivalvia: Pectinida) that remains poorly studied in terms of both morphology and phylogeny. Here, we describe the first member of this genus from deep-sea hydrothermal vent ecosystems, D. thermus sp. nov., based on morphological investigations and molecular analyses of a specimen collected from the Higashi-Ensei vent field (962-m depth) in the northern Okinawa Trough. Morphologically, this new species resembles D. vancouverensis and D. gelatinosus in shell size, shape, auricle size and sculpture. However, D. thermus sp. nov. can be distinguished from its congeneric species (including 9 extant and 12 fossil species) by its unequal auricles (the anterior one being larger than the posterior), inwardly recurved anterior auricle of the left valve and a large byssal notch angle of ~90°. Comparisons of genetic sequences from three mitochondrial and three nuclear gene fragments supported the placement of the new species in the genus Delectopecten. Further phylogenetic analyses using these gene markers support that Delectopecten is monophyletic and positioned as an early diverging clade of the family Pectinidae. Additionally, the mitogenome of D. thermus sp. nov. was assembled and annotated, a first for its genus, revealing significant divergences in gene order compared to other pectinids. The 16S rRNA amplicon analysis of the gill tissue indicated that this vent-dwelling scallop does not exhibit symbiosis with chemosynthetic bacteria. A key to all known species of Delectopecten is provided to aid the identification of species in this understudied genus.
The passion flower bee, Protandrena (Anthemurgus) passiflorae (Robertson) is a monolectic, host-plant specialist of the passionflower plant Passiflora lutea L. Using a single adult male individual, we generated long-read PacBio HiFi, HiC, and short-read RNA sequencing data to build a well-annotated, chromosome-level genome assembly for this species. The final nuclear genome is 249 Mb with 150x coverage and with most of the genome scaffolding into 12 chromosomes. The scaffold N50 is 21.4 Mb and the genome has a Benchmarking Universal Single-Copy Ortholog (BUSCO) score of 97.2% for 5991 hymenopteran genes. BRAKER3 annotation of the genome identified 12,098 genes and 15,353 total transcripts and found that 20.27% of the genome is made up of repetitive elements. We resolved a mitochondrial genome of 12.7 kb. The P. passiflorae genome represents one of only a few published andrenid bee genomes and one of the first monolectic bees. This new high-quality genome will serve as a valuable resource for investigating the genomic basis of specialization and for providing a useful resource for studying pollinator health and conservation.
Horseshoe bats are natural hosts of zoonotic viruses, yet the genetic basis of their antiviral immunity is poorly understood. Here we generated two new chromosomal-level genome assemblies for horseshoe bat species (Rhinolophus) and three close relatives, and show that, during their diversification, horseshoe bats underwent extensive chromosomal rearrangements and gene expansions linked to segmental duplications. These expansions have generated new adaptive variations in type I interferons and the interferon-stimulated gene ANXA2R, which potentially enhance antiviral states, as suggested by our functional assays. Genome-wide selection screens, including of candidate introgressed regions, uncover numerous putative molecular adaptations linked to immunity, including in viral receptors. By expanding taxon coverage to ten horseshoe bat species, we identify new variants of the SARS-CoV-2 receptor ACE2, and report convergent functionally important residues that could explain wider patterns of susceptibility across mammals. We conclude that horseshoe bats have numerous signatures of adaptation, including some potentially related to immune response to viruses, in genomic regions with diverse and multiscale mutational changes.
Mitochondrial tRNA gene loss and cytosolic tRNA import are two common phenomena in mitochondrial biology, but their importance is often under-appreciated in animals. This is because the mitochondrial DNA (mtDNA) of most bilaterally symmetrical animals (Bilateria) encodes a complete set of tRNAs required for mitochondrial translation. By contrast, the mtDNA of non-bilaterian animals (phyla Cnidaria, Ctenophora, Porifera, and Placozoa) often contains a reduced set of tRNA genes, necessitating tRNA import from the cytosol. Interestingly, in many non-bilaterian lineages, tRNA gene content appears to be set early in evolution and remains conserved thereafter. Here, we report that Clade B of Haplosclerid Sponges (CBHS) represents an exception to this pattern, displaying considerable variation in tRNA gene content even among relatively closely related species. We determined mt-genome sequences for eight CBHS species and analyzed them in conjunction with six previously available sequences. Additionally, we sequenced mt-genomes for two species of haplosclerid sponges outside the CBHS and used eight previously available sequences as outgroups. We found that tRNA gene content varied widely within CBHS, ranging from three in an undescribed Haliclona species (Haliclona sp. TLT785) to 25 in Xestospongia muta and X. testudinaria. Furthermore, we found that all CBHS species outside the genus Xestospongia lacked the atp9 gene, with some also lacking atp8. Analysis of nuclear sequences from Niphates digitalis revealed that both atp8 and atp9 had transferred to the nuclear genome, while the absence of mt-tRNA genes indicated their genuine loss. We argue that CBHS can serve as a valuable system for studying mt-tRNA gene loss, mitochondrial import of cytosolic tRNAs, and the impact of these processes on mitochondrial evolution.
An earlier analysis of the trnL intron in the Colletieae (Rhamnaceae) showed polyphyly of the genus Discaria. Polyphyly of Discaria is supported only by an AT-rich region of ambiguous alignment within the trnL intron. Polyphyly of the genus relies on extracting the information of the AT-rich region correctly. Ambiguously aligned regions are commonly excluded from phylogenetic analysis. In the present study the question was raised whether random or noisy data could generate a pattern like the one found in the AT-rich region of ambiguous alignment. The original pattern was resistant to changes in alignment parameter cost when submitted to a sensitivity analysis using direct optimization. Artificially generated random or noisy data gave well-resolved trees but these were found to be extremely sensitive to changes in parameter costs. However, information from additional data, such as conserved regions, restricts the influence of random data. It is here suggested that the information in ambiguously aligned regions need not be dismissed, provided that an appropriate method that finds all possible optimal alignments is used to extract the information. In addition to commonly used support measures, some information of robustness to changes in alignment parameter costs is needed in order to make the most reliable conclusions.
Phylogenetic analyses of non-protein-coding nucleotide sequences such as ribosomal RNA genes, internal transcribed spacers, and introns are often impeded by regions of the alignments that are ambiguously aligned. These regions are characterized by the presence of gaps and their uncertain positions, no matter which optimization criteria are used. This problem is particularly acute in large-scale phylogenetic studies and when aligning highly diverged sequences. Accommodating these regions, where positional homology is likely to be violated, in phylogenetic analyses has been dealt with very differently by molecular systematists and evolutionists, ranging from the total exclusion of these regions to the inclusion of every position regardless of ambiguity in the alignment. We present a new method that allows the inclusion of ambiguously aligned regions without violating homology. In this three-step procedure, first homologous regions of the alignment containing ambiguously aligned sequences are delimited. Second, each ambiguously aligned region is unequivocally coded as a new character, replacing its respective ambiguous region. Third, each of the coded characters is subjected to a specific step matrix to account for the differential number of changes (summing substitutions and indels) needed to transform one sequence to another. The optimal number of steps included in the step matrix is the one derived from the pairwise alignment with the greatest similarity and the least number of steps. In addition to potentially enhancing phylogenetic resolution and support, by integrating previously nonaccessible characters without violating positional homology, this new approach can improve branch length estimations when using parsimony.
A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT‐NS‐2) and the iterative refinement method (FFT‐NS‐i), are implemented in MAFFT. The performances of FFT‐NS‐2 and FFT‐NS‐i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT‐NS‐2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT‐NS‐i is over 100 times faster than T‐COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.
In the eight years since we last examined the amino acid exchanges seen in closely related proteins, the information has doubled in quantity and comes from a much wider variety of protein types. The matrices derived from these data that describe the amino acid replacement probabilities between two sequences at various evolutionary distances are more accurate and the scoring matrix that is derived is more sensitive in detecting distant relationships than the one that we previously derived. The method used in this chapter is essentially the same as that described in the Atlas, Volume 3 and Volume 5. Accepted Point Mutations An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection. It is the result of two distinct processes: the
A metric on general phylogenetic trees is presented. This extends the work of most previous authors, who constructed metrics for binary trees. The metric presented in this paper makes possible the comparison of the many nonbinary phylogenetic trees appearing in the literature. This provides an objective procedure for comparing the different methods for constructing phylogenetic trees. The metric is based on elementary operations which transform one tree into another. Various results obtained in applying these operations are given. They enable the distance between any pair of trees to be calculated efficiently. This generalizes previous work by Bourque to the case where interior vertices can be labeled, and labels may contain more than one element or may be empty.
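As a rough illustration of the kind of tree comparison described in this abstract, the distance now usually called the Robinson-Foulds distance can be computed as the number of bipartitions (splits) found in one tree but not the other. The sketch below is not taken from the paper: it assumes leaf-labelled trees written as nested Python tuples, treats them as unrooted, and ignores labelled interior vertices; the names `leaves`, `splits`, and `rf_distance` are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of a tree, viewed as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    reference taxon, so the same split always has the same representation.
    """
    all_taxa = leaves(tree)
    ref = min(all_taxa)  # arbitrary but deterministic reference taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if ref in clade else clade
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must share the same set of taxa")
    return len(splits(tree_a) ^ splits(tree_b))


# Two five-taxon trees that share the (D,E) grouping but disagree elsewhere.
t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
print(rf_distance(t1, t2))
```

Storing each split by the side that excludes a reference taxon keeps the comparison independent of where the tuple representation happens to be rooted.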
We describe a new method (T-Coffee) for multiple sequence alignment that provides a dramatic improvement in accuracy with a modest sacrifice in speed as compared to the most commonly used alternatives. The method is broadly based on the popular progressive approach to multiple alignment but avoids the most serious pitfalls caused by the greedy nature of this algorithm. With T-Coffee we pre-process a data set of all pair-wise alignments between the sequences. This provides us with a library of alignment information that can be used to guide the progressive alignment. Intermediate alignments are then based not only on the sequences to be aligned next but also on how all of the sequences align with each other. This alignment information can be derived from heterogeneous sources such as a mixture of alignment programs and/or structure superposition. Here, we illustrate the power of the approach by using a combination of local and global pair-wise alignments to generate the library. The resulting alignments are significantly more reliable, as determined by comparison with a set of 141 test cases, than any of the popular alternatives that we tried. The improvement, especially clear with the more difficult test cases, is always visible, regardless of the phylogenetic spread of the sequences in the tests.