
Parsimony analysis of unaligned sequence data: an exchange. Version 2 (May 2016).


Abstract

This is a pdf-version of an ongoing exchange in ResearchGate about parsimony analysis of unaligned sequence data. For reasons explained in the text, pdf is better suited for this than Researchgate's own format to provide feedback on papers. Version two of this exchange (May 2016) contains Santiago Castroviejo-Fisher's latest arguments against maximization of homology in such analyses.
Parsimony analysis of unaligned sequence data:
an exchange
10 May 2016, version 2
Introduction ........................................................................................................................................................ i
Version 1 ......................................................................................................................................................... i
Changes and additions in version 2 (10/05/2016) ......................................................................................... ii
Contribution 1: Jan - Tuesday 06/10/2015 ........................................................................................................ 1
Contribution 2: Santiago - Tuesday 06/10/2015 20:01 ..................................................................................... 2
Contribution 3: Jan - Saturday 10/10/2015 ...................................................................................................... 3
Contribution 4: Santiago - Wed 14/10/2015 ..................................................................................................... 6
Contribution 5: Jan - Sat 24/10/2015 ................................................................................................................ 7
TREE ALIGNMENTS ...................................................................................................................................... 7
PARSIMONY AND EXPLANATION ................................................................................................................... 9
COMPOSITIONAL SEQUENCE HOMOLOGY .................................................................................................. 11
SUBSEQUENCE HOMOLOGY ........................................................................................................................ 13
ADDING IT UP ............................................................................................................................................... 15
POY ............................................................................................................................................................... 15
REFERENCES ................................................................................................................................................. 16
Contribution 6: Santiago - Friday 06/05/2016 ................................................................................................ 17
An unrealistic biological assumption ........................................................................................................... 17
Logically irreconcilable with the anti-superfluity principle ......................................................................... 18
Character Independence ............................................................................................................................. 19
Explanation and Similarity ........................................................................................................................... 20
Introduction
Version 1
A couple of weeks ago, browsing in www.researchgate.net, I came across this paper:
Phylogenetic systematics of egg-brooding frogs (Anura: Hemiphractidae) and the evolution of
direct development
SANTIAGO CASTROVIEJO-FISHER · JOSÉ M. JR. PADIAL · IGNACIO DE LA RIVA · JOSÉ P. POMBAL JR · HELIO R. DA
SILVA · FERNANDO J. M. ROJAS-RUNJAIC · ESTEBAN MEDINA-MÉNDEZ · DARREL R. FROST
Zootaxa 08/2015; 4004(1):1-75. DOI:10.11646/zootaxa.4004.1.1
This paragraph on p. 8 struck me:
Optimality criterion and nucleotide homology. We chose the criterion of parsimony (unweighted) so that
our phylogenetic inferences minimize ad hoc assumptions and maximize falsifiability and explanatory
power of evidence (Wiley 1975; Farris 1983; Farris et al. 2001; Kluge 2001a, b, 2009; Kluge & Grant 2006;
Grant & Kluge 2009). We applied parsimony to tree-alignment (Sankoff 1975; Sankoff & Rousseau 1975;
Sankoff et al. 1976; Wheeler 1996) to infer the minimum number of transformation events needed to
explain observed differences (including indels) in DNA sequences (Grant & Kluge 2004; 2009; Kluge & Grant
2006; Wheeler et al. 2006; Grant & Kluge 2009; Padial et al. 2014).
Researchgate has the nice feature that it allows direct feedback. So I asked Santiago and his coauthors a
question:
And an exchange followed.
Unfortunately, graphics cannot be included in this kind of feedback, and Researchgate only provides a
single font, a font that happens to be proportional. Two serious drawbacks when writing about sequence
alignments. So I think it is useful to collect the ongoing online exchange in this pdf file, a format that allows
some formatting to be added for readability. Other than removing some inconsistencies in punctuation, I
will indicate changes and comments beyond what appeared online by putting them in green between
square brackets.
This is an ongoing effort, so this pdf is expected to grow. I will document any future addition in this
introduction. In the meantime, I hope that the current version may help to clarify the issues that surround
parsimony analysis in a tree alignment context. Feel free to participate in the online discussion, or start a
new one here.
Big thanks to Santiago for taking this up!
Jan De Laet
Veltem-Beisem
27 October 2015
Changes and additions in version 2 (10/05/2016)
Addition of Santiago’s comments from 6 May 2016. Thanks to Santiago for providing me with a slightly
edited version compared to the online comment.
Contribution 1: Jan - Tuesday 06/10/2015
Hi Santiago and coauthors,
I have a question.
On page 8 you say that you applied parsimony to tree-alignment to infer the minimum number of
transformation events needed to explain observed differences (including indels) in DNA sequences.
But if you use minimization of transformation events as optimality criterion in a tree-alignment analysis,
then you end up with a methodological breakdown. This is so because any observed sequence can then be
explained by postulating just one big insertion event. For all except the most trivial datasets, that means
that the data are optimally explained on any possible tree by postulating just as many insertion events as
there are observed sequences. So there is no way to choose any tree over any other tree. (This is not new, I
discussed it at length in the two papers at the bottom of this note; one is from 2005, the other just
appeared in Cladistics).
I know that this is absurd from a biological point of view, but it follows from your stated goal to minimize
transformation events in a tree-alignment analysis.
In your paper you clearly prefer some trees over other trees, so you must have been minimizing something
else instead. Your paper does not contain the actual cost settings that you used in POY, so I am at a loss
trying to figure out exactly what. Can you help me out?
Thanks in advance,
Best regards,
-- Jan
https://www.researchgate.net/publication/260812208_Parsimony_and_the_problem_of_inapplicables_in_sequence_data
http://onlinelibrary.wiley.com/doi/10.1111/cla.12098/abstract
Contribution 2: Santiago - Tuesday 06/10/2015 20:01
Hi Jan,
Thanks for the feedback. I have never had anyone commenting on my papers, so this is kind of new to me.
Anyways, it is very interesting to chat with other people interested in phylogenetics.
Are you Swedish? I did my PhD in Uppsala and have many emotional and professional links to Sweden
after living there for five years.
About your comment. Yes, I have read your papers. Nice contributions. Well, if I understand your comment
correctly, I think I can answer your concern.
I am considering characters at the "atomic" level including indels and nucleotides. Thus, transformation
series in our analysis are composed of single elements (i.e., a nucleotide or absence of nucleotide) and not
groups of nucleotides and/or indels. In this sense, I think that we are actually minimizing transformation
events. I agree with you that the way we expressed it in the paper could be confusing. I also agree with you
that it is a very naive approach to consider that all indels happen as insertions/deletions of single
nucleotides. What I am still unsure of is whether there is a better way to do parsimony tree-alignment.
Again, if I understand your position correctly, your idea rests on interpreting parsimony as two-taxon
analysis (right?) as you suggest in De Laet & Smets (1998).
Best wishes,
Santiago
Contribution 3: Jan - Saturday 10/10/2015
Hi Santiago,
No, I'm not Swedish, I'm from Belgium. But as a postdoc I've been in Stockholm for a while, and for my
phylogenetic research I'm now associated with Gothenburg Botanical Garden. So quite some links to
Sweden here as well. I've also visited Brazil a couple of times, and also have good memories of those trips.
Thanks for your clarification and feedback. My reply here may suggest otherwise, but I do think that we
have a lot of common ground.
On p. 8 you write: "We chose the criterion of parsimony (unweighted) so that our phylogenetic inferences
minimize ad hoc assumptions and maximize falsifiability and explanatory power of evidence (Wiley 1975;
Farris 1983; Farris et al. 2001; Kluge 2001a, b, 2009; Kluge & Grant 2006; Grant & Kluge 2009)."
I agree that parsimony minimizes ad hoc assumptions and maximizes explanatory power of evidence. But
agreeing with that does not mean that I agree with all argumentation that is provided to that effect in the
references that you provide. I'll concentrate on what Kluge and Grant (2006; KG6) and Grant and Kluge
(2009; GK9) say about explanatory power.
They start from the philosophical principle of anti-superfluity, a parsimony principle. From that principle,
they argue that explanatory power is maximized when historical transformation events, including indels,
are minimized. They are explicit that this also applies in the context of tree alignments. But they don’t
discuss problems that might be posed by historical indel events that span multiple residues. They don’t
discuss how to deal with such events from a theoretical point of view, and they don’t discuss how such
events should be dealt with in practice, when analyzing empirical sequence data with a tree alignment
program such as POY.
But from papers such as Kluge (2005) and Grant et al. (2006; full references are given at the end), it is clear
that they both translate minimization of transformation events into cost set 111 for use in POY. As you
pointed out, this is the same cost set that you have been using: it assigns a cost of one to transitions, a cost
of one to transversions, and a cost of one to unit gaps. (I use the terminology that I also used in my 2005
and 2015 papers: a sequence such as 'a a a - - - a c c c t' has one gap that consists of three unit gaps).
This poses some fundamental problems. I'll use this small dataset of three sequences as an illustration:
A a a a a c c c t
B a a a a c c c t
C a a a c c c a c g c t
With cost set 111, the optimization on the single unrooted tree for three sequences has this implied
alignment:
A a a a - - - a c c c t
B a a a - - - a c c c t
C a a a c c c a c g c t
It comes at a total cost of four: one base substitution and three unit gaps. So the single gap of length three
is explained by postulating three distinct historical indel events, each one involving just a single position.
This exposes the hidden assumption in KG6 and GK9's rationale: historical indel events never involve more
than one nucleotide at a time. I agree that this is a naive assumption. But it sits deeply embedded in the
core of KG6 and GK9's view of parsimony analysis: in their view, each base substitution and each unit gap
stands for a distinct historical event.
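[A small Python sketch, added for this formatted version and not part of the online exchange, to make the cost accounting concrete. It scores the implied alignment of the three-sequence example under cost set 111, where every base substitution and every unit gap costs one; the inner-node row is a hypothetical reconstruction for the single inner node of the unrooted three-taxon tree.]

```python
def branch_cost_111(x, y):
    """Cost of a branch between two aligned rows under cost set 111:
    one per substitution and one per unit gap (any mismatched column)."""
    return sum(1 for a, b in zip(x, y) if a != b)

# Rows of the implied alignment from the text, gaps written as '-'.
A = "aaa---accct"
B = "aaa---accct"
C = "aaacccacgct"
inner = "aaa---accct"  # hypothetical reconstruction at the single inner node

total = sum(branch_cost_111(inner, row) for row in (A, B, C))
print(total)  # 4: one substitution plus three unit gaps, all on the branch to C
```

With this reconstruction the branches to A and B are free, and the entire cost of four falls on the branch to C, matching the count in the text.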
There is still a deeper problem. KG6 and GK9 rely on the philosophical notion of anti-superfluity. But
properly applied in this context, that principle implies the following: when there is a choice between an
explanation that involves three transformation events and an explanation that only involves a single
transformation event, then the explanation with only a single transformation event should be preferred. In
other words, if I can explain a single gap of length three with only one transformation event, I should not
postulate three such events. And this should also apply during analysis. So KG6 and GK9’s position not only
involves an unrealistic biological assumption, the theoretical core of their position itself is self-
contradictory in its application of anti-superfluity, the central concept on which that core is built (I’ve
discussed this in my recent paper in Cladistics).
You say that you are considering ‘characters at the "atomic" level including indels and nucleotides. Thus,
transformation series in our analysis are composed of single elements (i.e., a nucleotide or absence of
nucleotide) and not groups of nucleotides and/or indels. In this sense, I think that we are actually minimizing
transformation events’.
I agree with that, but only as far as it goes: you are minimizing some abstract concept of transformations
(and doing so by postulating some abstract and non-standard notion of transformation series). But such
transformations have no sensible biological meaning, and they are not the kind of unique historical events
that KG6 and GK9 rely on in their theoretical framework (even if they give up that meaning when it comes
to empirical work or recommendations). So I don’t think it makes sense to refer to KG6 and GK9 as a
rationale for 111.
So, are there alternatives?
What happens, for example, if, within KG6 and GK9’s view of parsimony, you give up the assumption that
indels only affect single nucleotides at a time? That leads to giving up 111. The result is the methodological
breakdown that I mentioned in my previous post.
One might take a more pragmatic stance: first use 111 to get trees, then use those trees to infer where
indels might have affected multiple positions (to be sure, I’ve never seen this position defended in papers,
I’m just exploring possibilities). This might give decent results in practice, but as a method it is inconsistent:
during the analysis it is assumed that indels affect only single nucleotides at a time, after the analysis, this
assumption is given up. It may be useful to discuss an example. Consider this dataset:
A a a a
B a a a a a a
C a a a a a a a a a
D a a a a a a a a a a a a
Analysis with cost set 111 leads to the following total costs on the three different unrooted trees for four
terminals:
((A B)(C D)) -> total cost 9
((A C)(B D)) -> total cost 12
((A D)(B C)) -> total cost 12
So, during analysis, ((A B)(C D)) is preferred as the best explanation because it minimizes historical events
under the assumption that historical indel events only affect single nucleotides. After the analysis, it is then
concluded that the best explanation actually involves only three historical indel events (all three involving a
subsequence of length three). But, admitting that indel events can involve multiple nucleotides, the two
other trees can also explain the data with only three historical indel events (but involving subsequences of
different lengths). The net result is that equally good explanations under a realistic assumption are not
considered because they have previously been rejected during an analysis that was performed under an
unrealistic assumption.
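[Because every site in these four sequences is an ‘a’, no substitutions are possible and, under cost set 111, the cost of a branch is simply the length difference between the sequences at its two ends; the cost of a tree is then minimized over the lengths of the two inner-node sequences. The brute-force Python sketch below, added here for illustration and not part of the exchange, makes the comparison between the three quartets explicit.]

```python
from itertools import product

# Numbers of 'a' residues in the four observed sequences.
lengths = {"A": 3, "B": 6, "C": 9, "D": 12}

def tree_cost(pair1, pair2):
    """Minimum 111 cost on unrooted tree ((p q)(r s)), minimized over the
    lengths n1 and n2 of the two inner-node sequences: each branch costs
    the length difference between its endpoints (all in unit gaps)."""
    (p, q), (r, s) = pair1, pair2
    return min(
        abs(n1 - lengths[p]) + abs(n1 - lengths[q])
        + abs(n2 - lengths[r]) + abs(n2 - lengths[s])
        + abs(n1 - n2)
        for n1, n2 in product(range(13), repeat=2)
    )

print(tree_cost(("A", "B"), ("C", "D")))  # 9
print(tree_cost(("A", "C"), ("B", "D")))  # 12
print(tree_cost(("A", "D"), ("B", "C")))  # 12
```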
One might still be more pragmatic: ‘I don’t care what the underlying rationale is, operationally 111 has
again and again proven to give decent results, so I’ll stick to it’. Or ‘111 is simple, that’s sufficient’. That
would come close to arguments of simplicity that Wheeler at times has voiced.
In 2003 and 2005 I’ve argued from a completely different perspective, one that you indeed could say was
inspired by my 1997 view of parsimony as two-item analysis (I only explicitly called it such in 1998, but the
ideas are there). It leads to a view that using 111 in POY amounts to a specific kind of differential weighting
of evidence (I’ve elaborated that in my recent paper in Cladistics). To obtain (an approximation of ) equally
weighted evidence, one should use cost set 3221 (gap opening cost 3, transition and transversion cost 2,
gap extension cost 1). I’ll try to expand a bit on that later this weekend.
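[A rough Python sketch, added here for illustration, of how such an affine cost regime scores a single branch of an implied alignment. The bookkeeping is an assumption on my part - an indel of length k is taken to cost the opening cost 3 plus k times the extension cost 1 - and POY's internal conventions may differ in detail.]

```python
def branch_cost_3221(x, y, subst=2, gap_open=3, gap_extend=1):
    """Score two aligned rows under the assumed 3221 convention:
    substitutions (transitions and transversions alike) cost 2; a gap of
    length k costs 3 for opening plus 1 per unit gap."""
    cost = 0
    in_gap = False
    for a, b in zip(x, y):
        if a == b:
            in_gap = False
        elif a == "-" or b == "-":
            if not in_gap:
                cost += gap_open   # a new indel event starts here
                in_gap = True
            cost += gap_extend     # one unit gap
        else:
            cost += subst
            in_gap = False
    return cost

# One gap of length three plus one substitution: (3 + 3*1) + 2 = 8
print(branch_cost_3221("aaa---acgct", "aaacccaccct"))  # 8
```

Under 111 the same pair of rows would cost four; the affine regime charges more for opening an indel but only one per additional residue, so long indels are no longer penalized in proportion to their length alone.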
Best
Jan
De Laet, J., and Smets, E. 1998. On the three-taxon approach to parsimony analysis. Cladistics 14: 363-381.
De Laet, J. 1997. A reconsideration of three-item analysis, the use of implied weights in cladistics, and a practical application in
Gentianaceae. Dissertation. Available at www.anagallis.be or in ResearchGate.
De Laet, J. 2003. When one and one is not two: parsimony analysis of sequence data. XXIIth Meeting of the Willi Hennig Society.
New York Botanical Garden, 20-24 July. Abstract appeared in Cladistics 20: 81 (2004). Also available at www.anagallis.be.
Grant, T., Frost, D.R., Caldwell, J.P., Gagliardo, R., Haddad, C.B., Kok, P.J.R., Means, D.B., Noonan, B.P., Schargel, W.E., Wheeler,
W.C., 2006. Phylogenetic systematics of dart-poison frogs and their relatives (Amphibia; Athesphatanura: Dendrobatidae). Bull.
Am. Mus. Nat. Hist. 299.
Kluge, A.G., 2005. What is the rationale for ‘Ockham’s Razor’ (a.k.a. parsimony) in phylogenetic inference? In: Albert, V. (Ed.),
Parsimony, Phylogeny, and Genomics. Oxford University Press, Oxford, pp. 15-42.
Contribution 4: Santiago - Wed 14/10/2015
Hi Jan,
Thanks for taking your time to explain your position. As you said, instrumentalist justifications such as
‘I don’t care what the underlying rationale is, operationally 111 has again and again proven to give decent
results, so I’ll stick to it’. Or ‘111 is simple, that’s sufficient’.
are empty and should not be used. I would greatly appreciate if you find the time to elaborate a bit more
on the 3221 cost regime.
In any case, I want to study your papers in detail to fully grasp your proposal. It sounds very interesting. I
will be in contact with you.
Cheers,
santiago
Contribution 5: Jan - Sat 24/10/2015
Hi Santiago,
You wondered if there’s a better way to do parsimony analysis in a tree alignment context than to assume
that indel events affect only single nucleotides at a time, or than to stick to purely instrumentalist
justifications. I think there is, and it indeed involves cost set 3221 for use in POY. Before getting to that cost
set, a lengthy introduction may be useful, though.
To illustrate some concepts and ideas, I will mainly be using this hypothetical dataset of four observed
sequences (that they have been put in a dataset for phylogenetic analysis means that they are
hypothesized to be orthologous):
Dataset D1
A gggaaaacccggg
B gggaaaaaatttggg
C gggaaaaaaaaaacccggg
D gggaaaaaaaaaaaatttggg
TREE ALIGNMENTS
A tree alignment is a concept that is due to David Sankoff (see Sankoff 1975, Sankoff and Cedergren 1983).
For a dataset of unaligned sequences, such as D1, a tree alignment consists of (1) a tree with the observed
sequences at the tips and reconstructed sequences at the inner nodes, and (2) a multiple alignment of
observed and reconstructed sequences alike. (The multiple alignment of the observed sequences that is
obtained by deleting the inner nodes from a tree alignment is called an implied alignment). A nice
representation of a tree alignment is a drawing of that tree in which each node is labelled with the
corresponding row of that multiple alignment. Unfortunately I can't do this here because ResearchGate
doesn't permit a non-proportional font or the use of graphics. So I'll have to use a somewhat less
clear representation, with trees in parenthetical notation and numbered inner nodes [I’ll stick to it in this
formatted version]. For four terminals there are three different unrooted trees:
T1: (1:(A B) 2:(C D))
T2: (3:(A C) 4:(B D))
T3: (5:(A D) 6:(B C))
Consider unrooted tree T1, (1:(A B) 2:(C D)). The single inner branch determines partition AB|CD. The inner
node at the AB-side is labelled node 1, the inner node at the CD side is called node 2. In the same way, the
inner nodes of trees T2 and T3 are called 3, 4, 5 and 6.
Given this convention, the following multiple alignment fully determines a tree alignment for dataset D1
on tree T1:
Tree alignment D1T1A1 (first tree alignment for dataset D1 on tree T1)
A gggaaaa--------cccggg
B gggaaaaaa------tttggg
C gggaaaaaaaaaa--cccggg
D gggaaaaaaaaaaaatttggg
1 gggaaaaaa------tttggg
2 gggaaaaaaaaaa--tttggg
It looks a bit garbled in the single font that ResearchGate provides, but it should be clear enough. If not, it
might be useful to paste it in some program that allows a non-proportional font (Courier, for example). [A
superfluous paragraph in this formatted version].
The length of tree alignment D1T1A1 is 21, so there are 21 positional characters. In total, these have six
substitutions: two in position 16, two in position 17, and two in position 18. Three of those are in the
terminal branch leading to A, the other three in the terminal branch leading to C. The length differences of
the observed sequences are explained by three indel events: an indel event of a subsequence of length
two along the branch leading to A, an indel of length four along the inner branch, and an indel of length
two along the branch leading to D. Whether such an indel is an insertion or a deletion depends on how the
tree would be rooted. So the explanation of the observed sequences that is provided by D1T1A1 requires a
total of nine evolutionary events or transformations.
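[The tally of nine events can be reproduced mechanically. The Python sketch below, added for illustration and not part of the exchange, walks the five branches of tree T1 and counts substitutions and indel events, an indel event being a maximal run of positions in which one node has bases and the other has unit gaps.]

```python
def branch_events(x, y):
    """Count (substitutions, indel events) between two aligned rows."""
    subs, indels = 0, 0
    in_indel = False
    for a, b in zip(x, y):
        if a == "-" and b == "-":
            in_indel = False           # shared unit gap: position inapplicable
        elif a == "-" or b == "-":
            if not in_indel:
                indels += 1            # a new indel event starts here
            in_indel = True
        else:
            if a != b:
                subs += 1
            in_indel = False
    return subs, indels

rows = {
    "A": "gggaaaa--------cccggg",
    "B": "gggaaaaaa------tttggg",
    "C": "gggaaaaaaaaaa--cccggg",
    "D": "gggaaaaaaaaaaaatttggg",
    "1": "gggaaaaaa------tttggg",
    "2": "gggaaaaaaaaaa--tttggg",
}
branches = [("A", "1"), ("B", "1"), ("1", "2"), ("2", "C"), ("2", "D")]
total_subs = total_indels = 0
for u, v in branches:
    s, i = branch_events(rows[u], rows[v])
    total_subs += s
    total_indels += i
print(total_subs, total_indels)  # 6 3
```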
Here’s another tree alignment for D1 on T1:
Tree alignment D1T1A2
A gggaaaa--------cccggg
B gggaaaaaa------tttggg
C gggaaaaaaaaaa--cccggg
D gggaaaaaaaaaaaatttggg
1 gggaaaaaa------cccggg
2 gggaaaaaaaaaa--cccggg
D1T1A1 and D1T1A2 only differ in the reconstructed sequences at positions 16, 17, and 18: both inner
nodes have a ‘t’ there in D1T1A1, but a ‘c’ in D1T1A2. And just as in D1T1A1, six substitutions are required
in these positions, and so the total number of evolutionary events in both tree alignments is the same. The
difference is that, in D1T1A2, the substitutions occur in the terminal branches that lead to B and D, not in
the branches that lead to A and C.
Here's a tree alignment for D1 on T2, also with 21 positions:
Tree alignment D1T2A1
A gggaaaa--------cccggg
B gggaaaaaa------tttggg
C gggaaaaaaaaaa--cccggg
D gggaaaaaaaaaaaatttggg
3 gggaaaaaaaaaa--cccggg
4 gggaaaaaaaaaa--tttggg
This tree alignment requires only three substitutions, all three along the single inner branch: one in
position 16, one in position 17, and one in position 18. The length differences still require only three indel
events, but of different lengths compared to D1T1A1 and D1T1A2: an indel of length six along the branch
to A, an indel of length four along the branch to B, and an indel of length two along the branch to D. So in
total, this tree alignment only requires six evolutionary events: three substitutions and three indel events.
One could be tempted to prefer D1T2A1 - and hence tree T2 - over D1T1A1 and D1T1A2 because it
requires fewer evolutionary events. But using that criterion, the three following tree alignments perform
even better:
Tree alignment D1T1A3
A -------------------------------------------------------gggaaaacccggg
B ----------------------------------------gggaaaaaatttggg-------------
C ---------------------gggaaaaaaaaaacccggg----------------------------
D gggaaaaaaaaaaaatttggg-----------------------------------------------
1 --------------------------------------------------------------------
2 --------------------------------------------------------------------
Tree alignment D1T2A2
A -------------------------------------------------------gggaaaacccggg
B ----------------------------------------gggaaaaaatttggg-------------
C ---------------------gggaaaaaaaaaacccggg----------------------------
D gggaaaaaaaaaaaatttggg-----------------------------------------------
3 --------------------------------------------------------------------
4 --------------------------------------------------------------------
Tree alignment D1T3A1
A -------------------------------------------------------gggaaaacccggg
B ----------------------------------------gggaaaaaatttggg-------------
C ---------------------gggaaaaaaaaaacccggg----------------------------
D gggaaaaaaaaaaaatttggg-----------------------------------------------
5 --------------------------------------------------------------------
6 --------------------------------------------------------------------
These three tree alignments can ‘explain’ the observations by postulating only four indel events (of lengths
21, 19, 15, and 13), an explanation that is optimal on every possible tree. They illustrate the
methodological breakdown that follows when minimizing equally weighted evolutionary events in a tree
alignment context and under the assumption that single indel events can involve more than one
nucleotide. But such tree alignments actually explain nothing at all. Given the prior hypothesis that the
observed sequences are orthologous, tree alignment D1T1A1 can, for example, explain the shared
presence in B and D of a stretch of three t’s near the end of their sequence (they inherited it from their
common ancestor). Trivial tree alignments D1T1A3, D1T2A2, D1T3A1 cannot explain such shared
presences.
I have argued (De Laet 2005, 2015) that a proper generalization of parsimony to tree alignments should
focus directly on what alternative hypotheses (tree alignments) can explain about observed empirical data,
not on evolutionary events that are required to that effect.
PARSIMONY AND EXPLANATION
When it comes to explanation, the basic observation is that “genealogies provide only a single kind of
explanation. A genealogy does not explain by itself why one group acquires a new feature while its sister
group retains the ancestral trait. A genealogy is able to explain observed points of similarity among
organisms just when it can account for them as identical by virtue of inheritance from a common
ancestor.” (Farris 1983, p. 13, as quoted by Farris 2006, pp. 825-826).
As an illustration, consider hypothetical dataset D2, a dataset with just a single morphological character:
Dataset D2
A 0
B 0
C 1
D 1
On tree T1, the shared presence of state zero in A and B (one observed point of similarity) and the shared
presence of state 1 in C and D (another observed point of similarity) can simultaneously be explained as
due to common descent. This explanation requires a character state reconstruction that assigns state 0 to
inner node 1, and state 1 to inner node 2.
This kind of explanation is independent of the position of the root. Assume for example that the tree is
rooted along the branch that leads to A. In that case, state 0 is plesiomorphic and A and B inherited it from
the common ancestor of all four terminals. Apomorphic state 1 arose along the branch that leads to the
common ancestor of C and D, and C and D inherited it from that common ancestor. Alternative rootings
will differ in assessment of plesiomorphy and apomorphy, but in all possible scenarios both the observed
point of similarity in state zero and the observed point of similarity in state one can be explained by
common ancestry.
On tree T2, the shared presence of state 0 in A and B and the shared presence of state 1 in C and D cannot
simultaneously be explained by common ancestry. It is possible to explain the shared presence of state 0 in
A and B in that way (with a reconstruction that assigns state 0 to both inner nodes), but then the shared
presence of state 1 cannot be so explained. It is then an instance of homoplasy or unexplained shared
similarity. Alternatively, it is possible to explain the shared presence of state 1 in C and D by common
ancestry (with a reconstruction that assigns state 1 to both inner nodes), but then the shared presence of
state 0 in A and B can no longer be so explained. The same is true for tree T3: either the shared presence of
state 0 in A and B or the shared presence of state 1 in C and D can be explained by common ancestry. But
not both.
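[These step counts can be checked with standard Fitch optimization for unordered characters, which the counting above presupposes. The minimal Python sketch below is my own illustration; an unrooted quartet is scored by rooting it along its inner branch, which does not change the parsimony length.]

```python
def fitch_steps(tree, states):
    """Fitch down-pass: tree is a nested tuple of tip names, states maps
    tip name to character state; returns (state set, number of steps)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    (s1, c1), (s2, c2) = (fitch_steps(sub, states) for sub in tree)
    inter = s1 & s2
    if inter:
        return inter, c1 + c2
    return s1 | s2, c1 + c2 + 1   # empty intersection: one extra step

# The single character of dataset D2.
states = {"A": 0, "B": 0, "C": 1, "D": 1}
for tree in [(("A", "B"), ("C", "D")),   # T1: 1 step
             (("A", "C"), ("B", "D")),   # T2: 2 steps
             (("A", "D"), ("B", "C"))]:  # T3: 2 steps
    print(tree, fitch_steps(tree, states)[1])
```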
For the simple character of dataset D2, there is at most a single unexplained point of shared similarity (none
on T1, one on trees T2 and T3). With more terminals, there can be more such instances within a single
character, and these may be logically interdependent. When minimizing homoplasy or unexplained
similarity - as a means to maximize explained observed similarity - not all instances of pairwise homoplasy
should then be counted, but only instances that are logically independent.
As Farris (2006, p. 826) put it (mainly by citing from his 1983 paper): It is common for homoplasies to be
logically interdependent (Farris, 1983, p. 20): “Suppose that a putative genealogy distributes [the 20
terminals showing feature X] into two distantly related groups A and B of ten terminals each. There are 100
distinct two-taxon comparisons of members of A with members of B, and each of those similarities in X
considered in isolation comprises a homoplasy… [But if] X is identical by descent in any two members of A,
and also in any two members of B, then the A-B similarities are all homoplasies if any one of them is.” But
fortunately it is easy to count mutually independent homoplasies (Farris, 1983, p. 20): “If a genealogy is
consistent with a single origin of a feature, then it can explain all similarities in that feature as identical by
descent. A point of similarity in a feature is then required to be a homoplasy only when the feature is
required to originate more than once on the genealogy. A hypothesis of homoplasy logically independent
of others is thus required precisely when a genealogy requires an additional origin of a feature. The
number of logically independent ad hoc hypotheses of homoplasy in a feature required by a genealogy is
then just one less than the number of times the feature is required to originate independently.”
The same is true when counting explained points of similarity, in order to maximize them directly
(De Laet 1997, pp. 66-67). Using Farris’ example, if terminals A1, A2, and A3 are all three members of group
A, then there are three observed points of similarity among those three terminals that can be explained by
inheritance and common descent: the similarity between A1 and A2, the similarity between A1 and A3, and
the similarity between A2 and A3. But these are logically interdependent: if any two out of those three
hold, then the third follows by necessity. When such logical interdependencies are properly taken into
account, the number of explained shared similarities in such a character on a tree varies directly with the
number of steps that are required for that character on that tree: every additional step amounts to one
less independent observed point of similarity that can be explained.
As each independent explained point of similarity involves two terminals, parsimony analysis can be
characterized as two-item analysis (see De Laet 1997, p. 66-67): it identifies the trees that maximize the
number of independent observed pairwise points of similarity that can simultaneously be explained by
inheritance and common descent. Observed pairwise similarities are the atomic units of empirical
comparative content of a dataset, and the units with which to measure the explanatory power of a tree
with optimized characters.
COMPOSITIONAL SEQUENCE HOMOLOGY
Rather than to rely on the relationship with number of steps, independent explained similarities in a
character on a tree can be counted directly.
Doing this for a general (unordered) morphological character, that number is equal to the number of
observations in the character minus one minus the number of steps that the character has on the tree (this
is a straightforward generalization of the binary case, discussed in De Laet 1997, pp. 66-67). For the
character of dataset D2, there are four observations (one in each terminal; missing information would not
be counted as an observation). On a tree that has an AB|CD partition, the character has one step and the
above number equals two (4 - 1 - 1). So there are two independent explained shared observed similarities.
The first one is the single similarity in state 0, the second the single observed similarity in state 1. On a tree
that does not have this partition, two steps are required and the above number amounts to one: either the
similarity in state 0 or the similarity in state 1. The tree with the AB|CD partition is preferred because it can
explain one more independent observed point of similarity.
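The calculation for this character (assuming, as implied above, state 0 in A and B and state 1 in C and D) can be sketched in Python with a simple Fitch step count on a rooted version of each tree; the `fitch_steps` helper below is illustrative code only, not an implementation from the cited papers.

```python
def fitch_steps(tree, states):
    """Count parsimony steps of an unordered character on a rooted
    binary tree given as nested tuples of terminal names."""
    def post(node):
        if isinstance(node, str):                   # a terminal
            return {states[node]}, 0
        (ls, lc), (rs, rc) = post(node[0]), post(node[1])
        if ls & rs:
            return ls & rs, lc + rc                 # state sets agree: no step
        return ls | rs, lc + rc + 1                 # disagreement: one step

    return post(tree)[1]

# Dataset D2's character: state 0 in A and B, state 1 in C and D.
states = {"A": "0", "B": "0", "C": "1", "D": "1"}

for tree in [(("A", "B"), ("C", "D")),   # has the AB|CD partition
             (("A", "C"), ("B", "D"))]:  # lacks it
    steps = fitch_steps(tree, states)
    explained = 4 - 1 - steps            # observations - 1 - steps
    print(steps, explained)              # 1 step, 2 explained; then 2, 1
```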
When applying this point of view to a set of unaligned sequences, things get more complicated because
there are no predefined positions and positional characters to which the above calculation can be applied.
But for any given tree alignment for that set of unaligned sequences, it is possible to count how many
independent points of sequence similarity in base composition can be explained by common descent and
inheritance.
That number is equal to the total number of nucleotides in the observed sequences minus the total
number of subcharacters in the tree alignment minus the total number of base substitutions within those
subcharacters (De Laet 2005, pp. 107-108; see also De Laet 2015, p. 552). I have called this the
compositional component of sequence homology (De Laet 2005, p. 106; see also De Laet 2015, p. 551).
In the above expression, a subcharacter of a tree alignment is a region in the tree where a particular
position is applicable. Position 13 of D1T1A1, for example, has two nucleotides: an ‘a’ in terminal C, and an
‘a’ in terminal D. The inner node that connects terminals C and D (inner node 2) also has a nucleotide in
that position. Therefore the two observed nucleotides at that position are in the same subcharacter in this
tree alignment. Within this subcharacter, there are no substitutions. As another example in that same tree
alignment, there are four nucleotides in position 18: a ‘c’ in A and C, and a ‘t’ in B and D. The two inner
nodes that connect these terminals also have a nucleotide at that position. So there is a single
subcharacter at position 18 as well. In this one, there are two substitutions.
For an example where a single position has more than one subcharacter, consider position 10 of the
following tree alignment of dataset D1 on tree T1:
Tree alignment D1T1A4
A gggaaaa--------cccggg
B gggaaaaaa------tttggg
C gggaaaaaaaaaa--cccggg
D gggaaaaaaaaaaaatttggg
1 gggaaaaaa------tttggg
2 gggaaaaaa------tttggg
That position has two nucleotides: an ‘a’ in terminal C and another ‘a’ in terminal D. But in this tree
alignment (a suboptimal one, to be sure), the inner node that connects these observed nucleotides (inner
node 2) does not have a nucleotide at that position. Therefore, these two observed nucleotides are not
directly comparable. They are in two different subcharacters or regions of applicability. Trivial regions, for
that matter: each subcharacter in position 10 of this tree alignment has only a single nucleotide.
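Counting subcharacters at a position amounts to counting connected regions of non-gap nodes on the tree. A sketch in Python, assuming tree T1 joins A and B at inner node 1 and C and D at inner node 2 (as described in the discussion of iterative-pass optimization); this is illustrative code, not from the cited papers:

```python
# Tree T1 as an adjacency list over terminals A-D and inner nodes 1, 2.
tree = {
    "A": ["1"], "B": ["1"], "C": ["2"], "D": ["2"],
    "1": ["A", "B", "2"], "2": ["C", "D", "1"],
}

# Tree alignment D1T1A4 (terminals plus reconstructed inner sequences).
rows = {
    "A": "gggaaaa--------cccggg",
    "B": "gggaaaaaa------tttggg",
    "C": "gggaaaaaaaaaa--cccggg",
    "D": "gggaaaaaaaaaaaatttggg",
    "1": "gggaaaaaa------tttggg",
    "2": "gggaaaaaa------tttggg",
}

def subcharacters(pos):
    """Number of regions of applicability (connected components of
    non-gap nodes on the tree) at a 1-based alignment position."""
    present = {n for n, seq in rows.items() if seq[pos - 1] != "-"}
    seen, components = set(), 0
    for node in present:
        if node in seen:
            continue
        components += 1
        stack = [node]                     # flood-fill one component
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(nb for nb in tree[cur] if nb in present)
    return components

print(subcharacters(10))   # 2: the 'a' in C and the 'a' in D are not connected
print(subcharacters(1))    # 1: all six nodes carry a nucleotide
```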
Within any given subcharacter of a tree alignment, the number of independent explained similarities in
base composition can be obtained by applying the formula for a regular unordered character (number of
observations minus one minus steps), but restricted to the terminals that participate in the subcharacter:
the number of nucleotides within the subcharacter minus one minus the number of transformations within
the subcharacter (see below for some examples). When this is summed over all subcharacters of a tree
alignment, the above grand total for the complete tree alignment is obtained: the total number of
nucleotides in the observed sequences minus the total number of subcharacters in the tree alignment
minus the number of substitutions within subcharacters.
Dataset D1, for example, has 68 observed nucleotides (the sum of the lengths of the observed sequences).
Tree alignment D1T1A1 has 21 positions, and in each of these positions the observed nucleotides are in a
single region of applicability. So D1T1A1 has 21 subcharacters. Within the subcharacters, there are six
substitutions: two in the single subcharacter at position 16, two in the single subcharacter at position 17,
and two in the single subcharacter at position 18. So the total measure of compositional homology equals
68 - 21 - 6 = 41.
Most of these 41 explained points of similarity in base composition are in the stretches of nucleotide ‘a’
and the stretches of nucleotide ‘g’ in the observed sequences. Take for example the first position: a single
subcharacter that comprises all four terminals. In this subcharacter, all terminals in it have an observed ‘g’.
This amounts to three independent instances of compositional homology (if, for example, the ‘g’ in A and B
is homologous, the ‘g’ in A and C is homologous, and the ‘g’ in C and D is homologous, then by necessity all
other pairwise homologies follow). This number (3) can be obtained as the number of observed
nucleotides in the subcharacter minus one minus the number of substitutions or steps within the
subcharacter: 4 - 1 - 0 = 3.
Or position 13 of D1T1A1, with a single subcharacter that comprises just C and D. Both terminals have an
observed ‘a’ there, which amounts to a single instance of compositional homology. Using the above
formula, this is obtained as 2 observed nucleotides minus 1 minus 0 substitutions.
Summing over all subcharacters in which no substitutions occur (all subcharacters except those at positions
16, 17 and 18), 38 such independent instances of compositional homology can be counted.
The other three are in the subcharacters at positions 16, 17, and 18, the only subcharacters with
substitutions in this tree alignment. Given that both inner nodes have a reconstructed ‘t’ for each of these
three subcharacters, the observed shared similarity in each of the three nucleotides ‘t’ near the end of the
observed sequences of B and D is homologous: their presence can be explained by common ancestry. The
similarity of the three nucleotides ‘c’ near the end of the observed sequences of A and C, on the other hand,
cannot be explained by common ancestry. In total, this amounts to three independent explained points of
similarity in these subcharacters: one in position 16, one in position 17, and one in position 18. (Applying the
formula for any of these three subcharacters: 4 observed nucleotides minus 1 minus 2 substitutions).
This result can be contrasted with tree alignment D1T2A1 for that same dataset. Just as D1T1A1, tree
alignment D1T2A1 has 21 positions and exactly one subcharacter at every position. But in total, there are
only three substitutions: one in the subcharacter at position 16, one in the subcharacter at position 17, and
one in the subcharacter at position 18. This amounts to the following grand total of explained similarity in
base composition: 68 - 21 - 3 = 44.
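The bookkeeping behind these two totals is trivial to sketch in Python (the numbers are exactly those derived above):

```python
def compositional_homology(observed_nucleotides, subcharacters, substitutions):
    """Independent explained points of similarity in base composition:
    observed nucleotides minus subcharacters minus substitutions."""
    return observed_nucleotides - subcharacters - substitutions

# D1T1A1: 68 observed nucleotides, 21 subcharacters, 6 substitutions.
print(compositional_homology(68, 21, 6))   # 41

# D1T2A1: same positions and subcharacters, only 3 substitutions.
print(compositional_homology(68, 21, 3))   # 44
```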
The difference with D1T1A1 is 3 (44 - 41). Given the similarity between tree alignments D1T1A1 and
D1T2A1, it is easy to verify this statement: all points of similarity in base composition that D1T1A1 can
explain, can also be explained by D1T2A1. The difference is that D1T2A1 can explain three more
similarities. These three additional explained similarities reside in the subcharacters at positions 16, 17,
and 18.
SUBSEQUENCE HOMOLOGY
Sequence homology in a tree alignment is not captured completely by just looking at nucleotide-level
homology within positions. This is illustrated using dataset D3:
Dataset D3
A aaaaaa
B aaaaaa
C aaagggaaa
D aaagggaaa
Consider these two tree alignments:
Tree alignment D3T1A1
A aaa---aaa
B aaa---aaa
C aaagggaaa
D aaagggaaa
1 aaa---aaa
2 aaagggaaa
Tree alignment D3T2A1
A aaa---aaa
B aaa---aaa
C aaagggaaa
D aaagggaaa
3 aaagggaaa
4 aaagggaaa
In D3T1A1, the shared presence of subsequence ‘ggg’ in the middle of the observed sequences of C and D
can be explained by common ancestry. As can its absence in A and B. In D3T2A1, that shared presence can
still be explained by common ancestry, but its absence in A and B can no longer be so explained. The
difference in explanatory power (one less explained point of similarity in D3T2A1) is exactly matched by
the difference in number of indel events that both tree alignments require: one indel event, of
subsequence ‘ggg’, along the inner branch of tree T1 for D3T1A1; two such events in tree T2 for
D3T2A1 (one along the branch leading to A, the other along the branch leading to B).
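This indel-event count can be sketched in Python for both tree alignments of D3. The pairing of terminals with inner nodes 3 and 4 in tree T2 is an assumption of this sketch; the count does not depend on it, because both inner nodes carry the same reconstructed sequence.

```python
def indel_events(parent, child):
    """Count maximal gap runs that differ between the sequences at the
    two ends of a branch; each such run is one insertion or deletion."""
    events, in_run = 0, False
    for p, c in zip(parent, child):
        diff = (p == "-") != (c == "-")
        if diff and not in_run:
            events += 1
        in_run = diff
    return events

# Tree alignment D3T1A1: node 1 joins A and B, node 2 joins C and D.
node1, node2 = "aaa---aaa", "aaagggaaa"
branches_T1 = [("aaa---aaa", node1), ("aaa---aaa", node1),
               (node1, node2), (node2, "aaagggaaa"), (node2, "aaagggaaa")]
print(sum(indel_events(a, b) for a, b in branches_T1))   # 1

# Tree alignment D3T2A1: both inner nodes reconstruct 'aaagggaaa', so
# 'ggg' must be lost twice, once on the branch to A and once to B.
node3 = node4 = "aaagggaaa"
branches_T2 = [("aaa---aaa", node3), ("aaa---aaa", node4),
               (node3, node4), (node3, "aaagggaaa"), (node4, "aaagggaaa")]
print(sum(indel_events(a, b) for a, b in branches_T2))   # 2
```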
In this simple case, the subsequence involved is identical in the two terminals where it is observed. But
that does not need to be the case. Consider dataset D4:
Dataset D4
A aaaaaa
B aaaaaa
C aaagggaaa
D aaagtgaaa
D3 and D4 are identical except for the middle position of the observed sequence of D: in D3 it is a ‘g’, in D4
a ‘t’.
Consider these two tree alignments for D4:
Tree alignment D4T1A1
A aaa---aaa
B aaa---aaa
C aaagggaaa
D aaagtgaaa
1 aaa---aaa
2 aaagggaaa
Tree alignment D4T2A1
A aaa---aaa
B aaa---aaa
C aaagggaaa
D aaagtgaaa
3 aaagggaaa
4 aaagggaaa
Even if the three middle positions in the sequences of C and D are no longer identical, the shared presence
of a subsequence at those positions can still be explained by common ancestry. I have called this
subsequence homology, a component of sequence homology that cannot be reduced to homology of
observed nucleotides within orthologous subsequences or subcharacters (De Laet, 2005, p. 106; see also
De Laet 2015: 551-552).
This can be illustrated by rooting tree T1 for example along the branch that leads to D (which involves the
assumption that D is a proper outgroup for this set of terminals). In that case, the absence in A and B of the
middle subsequence that is observed in C and D was inherited from their common ancestor, and it
provides a synapomorphy for A and B. One, moreover, that cannot be expressed in terms of composition
of an observed subsequence in those terminals.
In more complicated datasets and tree alignments, the subsequences that are involved in indel events
along different branches may not fully coincide. There are partially overlapping indels along different
branches in such cases. But even then the number of indel events can be used to compare this kind of
explained similarity between any two different tree alignments: whenever the first tree alignment has an
indel event that is not present in a second one, the first tree alignment has an unexplained similarity across
the branch with that indel, compared to the second one: depending on the root, either a subsequence that
was present got lost, or a subsequence that was not present before was gained. The same holds the other
way around. The net balance of indels between the two tree alignments is then a proper measure with
which their subsequence homology can be compared.
ADDING IT UP
So compositional homology in a tree alignment is directly measured by the number of nucleotides in the
observed sequences minus the number of subcharacters in the tree alignment minus the number of
substitutions within subcharacters. And the number of indel events of subsequences provides a measure
with which the amount of subsequence homology can be compared.
Adding it up, when comparing two tree alignments for a set of observed sequences, the tree alignment
with the higher amount of total (equally weighted) sequence similarity that can be explained as homology
is the one with the higher value for this aggregate number: the number of nucleotides in the observed
sequences minus the number of subcharacters in the tree alignment minus the number of substitutions
within subcharacters minus the number of indel events. This holds in general and does not depend on the
aligned length.
The aggregate number (and hence explained shared sequence similarity) is maximal for the tree(s) and tree
alignment(s) for which the sum of subcharacters, substitutions, and indel events is minimal.
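As a sketch in Python, applied to dataset D3 (working through that dataset by hand: it has 30 observed nucleotides, each of the nine positions forms a single subcharacter in both tree alignments, and there are no substitutions, so the alignments differ only in their indel events):

```python
def explained_sequence_similarity(nucleotides, subcharacters,
                                  substitutions, indel_events):
    """Total (equally weighted) sequence similarity explained as
    homology: nucleotides - subcharacters - substitutions - indels."""
    return nucleotides - subcharacters - substitutions - indel_events

# Dataset D3: 30 observed nucleotides (6 + 6 + 9 + 9).
print(explained_sequence_similarity(30, 9, 0, 1))   # 20 for D3T1A1
print(explained_sequence_similarity(30, 9, 0, 2))   # 19 for D3T2A1
```

The aggregate is maximal exactly where the sum of subcharacters, substitutions, and indel events is minimal, as stated above.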
POY
In POY, the substitution cost can be used to minimize substitutions. And indel events that span multiple
nucleotides can in principle be minimized by using a positive gap opening cost and a zero gap extension
cost. But POY does not provide a cost parameter to minimize subcharacters. So it may look like POY cannot
be used to maximize sequence similarity that can be explained as homology. But there is a practical work-around.
First some background on tree alignment algorithms. Exact algorithms to find an optimal tree alignment on
a given tree are so computationally complex that heuristic approximations are unavoidable in practice. This
is where POY’s algorithms such as DO (direct optimization) and iterative-pass optimization come in. These
algorithms never consider more than three sequences at a time when reconstructing a sequence at an
inner node.
Consider for example iterative-pass optimization, an approach due to Sankoff et al. (1973). Using tree T1 as
an example, this is a general description. The algorithm starts out with the calculation of an initial
reconstructed sequence for the two inner nodes. Next, the algorithm enters an iterative process in which it
tries to improve these initial reconstructions by repeatedly revisiting the inner nodes. Assume that it first
revisits inner node 1. This inner node has three incident branches: the branches leading to observed
sequences A and B and the branch leading to inner node 2. Using observed sequences A and B and the
reconstructed sequence at inner node 2, a new reconstructed sequence for node 1 is calculated (the so-called median sequence). Next it will consider node 2 and recalculate the reconstructed sequence there using the
sequences at nodes C, D, and the new reconstructed sequence at inner node 1. This is repeated until no
more improvements can be made. That is, until the cost stabilizes.
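The control flow of this iterative improvement can be sketched as follows. This is illustrative Python only: the positionwise-majority `median3` and the mismatch-counting `cost` are crude stand-ins for the exact three-sequence median and the real tree-alignment cost.

```python
def median3(s1, s2, s3):
    """Placeholder median: positionwise majority over equal-length
    sequences. A real implementation computes an exact median via
    three-sequence alignment (cf. Sankoff et al. 1973)."""
    return "".join(a if a in (b, c) else (b if b == c else a)
                   for a, b, c in zip(s1, s2, s3))

def cost(edges, seqs):
    """Stand-in cost: positionwise mismatches summed over all branches."""
    return sum(x != y for u, v in edges for x, y in zip(seqs[u], seqs[v]))

def iterative_pass(seqs, inner, neighbours, edges):
    """Revisit each inner node in turn, replacing its sequence by the
    median of its three neighbours, until the cost stops improving."""
    best = cost(edges, seqs)
    while True:
        for node in inner:
            a, b, c = (seqs[n] for n in neighbours[node])
            seqs[node] = median3(a, b, c)
        new = cost(edges, seqs)
        if new >= best:
            return seqs, best
        best = new

# Tree T1: inner node 1 joins A and B, inner node 2 joins C and D.
seqs = {"A": "aaa---aaa", "B": "aaa---aaa",
        "C": "aaagggaaa", "D": "aaagggaaa",
        "1": "aaagggaaa", "2": "aaa---aaa"}   # poor initial reconstructions
neighbours = {"1": ["A", "B", "2"], "2": ["C", "D", "1"]}
edges = [("A", "1"), ("B", "1"), ("1", "2"), ("2", "C"), ("2", "D")]

final, final_cost = iterative_pass(seqs, ["1", "2"], neighbours, edges)
print(final["1"], final["2"], final_cost)   # aaa---aaa aaagggaaa 3
```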
Computational complexity is such that it is doable to calculate an exact median sequence during any revisit
of an inner node. Exactness here means that the median sequence is guaranteed to be optimal given the
input, that is, the sequences at the ends of the three incident branches. In terms of optimizing the cost,
this algorithm definitely performs better than simple DO (at the expense of longer execution time). But still
it never considers interactions across longer distances than subtrees that consist of one inner node and its
three neighbouring nodes. As a result, exact optimality can only be guaranteed for datasets up to three
terminals. Beyond that, it provides a heuristic approximation. This is a general conclusion and does not
depend on the cost parameters being used.
Back to maximizing explained similarity. It can be shown that the 3221 cost regime measures, up to a
constant, (twice) the number of subcharacters, substitutions, and indel events for (sub)trees that consist of
one inner node and three neighbouring nodes (De Laet 2005, p. 109; see also De Laet 2015, pp. 562-563).
This means that explained sequence similarity is guaranteed to be optimal when using that cost set on such
(sub)trees. For larger trees or subtrees, that guarantee of optimality no longer holds. But POY’s algorithms
don’t guarantee optimality for such larger trees or subtrees to start with. So, in practice, 3221 is the best
possible heuristic approximation for maximization of explanatory power using the algorithms that are
available in POY.
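For concreteness, here is an illustrative Python scoring of a single aligned branch under a 3221-style cost set (gap opening 3, transitions and transversions 2, gap extension 1). The affine convention assumed here (extension charged for every gap position, so an indel run of length k costs 3 + k) is an assumption of this sketch, not necessarily POY's exact implementation.

```python
def branch_cost_3221(parent, child):
    """Score one aligned branch under a 3221-style cost set (sketch)."""
    total, in_gap = 0, False
    for a, b in zip(parent, child):
        gap = (a == "-") != (b == "-")          # exactly one side gapped
        if gap:
            total += 1                          # gap extension, per position
            if not in_gap:
                total += 3                      # gap opening, once per run
        elif a != b:
            total += 2                          # substitution (ts or tv alike)
        in_gap = gap
    return total

print(branch_cost_3221("aaa---aaa", "aaagggaaa"))   # 3 + 3*1 = 6
print(branch_cost_3221("aaacgtaaa", "aaacttaaa"))   # one substitution: 2
```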
A discussion of an analysis of dataset D1 using POY might be useful at this point, but I reckon that this post
is already way too long. So I’ll keep that for some other time. In the meantime I hope that this post may be
useful as a clarification of my position on these issues.
Best
-- Jan
REFERENCES
De Laet, J., 1997. A reconsideration of three-item analysis, the use of implied weights in cladistics, and a practical application in
Gentianaceae. Dissertation. Available at www.anagallis.be or in ResearchGate.
De Laet, J. 2005. Parsimony and the problem of inapplicables in sequence data. Pp. 81-116 in Albert, V. A. (ed.), Parsimony,
phylogeny and genomics. Oxford University Press.
De Laet, J., 2015. Parsimony analysis of unaligned sequence data: maximization of homology and minimization of homoplasy,
not minimization of operationally defined total cost or minimization of equally weighted transformations. Cladistics, 31:
550-567. doi: 10.1111/cla.12098.
Farris, J. S., 1983. The logical basis of phylogenetic analysis. Pp. 7-36 in: Platnick, N. I., Funk, V. A. (Eds.), Advances in Cladistics II.
Columbia University Press, New York.
Farris, J. S., 2008. Parsimony and explanatory power. Cladistics, 24: 825-847.
Sankoff, D. 1975. Minimal mutation trees of sequences. SIAM J. Appl. Math. 28: 35-42.
Sankoff, D., and Cedergren, R. J. 1983. Simultaneous comparison of three or more sequences related by a tree. Pp. 253-263 in
Sankoff, D., and Kruskal, J. (eds.), Time warps, string edits, and macromolecules: the theory and practice of sequence
comparison. CSLI Publications, Stanford, California (1999 reprint).
Sankoff, D., Morel, C., and Cedergren, R. J. 1973. Evolution of 5S RNA and the nonrandomness of base replacement. Nature
(New Biology) 245: 232-234.
Contribution 6: Santiago - Friday 06/05/2016
Dear Jan,
I am sorry it has taken me so long to answer. Thanks to your detailed correspondence, I was able to better
understand your ideas. Your papers in Cladistics are quite technical and I have to admit that I was having
difficulty following some of your arguments. I am also what you could call a “slow thinker” and like to have
my time to reflect on things. In February, I was doing field work in the mountain rainforests of Peru. For
almost a month, I was hiking every day looking for amphibians and squamates, living in a tent, and cooking
by the camp fire. It was a very nice setting and scenario to think calmly about your ideas. In any case, here
are my two cents on the topic.
In a nutshell, your concern arises from the fact that the assumption “indels never involve more than one
nucleotide at a time” is (i) an unrealistic biological assumption and (ii) logically irreconcilable with the
anti-superfluity principle used by Kluge and Grant.
Below, I will expand on these two points and how I think they should be considered in the light of character
independence.
An unrealistic biological assumption
I guess that nobody with some basic notions of biology will deny that “indels never involve more than one
nucleotide at a time” is false. However, you do not mention that this false assumption actually implies
other statements, equally important with regard to their implications for phylogenetics, such as:
- Indels do not always involve more than one nucleotide at a time
- Some indels, even those affecting more than one nucleotide, are the result of transformations of one nucleotide at a time
- Some indels affecting more than one nucleotide are the result of a combination of transformations of one nucleotide at a time and several nucleotides at a time
Following the same logic, we can also consider false the assumptions “mutations never involve more than
one nucleotide at a time” and “transformations of morphological characters never involve more than one
character at a time”, and the same sort of statements outlined above also follows from these false
assumptions.
You criticized the arguments of Kluge and Grant for considering “indels never involve more than one
nucleotide at a time” but at the same time the same criticism can be applied to your proposal because it
assumes that “indels always involve more than one nucleotide at a time”, which arguably is also
biologically unrealistic. I see both statements as extremes of a continuum, because, as explained above,
there are all sorts of situations in between. In other words, the number of possible permutations between
single character events (e.g., indels, mutations, transformation of phenotypic character) and single events
affecting multiple characters (e.g., indels affecting more than one nucleotide at a time, mutations affecting
more than one nucleotide at a time, transformation of more than one phenotypic character at a time)
within a real dataset is absurdly large! I guess that the only biologically plausible assumption in this context
is what my host at AMNH, Darrel Frost, used to say about evolution: “shit happens”, meaning that not only
are many of those possibilities evolutionarily plausible, but that they may have happened, at least once,
during the ~3500 million years of life on Earth.
Logically irreconcilable with the anti-superfluity principle
You said that if properly applied, the anti-superfluity principle used by Kluge and Grant to justify their view
favors your position that “indels always involve more than one nucleotide at a time” because the
explanation with fewer transformation events should always be favored. OK, I see the beauty of that, and also
that applying the same anti-superfluity principle naively takes us to the trivial solution that every difference
(not only indels) is most parsimoniously explained by a single transformation event affecting all the
characters. This would be similar to the example you provide on page 9 of our exchange, although a more
general case because it would also account for mutations.
What I call a naïve application of the anti-superfluity principle is the idea of applying this principle in a
vacuum, without considering anything else. Phylogenetic analyses are performed within a context of
auxiliary assumptions or principles, for example we consider that evolution is hierarchical and that
characters are independent. The second consideration is most crucial to the anti-superfluity principle.
When not considered, it leads to trivial solutions. I have never read any contribution by Grant and/or Kluge
defending this naïve use of the anti-superfluity principle. To the contrary, Grant & Kluge (2004) Cladistics,
20, 23-31 devote page 26 to the importance of character independence.
To avoid this type of trivial solution or naïve use of the anti-superfluity principle (i.e., where a single
transformation is invoked to explain all differences between characters) you suggest embracing
explanation of similarities computed as: (number of nucleotides in the observed sequences) - (number of
subcharacters in the tree alignment) - (number of substitutions within subcharacters) - (number of indel
events). However, you count indel events applying the assumption “indels always involve more than one
nucleotide at a time” by defining subsequence homology, so, if I understand it correctly, your approach
does not circumvent the naïve use of the anti-superfluity principle, at least when applied to indels
spanning more than one nucleotide. This also leaves an unanswered question: why should one not count
mutations the same way you are counting indels? That is, for a subsequence of nucleotides, the number of
nucleotides that are involved in mutations is irrelevant because it can be explained by a single event
accounting for all observed mutations, so the cost = 1. For example:
A aaa---
B atcttt
C aaattt
In tree (A, (B, C)) I can define the subsequence “x” involving the first three nucleotides and
postulate a single mutation event spanning two nucleotides (second and third positions of the alignment)
as an autapomorphy on the branch of terminal B, so the cost = 1, the same way that the subsequence “y”
involving the last three positions of the alignment is explained by a single indel event spanning three
nucleotides and having a cost = 1. As I explained above, any combination of subsequences is biologically
plausible, such as inversions or transpositions of series of nucleotides and loops of ribosomal genes.
However, you defend that sets of nucleotides such as those of subsequence “x” should be accounted for at the
atomistic level (counting every column and transformation independently), while those including indels
(e.g., subsequence “y”) should be interpreted as single events.
In summary, your approach is inconsistent. For certain sets of data (indels) you apply a naïve
anti-superfluity principle excluding character independence to calculate similarities, while for other types
(nucleotides) you apply a sophisticated version of the anti-superfluity principle considering character
independence. I do not see how your approach is logically more consistent with the principle of
explanation of similarities than that of Grant & Kluge with the principle of anti-superfluity. At worst, to me
they seem to be equally inconsistent, perhaps the difference is that Grant & Kluge are not thinking about
anti-superfluity in the absence of character independence.
Character Independence
One of the core assumptions of phylogenetics is that characters should be independent. We suspect, and
sometimes even know, that this assumption is often violated (as with the case of the assumption of
hierarchical evolution when species originate by hybridization). Clear examples of non-independent
characters are a phenotypic character and the gene that codes for it, several phenotypic characters that
evolve as a module, nucleotides that mutate simultaneously such as in ribosomal genes, and a single indel
affecting more than one nucleotide but coded as several independent indels. The issue of independence is
more complicated than this, but suffices to leave it at this stage to argue my case.
My point is that in most cases it becomes empirically impossible to test the independence of certain sets of
characters. DNA sequences are a good example. Let’s say we have these two sequences:
ATGCCA
GTACAC
Shall we explain the differences between the first and third positions of the alignment as two independent
point mutations or as a single inversion event? The same applies to positions 5 and 6 of the alignment. This
is just a simple example. It can be much more complicated but I guess it is easy to grasp that the two
options are biologically possible and non-testable (at least in this context).
What stops us from explaining everything as a single transformation event is the assumption of character
independence. Of course, if we know that two or more characters are not independent (e.g., a gene and
the phenotype coded by the gene) we should exclude one of the characters. In the absence of evidence,
we are better off keeping our assumption; otherwise we open Pandora’s box of subjectivity to justify why
some types of character sets should be explained by a single event, while others should be considered as
several point transformations. For example, consider these sequences:
ATG---CCA
GTATTTCAC
What’s the scientific reason behind explaining the three indels as a single event spanning three
nucleotides but explaining the mutations as single independent events? I see none. So once we let
subjectivity in to sweep away character independence, what stops us from asking: Why not explain all
the differences observed between the two sequences as a single event where the polymerase screwed up
and all the mutations and indels happened simultaneously?
Explanation and Similarity
Grant & Kluge (2004, 2009), Kluge (2005, 2007), and Kluge & Grant (2006) provide a detailed discussion of
why similarity is at best described but not explained, why using an object criterion of similarity is logically
incompatible with evolutionary theory (classes cannot change/evolve), why homoplasy cannot be a hypothesis
(ad hoc or not) because it explains nothing, and why transformations (the events) are the things to be explained
(the evidence) and not the objects. I have little to add to their careful discussions, except that in my
opinion there is always a gap between the conceptual and the operational, but the fact that our operations
are doomed to fail in some cases should not stop us from investigating. Operationally we work with and
code objects as delimiters of events, but that doesn’t mean that what we are trying to explain is the similarity
(or not) of the objects. A very good discussion on the distinction between the operational and the
conceptual is that of Frost & Kluge (1994) Cladistics, 10, 259-294 using species concepts and species
discovery operations. I see your focus on explaining similarities as an attempt to develop the concepts
from the methods, while in my opinion it should be the other way around.
Technical Report
Full-text available
Brazeau et al. (2017) recently published a paper with a single-character algorithm to calculate the score of a character with inapplicable data on a tree, aiming to maximize homology in such characters. In this note, I show by example (using their Fig. 3d) that their algorithm is insufficient to find all optimal inner node state reconstructions in such characters. The root cause seems to be a inherent built-in constraint on the evaluation of implied absence/presence characters in their algorithm. In more complex cases, this constraint can lead to an overestimation of the optimal minimal score of character hierarchies and to errors in the trees that are selected as optimal during tree search.
Article
Full-text available
Wheeler (2012) stated that minimization of ad hoc hypotheses as emphasized by Farris (1983) always leads to a preference for trivial optimizations when analysing unaligned sequence data, leaving no basis for tree choice. That is not correct. Farris's framework can be expressed as maximization of homology, a formulation that has been used to overcome the problems with inapplicables (it leads to the notion of subcharacters as a quantity to be co-minimized in parsimony analysis) and that is known not to lead to a preference for trivial optimizations when analysing unaligned sequence data. Maximization of homology, in turn, can be formulated as a minimization of ad hoc hypotheses of homoplasy in the sense of Farris, as shown here. These issues are not just theoretical but have empirical relevance. It is therefore also discussed how maximization of homology can be approximated under various weighting schemes in heuristic tree alignment programs, such as POY, that do not take into account subcharacters. Empirical analyses that use the so-called 3221 cost set (gap opening cost three, transversion and transition costs two, and gap extension cost one), the cost set that is known to be an optimal approximation under equally weighted homology in POY, are briefly reviewed. From a theoretical point of view, maximization of homology provides the general framework to understand such cost sets in terms that are biologically relevant and meaningful. Whether or not embedded in a sensitivity analysis, this is not the case for minimization of a cost that is defined in operational terms only. Neither is it the case for minimization of equally weighted transformations, a known problem that is not addressed by Kluge and Grant's (2006) proposal to invoke the anti-superfluity principle as a rationale for this minimization.
Chapter
Full-text available
"Problems can arise in parsimony analyses when data sets contain characters that are not applicable across all terminals. Examples of such characters are tail colour when some terminals lack tails, or positions in DNA sequences in which gaps are present. Focusing on regular single-column characters as classically used in phylogenetic analysis, Farris characterized parsimony as a method that maximizes explanatory power in the sense that most-parsimonious trees are best able to explain observed similarities among organisms by inheritance and common ancestry. This led De Laet to formulate parsimony analysis as two-item analysis, whereby parsimony maximizes the number of observed pairwise similarities that can be explained as identical by virtue of common descent, subject to two methodological constraints: the same evidence should not be taken into account multiple times, and the overall explanation must be free of internal contradictions. In this chapter, the way this formulation can be used to deal with the problem of inapplicables is discussed vis-à-vis the optimization of entire nucleotide sequences as complex characters in a tree alignment." [from http://oxfordindex.oup.com/view/10.1093/acprof:oso/9780199297306.003.0006]
Thesis
[Introduction, pp. xiii-xv] The central theme of this thesis is cladistics. This approach to phylogenetic analysis has its roots in Willi Hennig’s theoretical work of the fifties (see chapter 1), and after a modest take-off in the sixties and a period of exponential growth in the seventies and the eighties, cladistics has now become a basic tool in systematic research. Its merits are that it has stimulated the development of a conceptual framework that enables us to think and talk in a clear way about phylogenetic relationships, and that it provides a set of powerful methods to analyze systematic data in order to discover the underlying phylogenetic relationships. In the first chapter, a basic survey of the main concepts and terms of cladistics is given. It is presented in Dutch because to date no general introductions to cladistics exist in Dutch. At present, there is a set of generally accepted methods in cladistics. However, this does not mean that the theoretical work has come to an end. To the contrary, old ideas are constantly being refined and new ideas keep popping up. Two of those are treated in the second and the third chapter. The first one is three-item analysis, a method that was introduced some years ago as a novel approach to parsimony analysis in both biogeography (Nelson & Ladiges 1991a, b) and systematics (Nelson & Platnick 1991). The name three-item analysis refers to the fact that each statement about relationships between more than three items (areas in biogeography, homologous features in systematics) is decomposed into a series of basic statements, each of which involves only three items. Such a basic statement simply says which two of the three items are thought to be related more closely to each other than either is related to the third. 
Following its introduction, three-item analysis has been severely criticized because of three basic defects: (1) it is flawed because it presupposes that character evolution is irreversible; (2) it is flawed because basic statements that are not logically independent are treated as if they are; (3) it is flawed because some of the three-item statements that are considered as independent support for a given tree may be mutually exclusive on that tree. In the second chapter it is shown that these criticisms only relate to the particular way that the approach was implemented by Nelson & Platnick (1991), and an alternative implementation that solves each of the three basic problems is derived. However, the resulting method is not an improvement over standard parsimony analysis: it is identical to the standard approach but for one small constraint, which is a highly unnatural restriction on the maximum amount of homoplasy that may be concentrated in a single character state. As this restriction follows directly from the decomposition of character state distributions into basic statements, it is concluded that any approach that is based on such decompositions will be defective. The second one is about character weighting. Some years ago, Goloboff (1993a) proposed a non-iterative homoplasy-based weighting method in which the weight or fit of a character on a cladogram is defined as a hyperbolic decreasing function of its homoplasy. The best trees are those that have the highest total fit over all characters of a data set. Goloboff considered his approach to be in direct agreement with cladistic ideas, but most parsimonious trees are those trees that imply the lowest amount of weighted homoplasy (Farris 1983), and these are not necessarily the trees that imply that the characters have the highest total fit, as is shown in chapter three. Several implications of this observation are discussed, and an alternative way of weighting characters is proposed. 
A computer program in which this approach is available is discussed in appendix A, and the approach is illustrated by using an indecisive data set (see chapter six) and the morphological Gentianaceae data set that is presented in chapter five. Several cladistic analyses based on various types of data indicate that the Gentianaceae, a cosmopolitan family of medium size, is one of the principal families of a monophyletic order Gentianales. Recent developments concerning the order Gentianales are reviewed against a historical background in chapter four. While a consensus is emerging about the monophyly of the Gentianales, much work remains to be done concerning the interfamilial and intrafamilial relationships within the order. The most recent worldwide monograph of the Gentianaceae is over a century old (Gilg 1895). The 21 genera that are selected for the current analysis represent all of Gilg’s tribes and subtribes except Leiphaimeae, Rusbyantheae and Voyrieae. Standard parsimony analyses and analyses using Goloboff’s approach of maximum fit give congruent results as far as the global relationships are concerned. The best supported clade contains Eustoma (Tachiinae) and all included Gentianinae, Erythraeinae and Chironiinae. The basal parts of the cladograms, involving the woody tropical representatives and Exacum, are poorly resolved. This thesis is concluded with a short chapter on indecisive data sets. Goloboff (1991a, b) defined the cladistic decisiveness of a data set as the degree to which all possible resolved trees for the data set differ in length. He proposed a measure of the decisiveness of data sets, the DD statistic, and discussed some properties of indecisive data sets, a special type of data set for which every possible cladogram has the same length. His discussion of indecisive data sets was restricted to characters that have no missing entries.
In this chapter I will first show how indecisive data sets can be constructed when missing entries are present. Without missing entries, there is essentially only a single indecisive data set for a given number of taxa, but by allowing missing entries a wide variety of different indecisive data sets with a wide range of ensemble consistency and retention indices can be constructed (easy-to-calculate formulas for the length of an indecisive data set on a dichotomous tree and on an unresolved bush are derived in Appendix C). Such data sets are useful in the construction of hypothetical examples that illustrate the elusive nature of data decisiveness. It is concluded that simple measures such as Goloboff’s DD statistic are unable to capture the various aspects of the concept.
Article
Given a finite tree, some of whose vertices are identified with given finite sequences, it is shown how to construct sequences for all the remaining vertices simultaneously, so as to minimize the total edge-length of the tree. Edge-length is calculated by a metric whose biological significance is the mutational distance between two sequences.
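A well-known special case of the problem described in this abstract arises when the sequences are pre-aligned, so that each column can be optimized over the tree independently; that case is solved exactly by a Sankoff-style dynamic programme from the leaves to the root. The sketch below is only illustrative (tree encoding and names are invented here) and assumes a unit substitution cost, under which it reduces to Fitch parsimony for a single column.

```python
def min_changes(tree, leaf_states, states="ACGT",
                cost=lambda x, y: 0 if x == y else 1):
    """Minimum total cost of state changes for one character on a tree.

    The tree is a nested pair of nested pairs; leaves are strings that
    index into leaf_states. cost(x, y) is the cost of a change from
    state x to state y along an edge (unit cost by default).
    """
    INF = float("inf")

    def dp(node):
        # Returns, for each state x, the minimal cost of the subtree
        # rooted at this node given that the node is assigned state x.
        if isinstance(node, str):                    # leaf
            observed = leaf_states[node]
            return {x: (0 if x == observed else INF) for x in states}
        left, right = node
        L, R = dp(left), dp(right)
        return {x: min(L[y] + cost(x, y) for y in states)
                   + min(R[y] + cost(x, y) for y in states)
                for x in states}

    return min(dp(tree).values())
```

For the tree ((a,b),(c,d)) with states A, A, G, G at the leaves, the routine returns 1: a single change on the internal edge explains the column. The unaligned case treated by the abstract is much harder precisely because the columns themselves are not given in advance.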
Article
The following three basic defects for which three-taxon analysis has been rejected as a method for biological systematics are reviewed: (1) character evolution is a priori assumed to be irreversible; (2) basic statements that are not logically independent are treated as if they are; (3) three-taxon statements that are considered as independent support for a given tree may be mutually exclusive on that tree. It is argued that these criticisms only relate to the particular way the three-taxon approach was originally implemented. Four-taxon analysis, an alternative implementation that circumvents these problems, is derived. Four-taxon analysis is identical to standard parsimony analysis except for an unnatural restriction on the maximum amount of homoplasy that may be concentrated in a single character state. This restriction follows directly from the basic tenet of the three-taxon approach, that character state distributions should be decomposed into basic statements that are, in themselves, still informative with respect to relationships. A reconsideration of what constitutes an elementary relevant statement in systematics leads to a reformulation of standard parsimony as two-taxon analysis and to a rejection of four-taxon analysis as a method for biological systematics.
Chapter
Philosophers continue to debate the meaning and rationale behind 'Ockham's razor'. Two formally distinct justifications for parsimony can be distinguished: one that recommends positing as few theoretical components as possible, and a second recommending against positing the superfluous. Furthermore, parsimony as applied to phylogenetic inference can be separated into conceptual versus operational aspects. It is argued that phylogenetic inference is ideographic, springing from the idea that the relative recency of common ancestry can be represented directly as a concrete, spatio-temporally restricted, explainable thing - the cladogram - just as it can be accompanying transformations of inherited traits. Any phylogenetic method that assumes an evolutionary model can be criticized since it may assume counter-factual conditionals. Since models are usually statistical, relating them to the necessarily unique hypotheses of phylogeny is illogical. A further argument is that employing a model assumes more than background knowledge: that which is minimally sufficient to provide a causal explanation of historical individuality. Given the ideographic argument presented, in quantitative terms, parsimony should choose the hypothesis of cladistic relationships that minimizes the overall patristic difference, because that hypothesis has the greatest power to explain the independently heritable transformation events as propositions of homology.
Article
Parsimony can be related to explanatory power, either by noting that each additional requirement for a separate origin of a feature reduces the number of observed similarities that can be explained as inheritance from a common ancestor; or else by applying Popper’s formula for explanatory power together with the fact that parsimony yields maximum likelihood trees under No Common Mechanism (NCM). Despite deceptive claims made by some likelihoodists, most maximum likelihood methods cannot be justified in this way because they rely on unrealistic background assumptions. These facts have been disputed on the various grounds that ad hoc hypotheses of homoplasy are explanatory, that they are not explanatory, that character states are ontological individuals, that character data do not comprise evidence, that unrealistic theories can be used as background knowledge, that NCM is unrealistic, and that likelihoods cannot be used to evaluate explanatory power. None of these objections is even remotely well founded, and indeed most of them do not even seem to have been meant seriously, having instead been put forward merely to obstruct the development of phylogenetic methods. © The Willi Hennig Society 2008.