All content in this area was uploaded by Jan De Laet on May 10, 2016

Parsimony analysis of unaligned sequence data:

an exchange

10 May 2016, version 2

Introduction
Version 1
Changes and additions in version 2 (10/05/2016)
Contribution 1: Jan - Tuesday 06/10/2015
Contribution 2: Santiago - Tuesday 06/10/2015 20:01
Contribution 3: Jan – Saturday 10/10/2015
Contribution 4: Santiago - Wed 14/10/2015
Contribution 5: Jan - Sat 24/10/2015
TREE ALIGNMENTS
PARSIMONY AND EXPLANATION
COMPOSITIONAL SEQUENCE HOMOLOGY
SUBSEQUENCE HOMOLOGY
ADDING IT UP
POY
REFERENCES
Contribution 6: Santiago – Friday 06/05/2016
An unrealistic biological assumption
Logically irreconcilable with the anti-superfluity principle
Character Independence
Explanation and Similarity

Introduction

Version 1

A couple of weeks ago, browsing in www.researchgate.net, I came across this paper:

Phylogenetic systematics of egg-brooding frogs (Anura: Hemiphractidae) and the evolution of

direct development

SANTIAGO CASTROVIEJO-FISHER · JOSÉ M. JR. PADIAL · IGNACIO DE LA RIVA · JOSÉ P. POMBAL JR · HELIO R. DA

SILVA · FERNANDO J. M. ROJAS-RUNJAIC · ESTEBAN MEDINA-MÉNDEZ · DARREL R. FROST

Zootaxa 08/2015; 4004(1):1-75. DOI:10.11646/zootaxa.4004.1.1

This paragraph on p. 8 struck me:

“Optimality criterion and nucleotide homology. We chose the criterion of parsimony (unweighted) so that

our phylogenetic inferences minimize ad hoc assumptions and maximize falsifiability and explanatory

power of evidence (Wiley 1975; Farris 1983; Farris et al. 2001; Kluge 2001a, b, 2009; Kluge & Grant 2006;

Grant & Kluge 2009). We applied parsimony to tree-alignment (Sankoff 1975; Sankoff & Rousseau 1975;

Sankoff et al. 1976; Wheeler 1996) to infer the minimum number of transformation events needed to

explain observed differences (including indels) in DNA sequences (Grant & Kluge 2004; 2009; Kluge & Grant

2006; Wheeler et al. 2006; Grant & Kluge 2009; Padial et al. 2014).”

Researchgate has the nice feature that it allows direct feedback. So I asked Santiago and his coauthors a

question:

And an exchange followed.

Unfortunately, graphics cannot be included in this kind of feedback, and Researchgate only provides a single font, which happens to be proportional. These are two serious drawbacks when writing about sequence alignments. So I think it is useful to collect the ongoing online exchange in this pdf file, a format that allows some formatting to be added for readability. Other than removing some inconsistencies in punctuation, I will indicate changes and comments beyond what appeared online by putting them in green between square brackets.

This is an ongoing effort, so this pdf is expected to grow. I will document any future addition in this

introduction. In the meantime, I hope that the current version may help to clarify the issues that surround

parsimony analysis in a tree alignment context. Feel free to participate in the online discussion, or start a

new one here.

Big thanks to Santiago for taking this up!

Jan De Laet

Veltem-Beisem

27 October 2015

Changes and additions in version 2 (10/05/2016)

Addition of Santiago’s comments from 6 May 2016. Thanks to Santiago for providing me with a slightly

edited version compared to the online comment.

Contribution 1: Jan - Tuesday 06/10/2015

Hi Santiago and coauthors,

I have a question.

On page 8 you say that you applied parsimony to tree-alignment to infer the minimum number of

transformation events needed to explain observed differences (including indels) in DNA sequences.

But if you use minimization of transformation events as optimality criterion in a tree-alignment analysis,

then you end up with a methodological breakdown. This is so because any observed sequence can then be

explained by postulating just one big insertion event. For all except the most trivial datasets, that means

that the data are optimally explained on any possible tree by postulating just as many insertion events as

there are observed sequences. So there is no way to choose any tree over any other tree. (This is not new, I

discussed it at length in the two papers at the bottom of this note; one is from 2005, the other just

appeared in Cladistics).

I know that this is absurd from a biological point of view, but it follows from your stated goal to minimize

transformation events in a tree-alignment analysis.

In your paper you clearly prefer some trees over other trees, so you must have been minimizing something

else instead. Your paper does not contain the actual cost settings that you used in POY, so I am at a loss

trying to figure out exactly what. Can you help me out?

Thanks in advance,

Best regards,

-- Jan

https://www.researchgate.net/publication/260812208_Parsimony_and_the_problem_of_inapplicables_in_sequence_data

http://onlinelibrary.wiley.com/doi/10.1111/cla.12098/abstract

Contribution 2: Santiago - Tuesday 06/10/2015 20:01

Hi Jan,

Thanks for the feedback. I have never had anyone commenting on my papers so this is kind of new to me.

Anyways, it is very interesting to chat with other people interested in phylogenetics.

Are you Swedish? I did my PhD in Uppsala and have many emotional and professional links to Sweden

after living there for five years.

About your comment. Yes, I have read your papers. Nice contributions. Well, if I understand your comment

correctly, I think I can answer your concern.

I am considering characters at the "atomic" level including indels and nucleotides. Thus, transformation

series in our analysis are composed of single elements (i.e., a nucleotide or absence of nucleotide) and not

groups of nucleotides and/or indels. In this sense, I think that we are actually minimizing transformation

events. I agree with you that the way we expressed it in the paper could be confusing. I also agree with you

that it is a very naive approach to consider that all indels happen as insertion/deletions of single

nucleotides. What I am still unsure about is whether there is a better way to do parsimony tree-alignment.

Again, if I understand your position correctly, your idea rests on interpreting parsimony as two taxon

analysis (right?) as you suggest in De Laet & Smets (1998).

Best wishes,

Santiago

Contribution 3: Jan – Saturday 10/10/2015

Hi Santiago,

No, I'm not Swedish, I'm from Belgium. But as a postdoc I've been in Stockholm for a while, and for my

phylogenetic research I'm now associated with Gothenburg Botanical Garden. So quite some links to

Sweden here as well. I've also visited Brazil a couple of times, and also have good memories of those trips.

Thanks for your clarification and feedback. My reply here may suggest otherwise, but I do think that we

have a lot of common ground.

On p. 8 you write: "We chose the criterion of parsimony (unweighted) so that our phylogenetic inferences

minimize ad hoc assumptions and maximize falsifiability and explanatory power of evidence (Wiley 1975;

Farris 1983; Farris et al. 2001; Kluge 2001a, b, 2009; Kluge & Grant 2006; Grant & Kluge 2009)."

I agree that parsimony minimizes ad hoc assumptions and maximizes explanatory power of evidence. But

agreeing with that does not mean that I agree with all argumentation that is provided to that effect in the

references that you provide. I'll concentrate on what Kluge and Grant (2006; KG6) and Grant and Kluge

(2009; GK9) say about explanatory power.

They start from the philosophical principle of anti-superfluity, a parsimony principle. From that principle,

they argue that explanatory power is maximized when historical transformation events, including indels,

are minimized. They are explicit that this also applies in the context of tree alignments. But they don’t

discuss problems that might be posed by historical indel events that span multiple residues. They don’t

discuss how to deal with such events from a theoretical point of view, and they don’t discuss how such

events should be dealt with in practice, when analyzing empirical sequence data with a tree alignment

program such as POY.

But from papers such as Kluge (2005) and Grant et al. (2006; full references are given at the end), it is clear

that they both translate minimization of transformation events into cost set 111 for use in POY. As you

pointed out, this is the same cost set that you have been using: it assigns a cost of one to transitions, a cost

of one to transversions, and a cost of one to unit gaps. (I use the terminology that I also used in my 2005

and 2015 papers: a sequence such as 'a a a - - - a c c c t' has one gap that consists of three unit gaps).

This poses some fundamental problems. I'll use this small dataset of three sequences as an illustration:

A a a a a c c c t

B a a a a c c c t

C a a a c c c a c g c t

With cost set 111, the optimization on the single unrooted tree for three sequences has this implied

alignment:

A a a a - - - a c c c t

B a a a - - - a c c c t

C a a a c c c a c g c t

It comes at a total cost of four: one base substitution and three unit gaps. So the single gap of length three

is explained by postulating three distinct historical indel events, each one involving just a single position.

This exposes the hidden assumption in KG6 and GK9's rationale: historical indel events never involve more

than one nucleotide at a time. I agree that this is a naive assumption. But it sits deeply embedded in the

core of KG6 and GK9's view of parsimony analysis: in their view, each base substitution and each unit gap

stands for a distinct historical event.
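
The cost of four can be checked with a small dynamic-programming sketch. This is not POY's algorithm, only a plain pairwise alignment cost in which a substitution and a unit gap both cost one; the function name `cost_111` is made up for this illustration. Because A and B are identical, the single inner node of the three-terminal tree can be given the sequence of A, so the cost on the tree reduces to the pairwise cost between A and C.

```python
def cost_111(x: str, y: str) -> int:
    """Minimum alignment cost when substitutions and unit gaps all cost 1."""
    prev = list(range(len(y) + 1))           # empty prefix of x against prefixes of y
    for i, a in enumerate(x, start=1):
        curr = [i]                            # prefix of x against empty prefix of y
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,                  # unit gap opposite a residue of x
                curr[j - 1] + 1,              # unit gap opposite a residue of y
                prev[j - 1] + (a != b),       # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


A = "aaaaccct"
C = "aaacccacgct"
print(cost_111(A, C))  # 4: three unit gaps plus one substitution
```

Under this cost set the gap of length three always contributes three to the total, whether it arose in one event or in three.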

There is still a deeper problem. KG6 and GK9 rely on the philosophical notion of anti-superfluity. But

properly applied in this context, that principle implies the following: when there is a choice between an

explanation that involves three transformation events and an explanation that only involves a single

transformation event, then the explanation with only a single transformation event should be preferred. In

other words, if I can explain a single gap of length three with only one transformation event, I should not

postulate three such events. And this should also apply during analysis. So KG6 and GK9’s position not only

involves an unrealistic biological assumption; the theoretical core of their position is itself

self-contradictory in its application of anti-superfluity, the central concept on which that core is built (I’ve

discussed this in my recent paper in Cladistics).

You say that you are considering ‘characters at the "atomic" level including indels and nucleotides. Thus,

transformation series in our analysis are composed of single elements (i.e., a nucleotide or absence of

nucleotide) and not groups of nucleotides and/or indels. In this sense, I think that we are actually minimizing

transformation events’.

I agree with that, but only as far as it goes: you are minimizing some abstract concept of transformations

(and doing so postulating some abstract and non-standard notion of transformation series). But such

transformations have no sensible biological meaning, and they are not the kind of unique historical events

that KG6 and GK9 rely on in their theoretical framework (even if they give up that meaning when it comes

to empirical work or recommendations). So I don’t think it makes sense to refer to KG6 and GK9 as a

rationale for 111.

So, are there alternatives?

What happens, for example, if, within KG6 and GK9’s view of parsimony, you give up the assumption that

indels only affect single nucleotides at a time? That leads to giving up 111. The result is the methodological

breakdown that I mentioned in my previous post.

One might take a more pragmatic stance: first use 111 to get trees, then use those trees to infer where

indels might have affected multiple positions (to be sure, I’ve never seen this position defended in papers,

I’m just exploring possibilities). This might give decent results in practice, but as a method it is inconsistent:

during the analysis it is assumed that indels affect only single nucleotides at a time, after the analysis, this

assumption is given up. It may be useful to discuss an example. Consider this dataset:

A a a a

B a a a a a a

C a a a a a a a a a

D a a a a a a a a a a a a

Analysis with cost set 111 leads to the following total costs on the three different unrooted trees for four

terminals:

((A B)(C D)) -> total cost 9

((A C)(B D)) -> total cost 15

((A D)(B C)) -> total cost 15

So, during analysis, ((A B)(C D)) is preferred as the best explanation because it minimizes historical events

under the assumption that historical indel events only affect single nucleotides. After the analysis, it is then

concluded that the best explanation actually involves only three historical indel events (all three involving a

subsequence of length three). But, admitting that indel events can involve multiple nucleotides, the two

other trees can also explain the data with only three historical indel events (but involving subsequences of

different lengths). The net result is that equally good explanations under a realistic assumption are not

considered because they have previously been rejected during an analysis that was performed under an

unrealistic assumption.

One might still be more pragmatic: ‘I don’t care what the underlying rationale is, operationally 111 has

again and again proven to give decent results, so I’ll stick to it’. Or ‘111 is simple, that’s sufficient’. That

would come close to arguments of simplicity that Wheeler at times has voiced.

In 2003 and 2005 I’ve argued from a completely different perspective, one that you indeed could say was

inspired by my 1997 view of parsimony as two-item analysis (I only explicitly called it such in 1998, but the

ideas are there). It leads to a view that using 111 in POY amounts to a specific kind of differential weighting

of evidence (I’ve elaborated that in my recent paper in Cladistics). To obtain (an approximation of) equally

weighted evidence, one should use cost set 3221 (gap opening cost 3, transition and transversion cost 2,

gap extension cost 1). I’ll try to expand a bit on that later this weekend.

Best

Jan

De Laet, J., and Smets, E. 1998. On the three-taxon approach to parsimony analysis. Cladistics 14: 363-381.

De Laet, J. 1997. A reconsideration of three-item analysis, the use of implied weights in cladistics, and a practical application in

Gentianaceae. Dissertation. Available at www.anagallis.be or in ResearchGate.

De Laet, J. 2003. When one and one is not two: parsimony analysis of sequence data. XXIIth Meeting of the Willi Hennig Society.

New York Botanical Garden, 20 July – 24 July. Abstract appeared in Cladistics 20: 81 (2004). Also available at www.anagallis.be.

Grant, T., Frost, D.R., Caldwell, J.P., Gagliardo, R., Haddad, C.B., Kok, P.J.R., Means, D.B., Noonan, B.P., Schargel, W.E., Wheeler,

W.C., 2006. Phylogenetic systematics of dart-poison frogs and their relatives (Amphibia; Athesphatanura: Dendrobatidae). Bull.

Am. Mus. Nat. Hist. 299.

Kluge, A.G., 2005. What is the rationale for ‘Ockham’s Razor’ (a.k.a. parsimony) in phylogenetic inference? In: Albert, V. (Ed.),

Parsimony, Phylogeny, and Genomics. Oxford University Press, Oxford, pp. 15–42.

Contribution 4: Santiago - Wed 14/10/2015

Hi Jan,

Thanks for taking your time to explain your position. As you said, instrumentalist justifications such as

‘I don’t care what the underlying rationale is, operationally 111 has again and again proven to give decent

results, so I’ll stick to it’. Or ‘111 is simple, that’s sufficient’.

are empty and should not be used. I would greatly appreciate if you find the time to elaborate a bit more

on the 3221 cost regime.

In any case, I want to study your papers in detail to fully grasp your proposal. It sounds very interesting. I

will be in contact with you.

Cheers,

santiago

Contribution 5: Jan - Sat 24/10/2015

Hi Santiago,

You wondered if there’s a better way to do parsimony analysis in a tree alignment context than to assume

that indel events affect only single nucleotides at a time, or than to stick to purely instrumentalist

justifications. I think there is, and it indeed involves cost set 3221 for use in POY. Before getting to that cost

set, a lengthy introduction may be useful though.

To illustrate some concepts and ideas, I will mainly be using this hypothetical dataset of four observed

sequences (that they have been put in a dataset for phylogenetic analysis means that they are

hypothesized to be orthologous):

Dataset D1

A gggaaaacccggg

B gggaaaaaatttggg

C gggaaaaaaaaaacccggg

D gggaaaaaaaaaaaatttggg

TREE ALIGNMENTS

A tree alignment is a concept that is due to David Sankoff (see Sankoff 1975, Sankoff and Cedergren 1983).

For a dataset of unaligned sequences, such as D1, a tree alignment consists of (1) a tree with the observed

sequences at the tips and reconstructed sequences at the inner nodes, and (2) a multiple alignment of

observed and reconstructed sequences alike. (The multiple alignment of the observed sequences that is

obtained by deleting the inner nodes from a tree alignment is called an implied alignment). A nice

representation of a tree alignment is a drawing of that tree in which each node is labelled with the

corresponding row of that multiple alignment. Unfortunately I can't do this here because ResearchGate

doesn't permit a non-proportional font or the use of graphics here. So I'll have to use a somewhat less

clear representation, with trees in parenthetical notation and numbered inner nodes [I’ll stick to it in this

formatted version]. For four terminals there are three different unrooted trees:

T1: (1:(A B) 2:(C D))

T2: (3:(A C) 4:(B D))

T3: (5:(A D) 6:(B C))

Consider unrooted tree T1, (1:(A B) 2:(C D)). The single inner branch determines partition AB|CD. The inner

node at the AB-side is labelled node 1, the inner node at the CD side is called node 2. In the same way, the

inner nodes of trees T2 and T3 are called 3, 4, 5 and 6.

Given this convention, the following multiple alignment fully determines a tree alignment for dataset D1

on tree T1:

Tree alignment D1T1A1 (first tree alignment for dataset D1 on tree T1)

A gggaaaa--------cccggg

B gggaaaaaa------tttggg

C gggaaaaaaaaaa--cccggg

D gggaaaaaaaaaaaatttggg

1 gggaaaaaa------tttggg

2 gggaaaaaaaaaa--tttggg

It looks a bit garbled in the single font that ResearchGate provides, but it should be clear enough. If not, it

might be useful to paste it in some program that allows a non-proportional font (Courier, for example). [A

superfluous paragraph in this formatted version].

The length of tree alignment D1T1A1 is 21, so there are 21 positional characters. In total, these have six

substitutions: two in position 16, two in position 17, and two in position 18. Three of those are in the

terminal branch leading to A, the other three in the terminal branch leading to C. The length differences of

the observed sequences are explained by three indel events: an indel event of a subsequence of length

two along the branch leading to A, an indel of length four along the inner branch, and an indel of length

two along the branch leading to D. Whether such an indel is an insertion or a deletion depends on how the

tree would be rooted. So the explanation of the observed sequences that is provided by D1T1A1 requires a

total of nine evolutionary events or transformations.
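
The count of nine can be reproduced with a short sketch. It assumes that an indel event on a branch is a maximal run of columns in which the same end of the branch has a gap and the other end has a nucleotide, after dropping columns that are gaps at both ends. The function names `branch_events` and `tree_events` are invented for this illustration and do not come from POY or any other program.

```python
def branch_events(x, y):
    """Return (substitutions, indel_lengths) between two aligned rows."""
    subs, indels = 0, []
    prev = None                      # which row had the gap in the previous kept column
    for a, b in zip(x, y):
        if a == "-" and b == "-":
            continue                 # column absent at both ends of the branch
        if a == "-" or b == "-":
            side = "x" if a == "-" else "y"
            if side == prev:
                indels[-1] += 1      # extend the current indel
            else:
                indels.append(1)     # start a new indel event
            prev = side
        else:
            prev = None
            subs += a != b           # substitution if the nucleotides differ
    return subs, indels


def tree_events(rows, branches):
    """Sum substitutions and collect indel lengths over all branches."""
    subs, indels = 0, []
    for u, v in branches:
        s, lengths = branch_events(rows[u], rows[v])
        subs += s
        indels += lengths
    return subs, indels


D1T1A1 = {
    "A": "gggaaaa--------cccggg",
    "B": "gggaaaaaa------tttggg",
    "C": "gggaaaaaaaaaa--cccggg",
    "D": "gggaaaaaaaaaaaatttggg",
    "1": "gggaaaaaa------tttggg",
    "2": "gggaaaaaaaaaa--tttggg",
}
T1 = [("A", "1"), ("B", "1"), ("1", "2"), ("C", "2"), ("D", "2")]

subs, indels = tree_events(D1T1A1, T1)
print(subs, indels, subs + len(indels))  # 6 [2, 4, 2] 9
```

The same two functions can be applied to any other tree alignment by supplying its rows and the branch list of its tree.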

Here’s another tree alignment for D1 on T1:

Tree alignment D1T1A2

A gggaaaa--------cccggg

B gggaaaaaa------tttggg

C gggaaaaaaaaaa--cccggg

D gggaaaaaaaaaaaatttggg

1 gggaaaaaa------cccggg

2 gggaaaaaaaaaa--cccggg

D1T1A1 and D1T1A2 only differ in the reconstructed sequences at positions 16, 17, and 18: both inner

nodes have a ‘t’ there in D1T1A1, but a ‘c’ in D1T1A2. And just as in D1T1A1, six substitutions are required

in these positions, and so the total number of evolutionary events in both tree alignments is the same. The

difference is that, in D1T1A2, the substitutions occur in the terminal branches that lead to B and D, not in

the branches that lead to A and C.

Here's a tree alignment for D1 on T2, also with 21 positions:

Tree alignment D1T2A1

A gggaaaa--------cccggg

B gggaaaaaa------tttggg

C gggaaaaaaaaaa--cccggg

D gggaaaaaaaaaaaatttggg

3 gggaaaaaaaaaa--cccggg

4 gggaaaaaaaaaa--tttggg

This tree alignment requires only three substitutions, all three along the single inner branch: one in

position 16, one in position 17, and one in position 18. The length differences still require only three indel

events, but of different lengths compared to D1T1A1 and D1T1A2: an indel of length six along the branch

to A, an indel of length four along the branch to B, and an indel of length two along the branch to D. So in

total, this tree alignment only requires six evolutionary events: three substitutions and three indel events.

One could be tempted to prefer D1T2A1 - and hence tree T2 – over D1T1A1 and D1T1A2 because it

requires fewer evolutionary events. But using that criterion, the three following tree alignments perform

even better:

Tree alignment D1T1A3

A -------------------------------------------------------gggaaaacccggg

B ----------------------------------------gggaaaaaatttggg-------------

C ---------------------gggaaaaaaaaaacccggg----------------------------

D gggaaaaaaaaaaaatttggg-----------------------------------------------

1 --------------------------------------------------------------------

2 --------------------------------------------------------------------

Tree alignment D1T2A2

A -------------------------------------------------------gggaaaacccggg

B ----------------------------------------gggaaaaaatttggg-------------

C ---------------------gggaaaaaaaaaacccggg----------------------------

D gggaaaaaaaaaaaatttggg-----------------------------------------------

3 --------------------------------------------------------------------

4 --------------------------------------------------------------------

Tree alignment D1T3A1

A -------------------------------------------------------gggaaaacccggg

B ----------------------------------------gggaaaaaatttggg-------------

C ---------------------gggaaaaaaaaaacccggg----------------------------

D gggaaaaaaaaaaaatttggg-----------------------------------------------

5 --------------------------------------------------------------------

6 --------------------------------------------------------------------

These three tree alignments can ‘explain’ the observations by postulating only four indel events (of lengths

21, 19, 15, and 13), an explanation that is optimal on every possible tree. They illustrate the

methodological breakdown that follows when minimizing equally weighted evolutionary events in a tree

alignment context and under the assumption that single indel events can involve more than one

nucleotide. But such tree alignments actually explain nothing at all. Given the prior hypothesis that the

observed sequences are orthologous, tree alignment D1T1A1 can, for example, explain the shared

presence in B and D of a stretch of three t’s near the end of their sequence (they inherited it from their

common ancestor). Trivial tree alignments D1T1A3, D1T2A2, D1T3A1 cannot explain such shared

presences.
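
A minimal sketch of how such a trivial tree alignment can be built for any dataset. The helper `trivial_tree_alignment` is a name invented here; it places every observed sequence in its own block of columns and reconstructs every inner node as all gaps. Each terminal then differs from its inner node by exactly one contiguous block, so the number of indel events equals the number of observed sequences, whatever the tree.

```python
def trivial_tree_alignment(seqs):
    """Staircase alignment: one private block of columns per observed sequence.

    Returns the aligned terminal rows and the all-gap row used for every inner node.
    """
    total = sum(len(s) for s in seqs.values())
    rows, offset = {}, total
    for name, s in seqs.items():
        offset -= len(s)             # blocks are laid out from right to left
        rows[name] = "-" * offset + s + "-" * (total - offset - len(s))
    return rows, "-" * total


D1 = {
    "A": "gggaaaacccggg",
    "B": "gggaaaaaatttggg",
    "C": "gggaaaaaaaaaacccggg",
    "D": "gggaaaaaaaaaaaatttggg",
}
rows, inner = trivial_tree_alignment(D1)
for name, row in rows.items():
    print(name, row)

# One indel event per terminal branch, spanning the whole observed sequence.
indel_lengths = sorted((len(row.strip("-")) for row in rows.values()), reverse=True)
print(indel_lengths)  # [21, 19, 15, 13]
```

No tree appears anywhere in the construction, which is the point: the result is equally "optimal" on T1, T2, and T3.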

I have argued (De Laet 2005, 2015) that a proper generalization of parsimony to tree alignments should

focus directly on what alternative hypotheses (tree alignments) can explain about observed empirical data,

not on evolutionary events that are required to that effect.

PARSIMONY AND EXPLANATION

When it comes to explanation, the basic observation is that “genealogies provide only a single kind of

explanation. A genealogy does not explain by itself why one group acquires a new feature while its sister

group retains the ancestral trait. … A genealogy is able to explain observed points of similarity among

organisms just when it can account for them as identical by virtue of inheritance from a common

ancestor.” (Farris 1983, p. 13, as quoted by Farris 2006, pp. 825-826).

As an illustration, consider hypothetical dataset D2, a dataset with just a single morphological character:

Dataset D2

A 0

B 0

C 1

D 1

On tree T1, the shared presence of state zero in A and B (one observed point of similarity) and the shared

presence of state 1 in C and D (another observed point of similarity) can simultaneously be explained as

due to common descent. This explanation requires a character state reconstruction that assigns state 0 to

inner node 1, and state 1 to inner node 2.

This kind of explanation is independent of the position of the root. Assume for example that the tree is

rooted along the branch that leads to A. In that case, state 0 is plesiomorphic and A and B inherited it from

the common ancestor of all four terminals. Apomorphic state 1 arose along the branch that leads to the

common ancestor of C and D, and C and D inherited it from that common ancestor. Alternative rootings

will differ in assessment of plesiomorphy and apomorphy, but in all possible scenarios both the observed

point of similarity in state zero and the observed point of similarity in state one can be explained by

common ancestry.

On tree T2, the shared presence of state 0 in A and B and the shared presence of state 1 in C and D cannot

simultaneously be explained by common ancestry. It is possible to explain the shared presence of state 0 in

A and B in that way (with a reconstruction that assigns state 0 to both inner nodes), but then the shared

presence of state 1 cannot be so explained. It is then an instance of homoplasy or unexplained shared

similarity. Alternatively, it is possible to explain the shared presence of state 1 in C and D by common

ancestry (with a reconstruction that assigns state 1 to both inner nodes), but then the shared presence of

state 0 in A and B can no longer be so explained. The same is true for tree T3: either the shared presence of

state 0 in A and B or the shared presence of state 1 in C and D can be explained by common ancestry. But

not both.
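
The difference between the three trees can also be seen by counting steps. The sketch below uses the standard Fitch procedure for one unordered character; the nested-tuple tree encoding and the name `fitch_steps` are choices made for this illustration. Each tree is rooted on its inner branch for convenience, which does not affect the step count.

```python
def fitch_steps(tree, states):
    """Number of steps of one unordered character on a binary tree.

    `tree` is a nested tuple of terminal names; `states` maps terminal -> state.
    """
    steps = 0

    def down(node):
        nonlocal steps
        if isinstance(node, str):
            return {states[node]}            # terminal: its observed state
        left, right = (down(child) for child in node)
        if left & right:
            return left & right              # shared state, no step needed
        steps += 1                           # disjoint state sets cost one step
        return left | right

    down(tree)
    return steps


D2 = {"A": "0", "B": "0", "C": "1", "D": "1"}
trees = {
    "T1": (("A", "B"), ("C", "D")),
    "T2": (("A", "C"), ("B", "D")),
    "T3": (("A", "D"), ("B", "C")),
}
for name, tree in trees.items():
    print(name, fitch_steps(tree, D2))
# T1 1
# T2 2
# T3 2
```

The extra step on T2 and T3 corresponds to the one shared similarity that cannot be explained by common ancestry on those trees.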

For the simple character of dataset D2, there is at most a single unexplained point of shared similarity (none

on T1, one on trees T2 and T3). With more terminals, there can be more such instances within a single

character, and these may be logically interdependent. When minimizing homoplasy or unexplained

similarity - as a means to maximize explained observed similarity - not all instances of pairwise homoplasy

should then be counted, but only instances that are logically independent.

As Farris (2006, p. 826) put it (mainly by citing from his 1983 paper): ‘It is common for homoplasies to be

logically interdependent (Farris, 1983, p. 20): “Suppose that a putative genealogy distributes [the 20

terminals showing feature X] into two distantly related groups A and B of ten terminals each. There are 100

distinct two-taxon comparisons of members of A with members of B, and each of those similarities in X

considered in isolation comprises a homoplasy… [But if] X is identical by descent in any two members of A,

and also in any two members of B, then the A-B similarities are all homoplasies if any one of them is.” But

fortunately it is easy to count mutually independent homoplasies (Farris, 1983, p. 20): “If a genealogy is

consistent with a single origin of a feature, then it can explain all similarities in that feature as identical by

descent. A point of similarity in a feature is then required to be a homoplasy only when the feature is

required to originate more than once on the genealogy. A hypothesis of homoplasy logically independent

of others is thus required precisely when a genealogy requires an additional origin of a feature. The

number of logically independent ad hoc hypotheses of homoplasy in a feature required by a genealogy is

then just one less than the number of times the feature is required to originate independently.”’

The same is true when directly counting explained points of similarity, in order to maximize them directly

(De Laet 1997, pp. 66-67). Using Farris’ example, if terminals A1, A2, and A3 are all three members of group

A, then there are three observed points of similarity among those three terminals that can be explained by

inheritance and common descent: the similarity between A1 and A2, the similarity between A1 and A3, and

the similarity between A2 and A3. But these are logically interdependent: if any two of those three

hold, then the third follows by necessity. When such logical interdependencies are properly taken into

account, the number of explained shared similarities in such a character on a tree varies directly with the

number of steps that are required for that character on that tree: every additional step amounts to one

less independent observed point of similarity that can be explained.
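
The interdependence among pairwise similarities can be illustrated with a small sketch. It treats each pairwise similarity as a link between two terminals and calls a similarity independent when it is not already implied by the links accepted so far. This union-find bookkeeping is only one convenient way to show the counting argument; the function name is invented here.

```python
from itertools import combinations


def independent_similarities(terminals):
    """Return (number of pairwise similarities, list of independent ones)."""
    parent = {t: t for t in terminals}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]    # path halving
            t = parent[t]
        return t

    pairs = list(combinations(terminals, 2))
    independent = []
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:                         # not yet implied by earlier pairs
            parent[ra] = rb
            independent.append((a, b))
    return len(pairs), independent


total, indep = independent_similarities(["A1", "A2", "A3"])
print(total, indep)  # 3 [('A1', 'A2'), ('A1', 'A3')]
```

For k terminals that share a state by common descent there are k(k-1)/2 pairwise similarities but only k-1 independent ones, for example 45 and 9 for a group of ten terminals.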

As each independent explained point of similarity involves two terminals, parsimony analysis can be

characterized as two-item analysis (see De Laet 1997, pp. 66-67): it identifies the trees that maximize the

number of independent observed pairwise points of similarity that can simultaneously be explained by

inheritance and common descent. Observed pairwise similarities are the atomic units of empirical

comparative content of a dataset, and the units with which to measure the explanatory power of a tree

with optimized characters.

COMPOSITIONAL SEQUENCE HOMOLOGY

Rather than relying on the relationship with the number of steps, independent explained similarities in a

character on a tree can be counted directly.

Doing this for a general (unordered) morphological character, that number is equal to the number of

observations in the character minus one minus the number of steps that the character has on the tree (this

is a straightforward generalization of the binary case, discussed in De Laet 1997, pp. 66-67). For the

character of dataset D2, there are four observations (one in each terminal; missing information would not

be counted as an observation). On a tree that has an AB|CD partition, the character has one step and the

above number equals two (4 – 1 – 1). So there are two independent explained shared observed similarities.

The first one is the single similarity in state 0, the second the single observed similarity in state 1. On a tree

that does not have this partition, two steps are required and the above number amounts to one: either the

similarity in state 0 or the similarity in state 1. The tree with the AB|CD partition is preferred because it can

explain one more independent observed point of similarity.
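
As an illustration, this count can be sketched in a few lines of Python. The sketch assumes that the single character of dataset D2 is scored 0 in A and B and 1 in C and D (an assumption that is consistent with the description above; the dataset itself is not repeated here). Trees are written as nested tuples rooted along an arbitrary branch, and steps are counted with Fitch's procedure for unordered characters. The names fitch and explained_similarities are illustrative only.

```python
def fitch(tree, states):
    """Return (state set, steps) for one unordered character on a binary tree.

    `tree` is a nested tuple of terminal names; `states` maps each terminal
    to its observed state, or to None when the observation is missing.
    """
    if isinstance(tree, str):
        state = states[tree]
        return (None if state is None else {state}), 0
    (left_set, left_steps), (right_set, right_steps) = (fitch(sub, states) for sub in tree)
    steps = left_steps + right_steps
    if left_set is None or right_set is None:
        # A subtree without observations adds no constraint and no step.
        return (left_set or right_set), steps
    common = left_set & right_set
    if common:
        return common, steps
    return left_set | right_set, steps + 1


def explained_similarities(tree, states):
    """Observations minus one minus steps, as in the text."""
    observations = sum(state is not None for state in states.values())
    _, steps = fitch(tree, states)
    return observations - 1 - steps


# Assumed scoring of the character of dataset D2 (see the lead-in).
D2 = {"A": 0, "B": 0, "C": 1, "D": 1}

print(explained_similarities((("A", "B"), ("C", "D")), D2))  # tree with AB|CD
print(explained_similarities((("A", "C"), ("B", "D")), D2))  # tree without AB|CD
```

Under these assumptions the first call gives 2 and the second gives 1, matching the counts in the text.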

When applying this point of view to a set of unaligned sequences, things get more complicated because

there are no predefined positions and positional characters to which the above calculation can be applied.

But for any given tree alignment for that set of unaligned sequences, it is possible to count how many

independent points of sequence similarity in base composition can be explained by common descent and

inheritance.

That number is equal to the total number of nucleotides in the observed sequences minus the total

number of subcharacters in the tree alignment minus the total number of base substitutions within those

subcharacters (De Laet 2005, pp. 107-108; see also De Laet 2015, p. 552). I have called this the

compositional component of sequence homology (De Laet 2005, p. 106; see also De Laet 2015, p. 551).

In the above expression, a subcharacter of a tree alignment is a region in the tree where a particular

position is applicable. Position 13 of D1T1A1, for example, has two nucleotides: an ‘a’ in terminal C, and an

‘a’ in terminal D. The inner node that connects terminals C and D (inner node 2) also has a nucleotide in

that position. Therefore the two observed nucleotides at that position are in the same subcharacter in this

tree alignment. Within this subcharacter, there are no substitutions. As another example in that same tree

alignment, there are four nucleotides in position 18: a ‘c’ in A and C, and a ‘t’ in B and D. The two inner

nodes that connect these terminals also have a nucleotide at that position. So there is a single

subcharacter at position 18 as well. In this one, there are two substitutions.

For an example where a single position has more than one subcharacter, consider position 10 of the

following tree alignment of dataset D1 on tree T1:

Tree alignment D1T1A4

A gggaaaa--------cccggg

B gggaaaaaa------tttggg

C gggaaaaaaaaaa--cccggg

D gggaaaaaaaaaaaatttggg

1 gggaaaaaa------tttggg

2 gggaaaaaa------tttggg

That position has two nucleotides: an ‘a’ in terminal C and another ‘a’ in terminal D. But in this tree

alignment (a suboptimal one, to be sure), the inner node that connects these observed nucleotides (inner

node 2) does not have a nucleotide at that position. Therefore, these two observed nucleotides are not

directly comparable. They are in two different subcharacters or regions of applicability. Trivial regions, for

that matter: each subcharacter in position 10 of this tree alignment has only a single nucleotide.

Within any given subcharacter of a tree alignment, the number of independent explained similarities in

base composition can be obtained by applying the formula for a regular unordered character (number of

observations minus one minus steps), but restricted to the terminals that participate in the subcharacter:

the number of nucleotides within the subcharacter minus one minus the number of transformations within

the subcharacter (see below for some examples). When this is summed over all subcharacters of a tree

alignment, the above grand total for the complete tree alignment is obtained: the total number of

nucleotides in the observed sequences minus the total number of subcharacters in the tree alignment

minus the number of substitutions within subcharacters.
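
A minimal sketch of this calculation follows, under some stated assumptions. Tree T1 is taken to have inner node 1 connecting A and B and inner node 2 connecting C and D. A tree alignment is stored as one gapped string per node. A subcharacter at a position is found as a connected group of nodes that all have a nucleotide there and that contains at least one terminal, and substitutions are counted as branches within such a group whose two ends differ. All function and variable names are illustrative. The input is tree alignment D1T1A4 as shown above.

```python
TERMINALS = ("A", "B", "C", "D")
# Assumed shape of tree T1: node 1 joins A and B, node 2 joins C and D.
T1_EDGES = [("A", "1"), ("B", "1"), ("1", "2"), ("C", "2"), ("D", "2")]


def row(n_a, n_gap, tail):
    """Build one 21-position row of D1T1A4: ggg, a-run, gaps, tail, ggg."""
    return "ggg" + "a" * n_a + "-" * n_gap + tail + "ggg"


D1T1A4 = {
    "A": row(4, 8, "ccc"),
    "B": row(6, 6, "ttt"),
    "C": row(10, 2, "ccc"),
    "D": row(12, 0, "ttt"),
    "1": row(6, 6, "ttt"),
    "2": row(6, 6, "ttt"),
}


def compositional_homology(alignment, edges, terminals):
    """Return (observed nucleotides, subcharacters, substitutions)."""
    n_positions = len(alignment[terminals[0]])
    nucleotides = sum(len(alignment[t].replace("-", "")) for t in terminals)
    subcharacters = substitutions = 0
    for pos in range(n_positions):
        present = {name for name, seq in alignment.items() if seq[pos] != "-"}
        live = [(u, v) for u, v in edges if u in present and v in present]
        substitutions += sum(alignment[u][pos] != alignment[v][pos] for u, v in live)
        parent = {name: name for name in present}

        def find(name):
            while parent[name] != name:
                name = parent[name]
            return name

        for u, v in live:  # merge nodes joined by a branch with nucleotides at both ends
            parent[find(u)] = find(v)
        subcharacters += len({find(t) for t in terminals if t in present})
    return nucleotides, subcharacters, substitutions


n, s, t = compositional_homology(D1T1A4, T1_EDGES, TERMINALS)
print(n, s, t, n - s - t)
```

With these assumptions, D1T1A4 comes out at 68 nucleotides, 25 subcharacters and 6 substitutions, so 68 – 25 – 6 = 37 independent explained points of similarity in base composition.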

Dataset D1, for example, has 68 observed nucleotides (the sum of the lengths of the observed sequences).

Tree alignment D1T1A1 has 21 positions, and in each of these positions the observed nucleotides are in a

single region of applicability. So D1T1A1 has 21 subcharacters. Within the subcharacters, there are six

substitutions: two in the single subcharacter at position 16, two in the single subcharacter at position 17,

and two in the single subcharacter at position 18. So the total measure of compositional homology equals

68 – 21 – 6 = 41.

Most of these 41 explained points of similarity in base composition are in the stretches of nucleotide ‘a’

and the stretches of nucleotide ‘g’ in the observed sequences. Take for example the first position: a single

subcharacter that comprises all four terminals. In this subcharacter, all terminals have an observed ‘g’.

This amounts to three independent instances of compositional homology (if, for example, the ‘g’ in A and B

is homologous, the ‘g’ in A and C is homologous, and the ‘g’ in C and D is homologous, then by necessity all

other pairwise homologies follow). This number (3) can be obtained as the number of observed

nucleotides in the subcharacter minus one minus the number of substitutions or steps within the

subcharacter: 4 – 1 – 0 = 3.

Or position 13 of D1T1A1, with a single subcharacter that comprises just C and D. Both terminals have an

observed ‘a’ there, which amounts to a single instance of compositional homology. Using the above

formula, this is obtained as 2 observed nucleotides minus 1 minus 0 substitutions.

Summing over all subcharacters in which no substitutions occur (all subcharacters except those at positions

16, 17 and 18), 38 such independent instances of compositional homology can be counted.

The other three are in the subcharacters at positions 16, 17, and 18, the only subcharacters with

substitutions in this tree alignment. Given that both inner nodes have a reconstructed ‘t’ for each of these

three subcharacters, the observed shared similarity in each of the three nucleotides ‘t’ near the end of the

observed sequences of B and D is homologous: their presence can be explained by common ancestry. The

similarity of three nucleotides ‘c’ near the end of the observed sequences of A and C, on the other hand,

cannot be explained by common ancestry. In total, this amounts to three independent explained points of

similarity in these subcharacters: one in position 16, one in position 17, and one in position 18. (Applying the

formula for any of these three subcharacters: 4 observed nucleotides minus 1 minus 2 substitutions).

This result can be contrasted with tree alignment D1T2A1 for that same dataset. Just as D1T1A1, tree

alignment D1T2A1 has 21 positions and exactly one subcharacter at every position. But in total, there are

only three substitutions: one in the subcharacter at position 16, one in the subcharacter at position 17, and

one in the subcharacter at position 18. This amounts to the following grand total of explained similarity in

base composition: 68 – 21 – 3 = 44.

The difference with D1T1A1 is 3 (44 - 41). Given the similarity between tree alignments D1T1A1 and

D1T2A1, it is easy to verify this statement: all points of similarity in base composition that D1T1A1 can

explain, can also be explained by D1T2A1. The difference is that D1T2A1 can explain three more

similarities. These three additional explained similarities reside in the subcharacters at positions 16, 17,

and 18.

SUBSEQUENCE HOMOLOGY

Sequence homology in a tree alignment is not captured completely by just looking at nucleotide level

homology within positions. This is illustrated using dataset D3:

Dataset D3

A aaaaaa

B aaaaaa

C aaagggaaa

D aaagggaaa

Consider these two tree alignments:

Tree alignment D3T1A1

A aaa---aaa

B aaa---aaa

C aaagggaaa

D aaagggaaa

1 aaa---aaa

2 aaagggaaa

Tree alignment D3T2A1

A aaa---aaa

B aaa---aaa

C aaagggaaa

D aaagggaaa

3 aaagggaaa

4 aaagggaaa

In D3T1A1, the shared presence of subsequence ‘ggg’ in the middle of the observed sequences of C and D

can be explained by common ancestry. As can its absence in A and B. In D3T2A1, that shared presence can

still be explained by common ancestry, but its absence in A and B can no longer be so explained. The

difference in explanatory power (one less explained point of similarity in D3T2A1) is exactly matched by

the difference in number of indel events that both tree alignments require: one indel event, of

subsequence ‘ggg’, along the inner branch of tree T1 for D3T1A1; two such events in tree T2 for

D3T2A1 (one along the branch leading to A, the other along the branch leading to B).

In this simple case, the subsequence involved is identical in the two terminals where it is observed. But

that does not need to be the case. Consider dataset D4:

Dataset D4

A aaaaaa

B aaaaaa

C aaagggaaa

D aaagtgaaa

D3 and D4 are identical except for the middle position of the observed sequence of D: in D3 it is a ‘g’, in D4

a ‘t’.

Consider these two tree alignments for D4:

Tree alignment D4T1A1

A aaa---aaa

B aaa---aaa

C aaagggaaa

D aaagtgaaa

1 aaa---aaa

2 aaagggaaa

Tree alignment D4T2A1

A aaa---aaa

B aaa---aaa

C aaagggaaa

D aaagtgaaa

3 aaagggaaa

4 aaagggaaa

Even if the three middle positions in the sequences of C and D are no longer identical, the shared presence

of a subsequence at those positions can still be explained by common ancestry. I have called this

subsequence homology, a component of sequence homology that cannot be reduced to homology of

observed nucleotides within orthologous subsequences or subcharacters (De Laet 2005, p. 106; see also

De Laet 2015, pp. 551-552).

This can be illustrated by rooting tree T1 for example along the branch that leads to D (which involves the

assumption that D is a proper outgroup for this set of terminals). In that case, the absence in A and B of the

middle subsequence that is observed in C and D was inherited from their common ancestor, and it

provides a synapomorphy for A and B. One, moreover, that cannot be expressed in terms of composition

of an observed subsequence in those terminals.

In more complicated datasets and tree alignments, the subsequences that are involved in indel events

along different branches may not fully coincide. There are partially overlapping indels along different

branches in such cases. But even then the number of indel events can be used to compare this kind of

explained similarity between any two different tree alignments: whenever the first tree alignment has an

indel event that is not present in a second one, the first tree alignment has an unexplained similarity across

the branch with that indel, compared to the second one: depending on the root, either a subsequence that

was present got lost, or a subsequence that was not present before was gained. The same holds the other

way around. The net balance of indels between the two tree alignments is then a proper measure with

which their subsequence homology can be compared.
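
The comparison of D3T1A1 and D3T2A1 above can be retraced with a small sketch that counts indel events branch by branch. It makes several simplifying assumptions that are not spelled out in the text. An indel event on a branch is taken to be a maximal run of positions that have a nucleotide at one end of the branch and a gap at the other. Positions that are gaps at both ends are skipped. Tree T1 is assumed to join A with B at inner node 1 and C with D at inner node 2, and tree T2 is assumed to join A with C at inner node 3 and B with D at inner node 4 (the other four-terminal tree in which A and B are not sisters gives the same count here). All names are illustrative.

```python
def indel_events_on_branch(x, y):
    """Count runs of positions where exactly one end of a branch has a gap."""
    events, open_side = 0, None
    for a, b in zip(x, y):
        if a == "-" and b == "-":
            continue  # absent at both ends: neither an event nor a break in a run
        side = "x" if a == "-" else "y" if b == "-" else None
        if side is not None and side != open_side:
            events += 1
        open_side = side
    return events


def indel_events(alignment, edges):
    """Sum the indel events over all branches of a tree alignment."""
    return sum(indel_events_on_branch(alignment[u], alignment[v]) for u, v in edges)


T1_EDGES = [("A", "1"), ("B", "1"), ("1", "2"), ("C", "2"), ("D", "2")]
T2_EDGES = [("A", "3"), ("C", "3"), ("3", "4"), ("B", "4"), ("D", "4")]  # assumed shape of T2

D3T1A1 = {"A": "aaa---aaa", "B": "aaa---aaa", "C": "aaagggaaa",
          "D": "aaagggaaa", "1": "aaa---aaa", "2": "aaagggaaa"}
D3T2A1 = {"A": "aaa---aaa", "B": "aaa---aaa", "C": "aaagggaaa",
          "D": "aaagggaaa", "3": "aaagggaaa", "4": "aaagggaaa"}

print(indel_events(D3T1A1, T1_EDGES))  # one event, on the inner branch
print(indel_events(D3T2A1, T2_EDGES))  # two events, on the branches to A and B
```

The net balance between the two tree alignments is then simply the difference between the two totals.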

ADDING IT UP

So compositional homology in a tree alignment is directly measured by the number of nucleotides in the

observed sequences minus the number of subcharacters in the tree alignment minus the number of

substitutions within subcharacters. And the number of indel events of subsequences provides a measure

with which the amount of subsequence homology can be compared.

Adding it up, when comparing two tree alignments for a set of observed sequences, the tree alignment

with the higher amount of total (equally weighted) sequence similarity that can be explained as homology

is the one with the higher value for this aggregate number: the number of nucleotides in the observed

sequences minus the number of subcharacters in the tree alignment minus the number of substitutions

within subcharacters minus the number of indel events. This holds in general and does not depend on the

aligned length.

The aggregate number (and hence explained shared sequence similarity) is maximal for the tree(s) and tree

alignment(s) for which the sum of subcharacters, substitutions, and indel events is minimal.
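
Spelled out for the two tree alignments of dataset D3 discussed above, the bookkeeping could look as follows. The counts are filled in by hand: 30 observed nucleotides (6 + 6 + 9 + 9), one versus two indel events as described above, and nine subcharacters and no substitutions in both D3T1A1 and D3T2A1. The subcharacter and substitution counts were read off the alignments as shown and are best treated as assumptions. The class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeAlignmentCounts:
    subcharacters: int
    substitutions: int
    indel_events: int

    def to_minimize(self):
        """Sum of subcharacters, substitutions, and indel events."""
        return self.subcharacters + self.substitutions + self.indel_events

    def explained_similarity(self, observed_nucleotides):
        """Aggregate number: observed nucleotides minus the sum above."""
        return observed_nucleotides - self.to_minimize()


OBSERVED_NUCLEOTIDES_D3 = 6 + 6 + 9 + 9

# Hand counts (see the lead-in for which of these are assumptions).
candidates = {
    "D3T1A1": TreeAlignmentCounts(subcharacters=9, substitutions=0, indel_events=1),
    "D3T2A1": TreeAlignmentCounts(subcharacters=9, substitutions=0, indel_events=2),
}

for name, counts in candidates.items():
    print(name, counts.explained_similarity(OBSERVED_NUCLEOTIDES_D3), counts.to_minimize())

# The number of observed nucleotides is the same for every tree alignment of a
# dataset, so the maximal aggregate and the minimal sum pick the same one.
best_by_aggregate = max(
    candidates, key=lambda k: candidates[k].explained_similarity(OBSERVED_NUCLEOTIDES_D3))
best_by_sum = min(candidates, key=lambda k: candidates[k].to_minimize())
print(best_by_aggregate, best_by_sum)
```

With these counts the aggregate is 20 for D3T1A1 and 19 for D3T2A1, a difference of one explained point of similarity.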

POY

In POY, the substitution cost can be used to minimize substitutions. And indel events that span multiple

nucleotides can in principle be minimized by using a positive gap opening cost and a zero gap extension

cost. But POY does not provide a cost parameter to minimize subcharacters. So it may look like POY cannot

be used to maximize sequence similarity that can be explained as homology. But there is a practical

work-around.

First some background on tree alignment algorithms. Exact algorithms to find an optimal tree alignment on

a given tree are so computationally complex that heuristic approximations are unavoidable in practice. This

is where POY’s algorithms such as DO (direct optimization) and iterative-pass optimization come in. These

algorithms never consider more than three sequences at a time when reconstructing a sequence at an

inner node.

Consider for example iterative-pass optimization, an approach due to Sankoff et al. (1973). Using tree T1 as

an example, it can be described in general terms as follows. The algorithm starts out with the calculation of an initial

reconstructed sequence for the two inner nodes. Next, the algorithm enters an iterative process in which it

tries to improve these initial reconstructions by repeatedly revisiting the inner nodes. Assume that it first

revisits inner node 1. This inner node has three incident branches: the branches leading to observed

sequences A and B and the branch leading to inner node 2. Using observed sequences A and B and the

reconstructed sequence at inner node 2, a new reconstructed sequence for node 1 is calculated (the

so-called median sequence). Next it will consider node 2 and recalculate the reconstructed sequence there using the

sequences at nodes C, D, and the new reconstructed sequence at inner node 1. This is repeated until no

more improvements can be made. That is, until the cost stabilizes.
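
The loop structure of this procedure can be illustrated with a deliberately simplified toy version. It is not POY's algorithm. It assumes sequences of equal length without indels and measures cost as the number of differences along each branch. It uses one of the terminals next to each inner node as a crude initial reconstruction, and computes the median of three sequences position by position as the most frequent nucleotide (which is exact for this toy cost). The four short sequences are made up for the example, and all names are illustrative.

```python
from collections import Counter

# Tree T1 as in the description: node 1 joins A, B and node 2; node 2 joins C, D and node 1.
EDGES = [("A", "1"), ("B", "1"), ("1", "2"), ("C", "2"), ("D", "2")]
NEIGHBOURS = {"1": ("A", "B", "2"), "2": ("C", "D", "1")}


def differences(x, y):
    return sum(a != b for a, b in zip(x, y))


def median_of_three(x, y, z):
    """Most frequent symbol per position; minimizes the summed differences."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(x, y, z))


def total_cost(seqs):
    return sum(differences(seqs[u], seqs[v]) for u, v in EDGES)


def iterative_pass(terminals):
    seqs = dict(terminals)
    seqs["1"], seqs["2"] = seqs["A"], seqs["C"]  # crude initial reconstructions
    cost = total_cost(seqs)
    while True:
        for node, neighbours in NEIGHBOURS.items():
            # Revisit the inner node: recompute it from its three neighbours.
            seqs[node] = median_of_three(*(seqs[n] for n in neighbours))
        new_cost = total_cost(seqs)
        if new_cost == cost:  # the cost has stabilized
            return seqs, cost
        cost = new_cost


toy = {"A": "gatt", "B": "gacc", "C": "gacc", "D": "gacc"}  # hypothetical data
seqs, cost = iterative_pass(toy)
print(seqs["1"], seqs["2"], cost)
```

In this toy run the initial reconstructions cost 4 differences, and one round of revisits brings that down to 2, after which nothing changes. Each revisit only looks at one inner node and its three neighbours, which is the locality discussed in the text.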

Computational complexity is such that it is doable to calculate an exact median sequence during any revisit

of an inner node. Exactness here means that that median sequence is guaranteed to be optimal. Given the

input, that is: the sequences at the ends of the three incident branches. In terms of optimizing the cost,

this algorithm definitely performs better than simple DO (at the expense of longer execution time). But still

it never considers interactions across longer distances than subtrees that consist of one inner node and its

three neighbouring nodes. As a result, exact optimality can only be guaranteed for datasets up to three

terminals. Beyond that, it provides a heuristic approximation. This is a general conclusion and does not

depend on the cost parameters being used.

Back to maximizing explained similarity. It can be shown that the 3221 cost regime measures, up to a

constant, (twice) the number of subcharacters, substitutions, and indel events for (sub)trees that consist of

one inner node and three neighbouring nodes (De Laet 2005, p. 109; see also De Laet 2015, pp. 562-563).

This means that explained sequence similarity is guaranteed to be optimal when using that cost set on such

(sub)trees. For larger trees or subtrees, that guarantee of optimality no longer holds. But POY’s algorithms

don’t guarantee optimality for such larger trees or subtrees to start with. So, in practice, 3221 is the best

possible heuristic approximation for maximization of explanatory power using the algorithms that are

available in POY.
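
For the simplest possible case, a single branch between two aligned sequences, the bookkeeping behind this relationship can be checked numerically. The sketch below assumes that 3221 stands for a cost of 3 for the first position of an indel, 2 for a substitution (transversion or transition alike), and 1 for every additional position of the same indel. This reading is an assumption here and should be checked against De Laet (2005, 2015). Under that reading, the cost of the branch equals twice the sum of subcharacters, substitutions, and indel events, minus the number of nucleotides in the two sequences, which is constant for given sequences. This does not demonstrate the three-neighbour result cited above; it only shows the arithmetic on one branch. All names are illustrative.

```python
def branch_summary(x, y):
    """Return (cost, nucleotides, subcharacters, substitutions, indel events)
    for one branch between two aligned sequences, under the assumed 3221 reading."""
    nucleotides = subcharacters = substitutions = indel_events = gap_positions = 0
    open_side = None
    for a, b in zip(x, y):
        if a == "-" and b == "-":
            continue  # position absent at both ends of the branch
        subcharacters += 1  # on a single branch, every remaining position is one subcharacter
        nucleotides += (a != "-") + (b != "-")
        if a == "-" or b == "-":
            side = "x" if a == "-" else "y"
            gap_positions += 1
            if side != open_side:
                indel_events += 1
            open_side = side
        else:
            open_side = None
            substitutions += a != b
    # Assumed costs: 3 for the first position of an indel, 1 for each further one, 2 per substitution.
    cost = 3 * indel_events + 1 * (gap_positions - indel_events) + 2 * substitutions
    return cost, nucleotides, subcharacters, substitutions, indel_events


for x, y in [("aaa---aaa", "aaagggaaa"), ("aaagggaaa", "aaagtgaaa")]:
    cost, n, s, t, e = branch_summary(x, y)
    print(cost, 2 * (s + t + e) - n)
```

For both pairs the two printed numbers coincide: 5 and 5 for the branch with a three-nucleotide indel, 2 and 2 for the branch with a single substitution.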

A discussion of an analysis of dataset D1 using POY might be useful at this point, but I reckon that this post

is already way too long. So I’ll keep that for some other time. In the meantime I hope that this post may be

useful as a clarification of my position on these issues.

Best

-- Jan

REFERENCES

De Laet, J., 1997. A reconsideration of three-item analysis, the use of implied weights in cladistics, and a practical application in

Gentianaceae. Dissertation. Available at www.anagallis.be or in ResearchGate.

De Laet, J., 2005. Parsimony and the problem of inapplicables in sequence data. Pp. 81-116 in Albert, V. A. (ed.), Parsimony,

phylogeny and genomics. Oxford University Press.

De Laet, J., 2015. Parsimony analysis of unaligned sequence data: maximization of homology and minimization of homoplasy,

not minimization of operationally defined total cost or minimization of equally weighted transformations. Cladistics, 31:

550–567. doi: 10.1111/cla.12098.

Farris, J. S., 1983. The logical basis of phylogenetic analysis. Pp. 7-36 in: Platnick, N. I., Funk, V. A. (Eds.), Advances in Cladistics II.

Columbia University Press, New York.

Farris, J. S., 2008. Parsimony and explanatory power. Cladistics, 24: 825–847.

Sankoff, D., 1975. Minimal mutation trees of sequences. SIAM J. Appl. Math. 28: 35-42.

Sankoff, D., and Cedergren, R. J., 1983. Simultaneous comparison of three or more sequences related by a tree. Pp. 253-263 in

Sankoff, D., and Kruskal, J. (eds.), Time warps, string edits, and macromolecules. The theory and practice of sequence

comparison. CSLI Publications, Stanford, California (1999 reprint).

Sankoff, D., Morel, C., and Cedergren, R. J. 1973. Evolution of 5S RNA and the nonrandomness of base replacement. Nature

(New Biology) 245: 232-234.

Contribution 6: Santiago – Friday 06/05/2016

Dear Jan,

I am sorry it has taken me so long to answer. Thanks to your detailed correspondence, I was able to better

understand your ideas. Your papers in Cladistics are quite technical and I have to admit that I was having

difficulty following some of your arguments. I am also what you could call a “slow thinker” and like to take

my time to reflect on things. In February, I was doing field work in the mountain rainforests of Peru. For

almost a month, I was hiking every day looking for amphibians and squamates, living in a tent, and cooking

by the camp fire. It was a very nice setting and scenario to think calmly about your ideas. In any case, here

are my two cents on the topic.

In a nutshell, your concern arises from the fact that the assumption “indels never involve more than one

nucleotide at a time” is (i) an unrealistic biological assumption and (ii) logically irreconcilable with the anti-

superfluity principle used by Kluge and Grant.

Below, I will expand on these two points and how I think they should be considered in the light of character

independence.

An unrealistic biological assumption

I guess that nobody with some basic notions of biology will deny that “indels never involve more than one

nucleotide at a time” is false. However, you do not mention that this false assumption actually implies

other statements that are equally important with regard to their implications for phylogenetics, such as:

• Indels do not always involve more than one nucleotide at a time

• Some indels, even those affecting more than one nucleotide, are the result of transformations of one

nucleotide at a time

• Some indels affecting more than one nucleotide are the result of a combination of transformations of

one nucleotide at a time and several nucleotides at a time

Following the same logic, we can also consider false the assumptions “mutations never involve more than

one nucleotide at a time” and “transformations of morphological characters never involve more than one

character at a time”, and the same sort of statements outlined above also follows from these false

assumptions.

You criticized the arguments of Kluge and Grant for considering “indels never involve more than one

nucleotide at a time” but at the same time the same criticism can be applied to your proposal because it

assumes that “indels always involve more than one nucleotide at a time”, which arguably is also

biologically unrealistic. I see both statements as extremes of a continuum, because, as explained above,

there are all sorts of situations in between. In other words, the number of possible permutations between

single character events (e.g., indels, mutations, transformation of phenotypic character) and single events

affecting multiple characters (e.g., indels affecting more than one nucleotide at a time, mutations affecting

more than one nucleotide at a time, transformation of more than one phenotypic character at a time)

within a real dataset is absurdly large! I guess that the only biologically plausible assumption in this context

is what my host at AMNH, Darrel Frost, used to say about evolution “shit happens”, meaning that not only

many of those possibilities are evolutionarily plausible but that they may have happened, at least once,

during the ~3500 million years of life on Earth.

Logically irreconcilable with the anti-superfluity principle

You said that if properly applied, the anti-superfluity principle used by Kluge and Grant to justify their view

favors your position that “indels always involve more than one nucleotide at a time” because the

explanation with fewer transformation events should always be favored. OK, I see the beauty of that and also

that applying the same anti-superfluity principle naively takes us to the trivial solution that every difference

(not only indels) is most parsimoniously explained by a single transformation event affecting all the

characters. This would be similar, although a more general case because it would also account for

mutations, to the example you provide on page 9 of our exchange.

What I call a naïve application of the anti-superfluity principle is the idea of applying this principle in a

vacuum, without considering anything else. Phylogenetic analyses are performed within a context of

auxiliary assumptions or principles, for example we consider that evolution is hierarchical and that

characters are independent. The second consideration is most crucial to the anti-superfluity principle.

When not considered, it leads to trivial solutions. I never read any contribution by Grant and/or Kluge

defending this naïve use of the anti-superfluity principle. To the contrary, Grant & Kluge (2004) Cladistics,

20, 23-31 devote page 26 to the importance of character independence.

To avoid this type of trivial solution or naïve use of the anti-superfluity principle (i.e., where a single

transformation is invoked to explain all differences between characters) you suggest embracing

explanation of similarities computed as: (number of nucleotides in the observed sequences) – (number of

subcharacters in the tree alignment) – (number of substitutions within subcharacters) – (number of indel

events). However, you count indel events applying the assumption “indels always involve more than one

nucleotide at a time” by defining subsequence homology. So, if I understand it correctly, your approach

does not circumvent the naïve use of the anti-superfluity principle, at least when applied to indels

spanning more than one nucleotide. This also leaves unanswered the question of why one should not count

mutations the same way you are counting indels. That is, for a subsequence of nucleotides, the number of

nucleotides that are involved in mutations would be irrelevant because they can be explained by a single event

accounting for all observed mutations, so the cost = 1. For example:

A aaa---

B atcttt

C aaattt

[Figure: tree with terminals A, B, and C]

In tree (A, (B, C)) I can define the subsequence “x” (in red) involving the first three nucleotides and

postulate a single mutation event spanning two nucleotides (second and third positions of the alignment)

as an autapomorphy on the branch of terminal B, so the cost = 1, the same way that the subsequence “y”,

involving the last three positions of the alignment, is explained by a single indel event spanning three

nucleotides and having a cost = 1. As I explained above, any combination of subsequences is biologically

plausible, such as inversions or transpositions of series of nucleotides and loops of ribosomal genes.

However, you maintain that sets of nucleotides such as those of subsequence “x” should be accounted for at the

atomistic level (counting every column and transformation independently), while those including indels

(e.g., subsequence “y”) should be interpreted as single events.

In summary, your approach is inconsistent. For certain sets of data (indels) you apply a naïve anti-

superfluity principle excluding character independence to calculate similarities, while for other types

(nucleotides) you apply a sophisticated version of anti-superfluity principle considering character

independence. I do not see how your approach is logically more consistent with the principle of

explanation of similarities than that of Grant & Kluge with the principle of anti-superfluity. At worst, to me

they seem to be equally inconsistent, perhaps the difference is that Grant & Kluge are not thinking about

anti-superfluity in the absence of character independence.

Character Independence

One of the core assumptions of phylogenetics is that characters should be independent. We suspect, and

sometimes even know, that this assumption is often violated (as with the case of the assumption of

hierarchical evolution when species originate by hybridization). Clear examples of non-independent

characters are a phenotypic character and the gene that codes for it, several phenotypic characters that

evolve as a module, nucleotides that mutate simultaneously such as in ribosomal genes, and a single indel

affecting more than one nucleotide but coded as several independent indels. The issue of independence is

more complicated than this, but it suffices to leave it at this stage to argue my case.

My point is that in most cases it becomes empirically impossible to test the independence of certain sets of

characters. DNA sequences are a good example. Let’s say we have these two sequences:

ATGCCA

GTACAC

Shall we explain the differences between the first and third position of the alignment as two independent

point mutations or as a single event inversion? The same applies to positions 5 and 6 of the alignment. This

is just a simple example. It can be much more complicated but I guess it is easy to grasp that the two

options are biologically possible and non-testable (at least in this context).

What stops us from explaining everything as a single transformation event is the assumption of character

independence. Of course, if we know that two or more characters are not independent (e.g., a gene and

the phenotype coded by the gene) we should exclude one of the characters. In the absence of evidence,

we are better off keeping our assumption; otherwise we open a Pandora’s box of subjectivity to justify why

some types of character sets should be explained by a single event, while others should be considered as

several point transformations. For example, consider these sequences:

ATG---CCA

GTATTTCAC

What’s the scientific reason behind explaining the three indels as a single event spanning three

nucleotides but explaining the mutations as single independent events? I see none. So once we let

subjectivity in to sweep away character independence, what stops us from asking: why not explain all

the differences observed between the two sequences as a single event where the polymerase screwed up

and all the mutations and indels happened simultaneously?

Explanation and Similarity

Grant & Kluge (2004, 2009), Kluge (2005, 2007), and Kluge & Grant (2006) provide a detailed discussion of

why similarity is at best described but not explained; why using an object criterion of similarity is logically

incompatible with evolutionary theory (classes cannot change/evolve); why homoplasy cannot be a hypothesis

(ad hoc or not), because it explains nothing; and why transformations (the events) are the things to be explained

(the evidence) and not the objects. I have little to add to their careful discussions, except that in my

opinion there is always a gap between the conceptual and the operational, but the fact that our operations

are doomed to fail in some cases should not stop us from investigating. Operationally we work with and

code objects as delimiters of events, but that doesn’t mean that what we are trying to explain is the similarity

(or not) of the objects. A very good discussion on the distinction between the operational and the

conceptual is that of Frost & Kluge (1994) Cladistics, 10, 259-294 using species concepts and species

discovery operations. I see your focus on explaining similarities as an attempt to develop the concepts

from the methods, while in my opinion it should be the other way around.