Genome Biology 2008, 9:235
Large-scale assignment of orthology: back to phylogenetics?
Bioinformatics and Genomics Program, Center for Genomic Regulation, Doctor Aiguader, 88, 08003 Barcelona, Spain.
Abstract
Reliable orthology prediction is central to comparative genomics. Although orthology is defined
by phylogenetic criteria, most automated prediction methods are based on pairwise sequence
comparisons. Recently, automated phylogeny-based orthology prediction has emerged as a
feasible alternative for genome-wide studies.
Published: 30 October 2008
Genome Biology 2008, 9:235 (doi:10.1186/gb-2008-9-10-235)
The electronic version of this article is the complete one and can be
found online at http://genomebiology.com/2008/9/10/235
© 2008 BioMed Central Ltd
Homologous sequences - that is, those derived from a common
ancestral sequence - can be further divided into two different
classes according to the mode in which they diverged from
their last common ancestor . The divergence of two
homologous sequences by a speciation event gives rise to
orthologous sequences, whereas a duplication event will
define a paralogous relationship between the duplicates.
Although such straightforward definitions could suggest that
distinguishing paralogs and orthologs is simple, it is
definitely not. For example, it is not unusual for multiple
lineage-specific gene loss or duplication events, as well as
other evolutionary processes, to result in intricate scenarios
that are difficult to interpret. Far from being a simple
curiosity, the establishment of correct orthology and
paralogy relationships is crucial in many biological studies.
For instance, phylogenetic analyses that aim to infer correct
evolutionary relationships between several species should be
based on orthologous sets of sequences . Moreover, as
orthologs are, relative to paralogs, more likely to share a
common function, the correct determination of orthology
has deep implications for the transfer of functional informa-
tion across organisms . Finally, the establishment of
equivalences among genes in different genomes is a pre-
requisite for comparative analyses of genome-wide data to
detect evolutionarily conserved traits [4,5].
Originally defined on an evolutionary basis, orthology
relationships are best established through phylogenetic
analysis. This usually involves the reconstruction of a phylo-
genetic tree describing the evolutionary relationships among
the sequences and species involved, so that speciation and
duplication events can then be mapped on the nodes of the
tree. This is the classical procedure for establishing
orthology relationships. However, the availability of fully
sequenced genomes has created the need to detect orthology at a
genomic scale, a task for which the mostly manual
phylogeny-based approach is not suited. Automated
approaches were soon developed that inferred orthology
relationships from pairwise sequence comparisons. Although
these methods perform reasonably well, they have many
drawbacks that can lead to annotation errors or misinter-
pretation of data [6,7]. To avoid such pitfalls, and in an
attempt to approximate the classical approach for detecting
orthology, several automatic methods have been proposed
that delineate orthology relationships from phylogenetic
trees. Despite the greater accuracy of such methods com-
pared with pairwise approaches, the large demands of time
and computing power needed to generate reliable trees have
limited their use to datasets of moderate size. Recently,
however, the combination of automated large-scale phylo-
genetic reconstruction with newer algorithms is paving the
way for the use of phylogeny-based methods for orthology
detection at genomic scales [8,9]. This progress is likely to
have a deep impact on future comparative studies.
Homology, orthology and paralogy
Homology is defined as the relationship that exists between
two biological entities - for example, two sequences or two
anatomic characters - that are derived from a common
ancestor. In 1970, Walter Fitch introduced the concepts of
orthology and paralogy to distinguish two types of homology
relationships between biological sequences . Orthologous
sequences are those that derive by a speciation event from
their common ancestor, whereas the origin of paralogous
sequences can be traced back to a gene-duplication event.
Despite this clear definition, orthology and paralogy are
often misinterpreted by biologists. This is partly because
what may seem simple when comparing pairs of closely related
species easily becomes complicated when wider groups of
distantly related species are involved. It is sometimes
wrongly claimed, for example, that only two sequences
from the same species can be regarded as paralogs, or that
two sequences from different species are orthologous to each
other only if they perform the same biological function. I will
briefly summarize here the main misunderstandings that
can arise when dealing with properties of orthologous
sequences (see  for a more thorough discussion), which
are key to understanding why some of the methods
discussed later would be more appropriate than others.
The first clarification is that orthology is a purely evolution-
ary concept, certainly related to, but not based on, the func-
tionality of the sequences involved. All homologous proteins
have a common ancestry and thus are expected to have
similar three-dimensional structures and to perform related
functions. But changes in functionality within a homologous
family of proteins caused by sequence variation or context-
dependency are not rare . This is especially true in the
case of paralogs, because processes of neo- or subfunctionali-
zation may favor the retention of duplicate genes . Ortholo-
gous sequences derived by speciation are, therefore, less prone
to functional shifts but are definitely not free from them.
A second important point to note is that the orthology or
paralogy relationship between two genes will extend to their
descendants as they disperse by further speciation or
duplication events. Thus, groups of orthologs, and not just
pairs, may more adequately represent the ancestral relation-
ships of the genes in a set of organisms. An important
corollary of its definition is that orthology, in contrast to
homology, is not transitive. If a gene A is orthologous to B
and B to C, A and C are not necessarily orthologous to each
other. For instance, if A and C are related by a duplication
event, they will be paralogous to each other while both being
co-orthologous to B. This is best explained with a graphical
example (Figure 1). The human tumor suppressor protein
p53 belongs to a wider family of proteins that also includes
p73 and p73L. The tree shown in Figure 1 depicts the
evolutionary relationships among several metazoan members
of the family, ranging from insects to mammals. As can be
inferred from the tree, several duplications (nodes marked
with gray circles) occurred at different periods. Most significantly,
two consecutive duplications at the base of the vertebrates
gave rise to three sister groups (shaded regions in
the tree) that correspond to the p53, p73 and p73L subfamilies.
Human p53 can be considered orthologous to the
sequences in other vertebrates that cluster within the same
shaded region, because they all derive by speciation
events. Paralogous relationships can be drawn between human
p53 and human p73 and p73L, because their common ances-
tral node always corresponds to a duplication node. The
same reasoning can be used to infer paralogous relationships
between any sequence within the p53 subfamily and those in
the p73 and p73L subfamilies, even though they might not
be encoded in the same genome, such as human p53 and
mouse p73L. The only criterion for marking them as paralogs is
the fact that they derive from the duplication of an ancestral
gene. Human p53 is also orthologous to each of the two Ciona
intestinalis sequences, because they diverged at a
speciation node (marked with an arrow). Note that this is the
only node that is important in defining their orthology
relationship, and we do not consider the fact that, subse-
quent to that speciation, both lineages experienced duplica-
tion events. These later duplication events are, however,
important to define other proteins at the same orthology
level. In fact, human p53, p73 and p73L all are orthologous
to any of the sequences in C. intestinalis because they
diverged at the same speciation node. To accurately define
the orthology relationships between human and C. intestinalis
members of this family one should say that human p53,
p73L and p73 are all co-orthologous to the two C. intestinalis sequences.
Yet another complication in defining orthology relationships
among proteins is that they often comprise distinct domains
that may have followed different evolutionary histories .
Such evolutionary chimeras can be created by fusion and
recombination events between different genes and may lead
to situations in which, for example, a single member of a
given protein family has recently acquired a new domain
through recombination with another family. In such cases
the different domains should in principle be treated as
independent evolutionary units and orthology relationships
be delineated accordingly. Thus, in multidomain families,
orthology relationships should be first established among
core domains and then extended, where possible, to adjacent
Pairwise methods for orthology inference
The need to compare sets of genomic sequences has prompted
the development of several automatic methods that infer
orthology relationships from pairwise sequence comparisons.
The first, and still most widely used, method for automatically
establishing orthology relationships is based on
the detection of best bi-directional hits (BBH), also
known as best reciprocal hits (BRH): pairs of sequences from
different species that are, reciprocally, each other's best hit
in a sequence search
 (Figure 2a). This operational definition of orthology is
fairly adequate when comparing two closely related genomes.
At larger evolutionary distances, however, the scenario
becomes more complicated. By definition, the BBH approach
can only account for one-to-one orthology relationships.
Therefore, if gene duplications have taken place in either of the
two compared lineages after their divergence, a one-to-many
or a many-to-many relationship will be necessary to properly
describe their orthology relationships. In such cases the
BBH approach will miss many true orthologs.
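The limitation described above can be illustrated with a minimal sketch of BBH detection. The input layout (dictionaries mapping query-subject pairs to a similarity score such as a BLAST bit score) and the function names are assumptions made for illustration only; they do not correspond to any particular tool.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score,
    for example a BLAST bit score (hypothetical input layout).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return the pairs (a, b) that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy example: the ortholog of a1 was duplicated in species B after the
# two species diverged, giving b1 and b2 (both are co-orthologs of a1).
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 180}
b_vs_a = {("b1", "a1"): 200, ("b2", "a1"): 180}
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy case only the pair (a1, b1) is reported. The second co-ortholog, b2, is missed because a1 can have only one best hit, which is the one-to-one restriction discussed in the text.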
To avoid these pitfalls and extend the procedure to multiple
genome comparisons, Tatusov and colleagues introduced the
concept of clusters of orthologous groups (COGs) 
(Figure 2b). COGs are derived from the search for
‘triangular’ BBH relationships across a minimum of three
species, and their subsequent combination into larger
groups. This strategy has been followed by many groups and
is the operational definition of orthology used by many
databases such as EGO  and STRING .
Other extensions of the BBH approach include recent
implementations such as Inparanoid  (Figure 2c) or
OrthoMCL , which achieve higher sensitivity through
sequence-clustering techniques that consider a range of BLAST
scores beyond the absolute best hits. For instance, Inparanoid
predicts paralogs resulting from lineage-specific duplications,
which it calls ‘in-paralogs’, by including intraspecific BLAST
hits that are reciprocally better than between-species BLAST
hits. To a certain extent, therefore, Inparanoid can capture one-to-many
and many-to-many relationships. Its limitation is that it
is designed for comparing pairs of genomes only. OrthoMCL
expands the procedure to comparisons of multiple genomes. It
first uses a similar strategy to Inparanoid to define orthologous
relationships between each pair of genomes. The comparisons
of all possible pairs of genomes are represented as a graph in
which the nodes represent genes and the edges represent
Figure 1
p53 phylogeny. Phylogenetic tree representing the evolutionary relationships among p53 and related proteins. Sequences were obtained from the p53
tree at PhylomeDB  (entry code Hsa0012331). After selecting a group of representative sequences, a maximum likelihood tree was reconstructed
using the same parameters used for the JTT tree in PhylomeDB. Shaded boxes indicate vertebrate members of the p53, p73 and p73L subfamilies.
Duplication nodes are marked with a gray circle. The arrow indicates the speciation node that marks the bifurcation between urochordates and
vertebrates.
orthology relationships. A Markov clustering algorithm (MCL)
is then applied. In brief, OrthoMCL simulates random walks on
the graph of orthology predictions to determine the transition
probabilities among the nodes, that is, the probabilities that
two nodes are connected in a random walk. The graph is partitioned
into different orthologous groups on the basis of these
probabilities.
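The random-walk idea behind Markov clustering can be sketched in a few lines. This is only a toy version of the generic MCL procedure (alternating expansion and inflation of a column-stochastic matrix), written in plain Python for illustration; the inflation value, iteration count, self-loop weight and cluster read-out are assumptions, and the sketch does not reproduce the score normalization or other steps of OrthoMCL itself.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (transition probabilities)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix, i.e. take two random-walk steps."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise entries to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, iterations=50):
    """Cluster a weighted graph given as a symmetric adjacency matrix."""
    n = len(weights)
    # Self-loops (weight 1.0, an arbitrary choice) are added before normalizing.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that keep non-negligible mass act as attractors; the columns
    # they attract form one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph: two triangles (genes 0-2 and 3-5) joined by one weak edge (2-3).
weights = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
groups = mcl(weights)
print(groups)
```

Inflation strengthens strong flows and weakens weak ones, so the weak edge between the two triangles carries progressively less probability and the graph separates into two groups. Higher inflation values generally give finer clusters.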
Yet another type of method, which cannot strictly be
considered pairwise-based but does not explicitly
build phylogenetic trees to define orthology, aims to refine
previously defined COGs. Generally, these methods organize
clusters of orthologous genes into a hierarchical structure by
using some evolutionary information. For instance, COCO-CL
subdivides a given orthologous group on the basis of the
correlation coefficient between its sequences, as inferred
from a multiple sequence alignment . In contrast,
OrthoDB uses the information regarding the species to
which a given sequence belongs, to organize an orthologous
group in a hierarchy that is guided by the species tree .
Figure 2
Orthology prediction methods. (a-c) Pairwise-based and (d,e) phylogeny-based methods. Circles of different colors indicate proteins encoded in genomes
from different species. Black arrows represent reciprocal BLAST hits. Proteins within dashed ovals are predicted by the method to belong to the same
orthologous group. (a) Best bi-directional hit (BBH). All pairs of proteins with reciprocal best hits are considered orthologs. Note that this method is
unable to predict the orthology with the yellow protein 2. (b) COG-like approach. Proteins in the nodes of triangular networks of BBHs are considered
as orthologs (for example, green, red and yellow protein 1 in the example). New proteins are added to the orthologous group if they are present in BBH
triangles that share an edge with a given cluster; for example, the gray protein will be added to the orthologous group because it forms a BBH triangle
with the red and green proteins. Note that a BBH link with yellow protein 1 is not required. The COG-like approach can add additional proteins from
the same genome if they are more similar to each other than to proteins in other genomes, or if they form BBH triangles with members of the cluster.
This is not the case for yellow protein 2, which is, again, misclassified. (c) Inparanoid approach. This is similar to (a), but other proteins within a
proteome (yellow protein 2 in this example) are included as ‘in-paralogs’ if they are more similar to each other than to their corresponding hits in the
other species. (d) Tree-reconciliation phylogenetic approach. Duplication nodes (marked with a D) are defined by comparing the gene tree (small tree at
the top) with the species tree (small tree at the bottom) to derive a reconciled tree (big tree on the right) in which the minimal number of duplication
and gene loss (dashed lines) events necessary to explain the gene tree are included. In this case, both the yellow proteins are included in the orthologous
group but the red and gray proteins are excluded. (e) Species-overlap phylogenetic approach. All proteins that derive from a common ancestor by
speciation are considered members of the same orthologous group. Duplication nodes are detected when they define partitions with at least one shared
species. A one-to-many orthology relationship emerges because of a recent duplication in the lineage leading to the yellow proteome.
Phylogeny-based orthology inference in tree reconciliation
In the classical procedure for determining orthology
relationships a phylogenetic tree is constructed from an
alignment of homologous sequences and subsequently compared
with a species tree. This comparison allows the geneticist
to infer the events of gene loss and duplication that have
occurred during the evolution of the sequence family
considered. The first strategy for inferring such relationships
automatically was proposed by Goodman and colleagues
, who developed an algorithm for fitting a given gene tree
to its corresponding species tree and inferring the minimum
set of duplications needed to explain the data. This problem
came to be known as ‘tree reconciliation’ (Figure 2d), and
several other algorithms have been implemented that solve
it efficiently [22-24]. These tree-based algorithms for
orthology detection are very intuitive, as they simply imple-
ment automatically what an expert would do manually and,
provided that correct species and gene trees are given, the
algorithm will infer the correct orthology relationships. A
number of databases have been developed that use such
algorithms to derive orthology relationships from auto-
matically reconstructed trees [25-27].
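A minimal sketch of the core of tree reconciliation, assuming a rooted binary gene tree and a fully resolved species tree, is the so-called LCA mapping: each gene-tree node is mapped to the last common ancestor, in the species tree, of the species found below it, and a node is called a duplication when it maps to the same species-tree node as one of its children. The nested-tuple tree encoding, the `species|gene` leaf naming and the function names are assumptions for illustration; gene-loss counting is omitted.

```python
def species_parents(tree, parent=None, parents=None):
    """Record the parent of every node of a nested-tuple species tree."""
    if parents is None:
        parents = {}
    parents[tree] = parent
    if isinstance(tree, tuple):
        for child in tree:
            species_parents(child, tree, parents)
    return parents


def lca(a, b, parents):
    """Last common ancestor of two species-tree nodes."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parents[node]
    node = b
    while node not in ancestors:
        node = parents[node]
    return node


def reconcile(gene_tree, parents, events):
    """Map a gene-tree node to the species tree and record its event type."""
    if isinstance(gene_tree, str):
        return gene_tree.split("|")[0]  # leaf: "species|gene" (assumed format)
    left, right = gene_tree
    m_left = reconcile(left, parents, events)
    m_right = reconcile(right, parents, events)
    mapped = lca(m_left, m_right, parents)
    # Duplication if the node maps to the same place as one of its children.
    kind = "duplication" if mapped in (m_left, m_right) else "speciation"
    events.append((gene_tree, kind))
    return mapped


species_tree = (("human", "mouse"), "ciona")
gene_tree = ((("human|p53", "mouse|p53"), ("human|p73", "mouse|p73")),
             "ciona|p53like")
events = []
reconcile(gene_tree, species_parents(species_tree), events)
for node, kind in events:
    print(kind, node)
```

In this toy family the node joining the p53 and p73 clades is labeled a duplication, whereas the root and the two human-mouse nodes are speciations. As the text notes, the result is only as good as the two input trees.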
The main limitation of the tree-reconciliation method is that
for many scenarios the species tree is not known with
confidence. Moreover, it has been shown that another
assumption of the tree-reconciliation problem, the correct-
ness of the gene tree, is frequently violated . In such
cases, erroneous gene trees will inevitably lead to incorrect
orthology and paralogy assignments and the inference of
many extraneous duplications and gene losses. As a result,
these methods are very sensitive to slight variations in the
topology or the rooting of the gene tree and, when applied at
a large scale, they perform similarly to or even worse than
standard pairwise methods  and need manual curation
. Even if the gene tree is correctly reconstructed, it may
not conform to the species tree in cases where horizontal
gene transfer events have occurred. Such gene trees are hard
to reconcile with the species tree, and their reconciliation often
implies spurious events of massive gene loss.
One possible solution to cope with the existing ambiguity in
gene and species trees is to account for this uncertainty
during the process of tree reconciliation. Some approaches
consider the uncertainty of the different nodes of the gene
tree as inferred from their bootstrap, or equivalent, values,
and weight the gene loss and duplication events accordingly
[31,32]. Another approach that tackles the uncertainty of
both the gene and the species tree was recently proposed by
the group of David Liberles . This algorithm, called ‘soft
parsimony’, modifies uncertain or poorly supported branches
by minimizing the number of gene duplication and loss
events implied by the tree. It starts by generating all possible
rooted trees that can be derived from a given gene tree. Then
the edges that have a support value under a given threshold
are collapsed. Each tree is subsequently reconciled with the
species tree, which can include multifurcations at unresolved
nodes, and the number of duplications is computed. If more
than one tree minimizes the necessary duplications, these
are compared in terms of the number of gene losses implied.
Finally, the collapsed nodes are reconstituted.
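One step of the procedure above, the collapsing of poorly supported edges, can be sketched as follows. The tree encoding (internal nodes as `(support, children)` pairs, leaves as strings), the threshold value and the function name are assumptions for illustration; the rooting, reconciliation and node-reconstitution steps of soft parsimony are not shown.

```python
def collapse(node, threshold):
    """Dissolve internal nodes whose support is below `threshold`.

    The children of a dissolved node are attached directly to its parent,
    turning the poorly supported bifurcation into a multifurcation.
    The support value of the root itself is ignored.
    """
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = collapse(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            new_children.extend(child[1])  # weak edge: lift grandchildren
        else:
            new_children.append(child)
    return (support, new_children)


# Toy gene tree: the clade (a, b) has bootstrap support 40, (c, d) has 95.
tree = (100, [(40, ["a", "b"]), (95, ["c", "d"])])
collapsed = collapse(tree, threshold=70)
print(collapsed)
```

After collapsing, the weakly supported grouping of a and b no longer constrains the reconciliation, while the well-supported (c, d) clade is kept.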
Soft parsimony is able to resolve the most obvious errors
arising from tree reconciliation, which normally imply a
multitude of gene losses and duplications. It also allows the
use of species trees with unresolved nodes, which usually
better represent what we really know about relationships
within most phylogenetic groups. Nevertheless, these algor-
ithms still need a certain level of resolution in the species
trees and have a number of underlying assumptions that
should be taken into account. For instance, the scenario with
the minimal number of losses and gene duplications is not
necessarily the real one, as losses and duplications can be
rampant in some cases . Furthermore, the number of
iterations and tree-reconciliation steps that these methods
involve may limit their use in large-scale datasets.
Species-overlap methods
Yet another way out of the problem of ambiguity in species
and gene trees is to consider the gene tree topology in a very
relaxed way and minimize the need to know the true evolu-
tionary relationships of species. This approach is followed in
recent algorithms that are based on the level of overlap
between the species encountered within a tree. Basically,
these algorithms examine the level of overlap in the species
connected to two related nodes to decide whether their
parental node represents a duplication or speciation event
(Figure 2e). They assume that a node represents a duplication
event if it is ancestral to two tree-partitions that contain sets
of species that overlap to some degree. Conversely, if the two
partitions contain sets of species that are mutually exclusive,
the node is considered to represent a speciation event. The
only evolutionary information that such algorithms require is
that needed to root the tree so that a polarity (ancestors to
descendants) between the internal nodes is defined.
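The species-overlap rule lends itself to a very short sketch. The version below assumes a rooted binary gene tree encoded as nested tuples with leaves named `species|gene`, and treats any shared species as overlap; these conventions and the function names are illustrative assumptions, not the implementation used in any of the cited resources.

```python
def species_of(leaf):
    """Species prefix of a leaf named "species|gene" (assumed format)."""
    return leaf.split("|")[0]


def orthologs(tree):
    """Pairs of leaves whose last common ancestor is a speciation node."""
    pairs = []

    def walk(node):
        if isinstance(node, str):
            return [node], {species_of(node)}
        left_leaves, left_species = walk(node[0])
        right_leaves, right_species = walk(node[1])
        # No shared species between the two partitions: speciation node,
        # so every cross pair of leaves is orthologous.
        if not (left_species & right_species):
            pairs.extend((a, b) for a in left_leaves for b in right_leaves)
        return left_leaves + right_leaves, left_species | right_species

    walk(tree)
    return pairs


# Toy version of the p53 family: independent duplications in the Ciona
# lineage and at the base of the vertebrates.
tree = (("ciona|a", "ciona|b"),
        (("human|p53", "mouse|p53"), ("human|p73", "mouse|p73")))
pairs = orthologs(tree)
print(len(pairs))
```

In this toy tree human p53 is orthologous to mouse p53 and to both Ciona sequences, but paralogous to human and mouse p73, which mirrors the relationships described for Figure 1. No species tree is needed, only the rooting of the gene tree.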
One such algorithm has been used in the prediction of all
orthology and paralogy relationships for all human genes
and their homologs in 38 other eukaryotic species . The
reasons for using this type of algorithm were its speed and the
high degree of topological diversity observed in the human
phylome, something that would have resulted in many
wrong assignments if a reconciliation algorithm had been
used. This orthology-prediction methodology is now imple-
mented in all phylomes deposited at PhylomeDB . Van
der Heijden and colleagues implemented a species-overlap
algorithm in a program called LOFT (Levels of Orthology
From Trees) . Besides predicting orthology relationships
between genes in a phylogenetic tree, LOFT assigns a
hierarchy to the orthology relationships. Similar to the
Enzyme Commission (EC) numbers, each gene of a family is
given a code that indicates its level within the orthology
hierarchy. In this way orthologous groups can be defined at
different levels and the orthology and paralogy relationships
can be readily inferred from the code.
In conclusion, the prediction of orthology, rather than just
homology, relationships among genes in sequenced genomes
is a necessary task that often needs to be performed in an
automated way. Most automatic strategies for deriving such
orthology relationships still use rough approximations that
are far from the original definition of orthology.
Nowadays, however, the increasing speed at which computer
programs can generate phylogenetic trees, as well as the
availability of new algorithms, makes it possible to
predict orthology by mapping the speciation and
duplication events on a tree, thus following the formal
definition of orthology. It is likely that this strategy will soon
become the most commonly used in genome-wide searches
for orthology. The expected increase in the accuracy of the
predicted relationships will result in a higher reliability of
transfer of information across species. Recent analyses show
that phylogeny-based methods are less prone to error than
similarity-based approaches. The same analyses show,
however, that there is still room for improvement and that
future algorithms will need to take into account the inherent
topological variability that is expected in any genome-wide
Acknowledgements
This work was partly funded by grants from the Spanish Ministries of Health
(FIS06-213) and Science and Innovation (GEN2006-27784-E/PAT) to TG.
References
1. Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19:99-113.
2. Moreira D, Philippe H: Molecular phylogeny: pitfalls and progress. Int Microbiol 2000, 3:9-16.
3. Gabaldón T: Evolution of proteins and proteomes, a phylogenetics approach. Evol Bioinf Online 2005, 1:51-56.
4. Gabaldón T, Huynen MA: Prediction of protein function and pathways in the genome era. Cell Mol Life Sci 2004, 61:930-944.
5. Huynen MA, Gabaldón T, Snel B: Variation and evolution of biomolecular systems: searching for functional relevance. FEBS Lett 2005, 579:1839-1845.
6. Eisen JA: Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 1998, 8:163-167.
7. Koonin EV: Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 2005, 39:309-338.
8. Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldón T: The human phylome. Genome Biol 2007, 8:R109.
9. Wapinski I, Pfeffer A, Friedman N, Regev A: Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics 2007, 23:i549-i558.
10. Thornton JW, DeSalle R: Gene family evolution and homology: genomics meets phylogenetics. Annu Rev Genomics Hum Genet 2000, 1:41-73.
11. Roth C, Rastogi S, Arvestad L, Dittmar K, Light S, Ekman D, Liberles DA: Evolution after gene duplication: models, mechanisms, sequences, systems, and organisms. J Exp Zool B Mol Dev Evol 2007, 308:58-73.
12. Doolittle RF: The multiplicity of domains in proteins. Annu Rev Biochem 1995, 64:287-314.
13. Huynen MA, Bork P: Measuring genome evolution. Proc Natl Acad Sci USA 1998, 95:5849-5856.
Tatusov RL, Koonin EV, Lipman DJ: A A g ge en no om mi ic c p pe er rs sp pe ec ct ti iv ve e o on n
p pr ro ot te ei in n f fa am mi il li ie es s. . Science 1997, 2 27 78 8: :631-637.
Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B,
Cheung F, Antonescu V, White J, Holt I, Liang F, Quackenbush J:
C Cr ro os ss s- -r re ef fe er re en nc ci in ng g e eu uk ka ar ry yo ot ti ic c g ge en no om me es s: : T TI IG GR R O Or rt th ho ol lo og go ou us s G Ge en ne e
A Al li ig gn nm me en nt ts s ( (T TO OG GA A) ). . Genome Res 2002, 1 12 2: :493-502.
von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Kruger B,
Snel B, Bork P: S ST TR RI IN NG G 7 7 - - r re ec ce en nt t d de ev ve el lo op pm me en nt ts s i in n t th he e i in nt te eg gr ra at ti io on n
a an nd d p pr re ed di ic ct ti io on n o of f p pr ro ot te ei in n i in nt te er ra ac ct ti io on ns s. . Nucleic Acids Res 2007,
3 35 5( (D Da at ta ab ba as se e i is ss su ue e) ): :D358-D362.
O’Brien KP, Remm M, Sonnhammer EL: I In np pa ar ra an no oi id d: : a a c co om mp pr re eh he en ns si iv ve e
d da at ta ab ba as se e o of f e eu uk ka ar ry yo ot ti ic c o or rt th ho ol lo og gs s. . Nucleic Acids Res 2005, 3 33 3( (D Da at ta a- -
b ba as se e i is ss su ue e) ): :D476-D480.
Li L, Stoeckert CJ Jr, Roos DS: O Or rt th ho oM MC CL L: : i id de en nt ti if fi ic ca at ti io on n o of f o or rt th ho ol lo og g
g gr ro ou up ps s f fo or r e eu uk ka ar ry yo ot ti ic c g ge en no om me es s. . Genome Res 2003, 1 13 3: :2178-2189.
Jothi R, Zotenko E, Tasneem A, Przytycka TM: C CO OC CO O- -C CL L: : h hi ie er ra ar r- -
c ch hi ic ca al l c cl lu us st te er ri in ng g o of f h ho om mo ol lo og gy y r re el la at ti io on ns s b ba as se ed d o on n e ev vo ol lu ut ti io on na ar ry y c co or rr re e- -
l la at ti io on ns s. . Bioinformatics 2006, 2 22 2: :779-788.
Kriventseva EV, Rahman N, Espinosa O, Zdobnov EM: O Or rt th ho oD DB B: : t th he e
h hi ie er ra ar rc ch hi ic ca al l c ca at ta al lo og g o of f e eu uk ka ar ry yo ot ti ic c o or rt th ho ol lo og gs s. . Nucleic Acids Res 2008,
3 36 6( (D Da at ta ab ba as se e i is ss su ue e) ): :D271-D275.
Goodman M, Czelusniak J, Moore GM, Romero-Herrera AE,
Matsuda G: F Fi it tt ti in ng g t th he e g ge en ne e l li in ne ea ag ge e i in nt to o i it ts s s sp pe ec ci ie es s l li in ne ea ag ge e, , a a p pa ar rs si i- -
m mo on ny y s st tr ra at te eg gy y i il ll lu us st tr ra at te ed d b by y c cl la ad do og gr ra am ms s c co on ns st tr ru uc ct te ed d f fr ro om m g gl lo ob bi in n
s se eq qu ue en nc ce es s. . Syst Zool 1979, 2 28 8: :132-163.
Zmasek CM, Eddy SR: A A s si im mp pl le e a al lg go or ri it th hm m t to o i in nf fe er r g ge en ne e d du up pl li ic ca at ti io on n
a an nd d s sp pe ec ci ia at ti io on n e ev ve en nt ts s o on n a a g ge en ne e t tr re ee e. . Bioinformatics 2001, 1 17 7: :821-
Page RD, Charleston MA: F Fr ro om m g ge en ne e t to o o or rg ga an ni is sm ma al l p ph hy yl lo og ge en ny y: : r re ec c- -
o on nc ci il le ed d t tr re ee es s a an nd d t th he e g ge en ne e t tr re ee e/ /s sp pe ec ci ie es s t tr re ee e p pr ro ob bl le em m. . Mol Phylo-
genet Evol 1997, 7 7: :231-240.
Dufayard JF, Duret L, Penel S, Gouy M, Rechenmann F, Perriere G:
T Tr re ee e p pa at tt te er rn n m ma at tc ch hi in ng g i in n p ph hy yl lo og ge en ne et ti ic c t tr re ee es s: : a au ut to om ma at ti ic c s se ea ar rc ch h f fo or r
o or rt th ho ol lo og gs s o or r p pa ar ra al lo og gs s i in n h ho om mo ol lo og go ou us s g ge en ne e s se eq qu ue en nc ce e d da at ta ab ba as se es s. .
Bioinformatics 2005, 2 21 1: :2596-2603.
Zmasek CM, Eddy SR: RIO: analyzing proteomes by automated
phylogenomics using resampled inference of orthologs.
BMC Bioinformatics 2002, 3:14.
Dehal PS, Boore JL: A phylogenomic gene cluster resource: the
Phylogenetically Inferred Groups (PhIGs) database. BMC Bioinformatics
2006, 7:201.
Chiu JC, Lee EK, Egan MG, Sarkar IN, Coruzzi GM, DeSalle R:
OrthologID: automation of genome-scale ortholog identification
within a parsimony framework. Bioinformatics 2006, 22:699-707.
Rasmussen MD, Kellis M: Accurate gene-tree reconstruction by
learning gene- and species-specific substitution rates across multiple
complete genomes. Genome Res 2007, 17:1932-1942.
Hulsen T, Huynen MA, de Vlieg J, Groenen PM: Benchmarking
ortholog identification methods using functional genomics data.
Genome Biol 2006, 7:R31.
Li H, Coghlan A, Ruan J, Coin LJ, Hériché JK, Osmotherly L, Li R, Liu
T, Zhang Z, Bolund L, Wong GK, Zheng W, Dehal P, Wang J, Durbin
R: TreeFam: a curated database of phylogenetic trees of animal gene
families. Nucleic Acids Res 2006, 34(Database issue):D572-D580.
Durand D, Halldorsson BV, Vernot B: A hybrid micro-macroevolutionary
approach to gene tree reconstruction. J Comput Biol 2006,
13:320-335.
Chen K, Durand D, Farach-Colton M: NOTUNG: a program for
dating gene duplications and optimizing gene family trees. J Comput
Biol 2000, 7:429-447.
Berglund-Sonnhammer AC, Steffansson P, Betts MJ, Liberles DA:
Optimal gene trees from sequences and species trees using a soft
interpretation of parsimony. J Mol Evol 2006, 63:240-250.
Gabaldón T, Huynen MA: Lineage-specific gene loss following
mitochondrial endosymbiosis and its potential for function prediction in
eukaryotes. Bioinformatics 2005, 21 Suppl 2:ii144-ii150.
Huerta-Cepas J, Bueno A, Dopazo J, Gabaldón T: PhylomeDB: a
database for genome-wide collections of gene phylogenies. Nucleic
Acids Res 2008, 36(Database issue):D491-D496.
van der Heijden RT, Snel B, van Noort V, Huynen MA: Orthology
prediction at scalable resolution by phylogenetic tree analysis. BMC
Bioinformatics 2007, 8:83.