Article

Least Common Ancestors in Trees and Directed Acyclic Graphs

Abstract

and with a naive algorithm. We compare the performance of these algorithms as a function of the number of queries performed and the imbalance of the tree. For DAGs, we compare a transitive-closure based algorithm, an intelligent traversal algorithm, and our input-sensitive algorithm that uses LCA in trees. We compare the performance of these algorithms as a function of DAG density. We show that the input-sensitive algorithm outperforms the other two algorithms on all test data.

Keywords: Least Common Ancestor (LCA), Directed Acyclic Graph (DAG), Range Minimum Query (RMQ), Shortest Path, Cartesian Tree, Genealogy, Pedigree, Data Structure.

1. INTRODUCTION

One of the fundamental algorithmic problems on trees is how to find the least common ancestor (LCA) of a given pair of nodes. The LCA of nodes u and v in a tree is the ancestor of u and v that is located farthest from the root. The
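The definition of the LCA in a tree can be illustrated with a minimal sketch of a naive, parent-pointer query. This is not the authors' implementation, which is not available here; the function name `naive_lca`, the `parent` dictionary layout, and the example tree are assumptions made for illustration only.

```python
def naive_lca(parent, u, v):
    """Return the least common ancestor of u and v in a rooted tree.

    `parent` maps every node to its parent; the root maps to None.
    A node counts as an ancestor of itself.
    """
    # Collect u together with all of its ancestors up to the root.
    ancestors = set()
    node = u
    while node is not None:
        ancestors.add(node)
        node = parent[node]
    # Climb from v; the first node already seen is the common
    # ancestor located farthest from the root.
    node = v
    while node not in ancestors:
        node = parent[node]
    return node


# Hypothetical example tree rooted at "a".
parent = {"a": None, "b": "a", "c": "a", "d": "b", "e": "b", "f": "e"}
print(naive_lca(parent, "d", "f"))
```

Each query walks at most two root paths, so its cost grows with the depth of the tree. This is one reason the imbalance of the tree matters when such a baseline is compared with preprocessing-based methods.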

... We will address these problems from different perspectives. Numerous results have been established for Questions 1 and 2 for the case |A| = 2 (Nykänen and Ukkonen 1994; Bender et al. 2005; Mathialagan et al. 2022; Czumaj et al. 2007; Grandoni et al. 2021; Bender et al. 2001; Kowaluk and Lingas 2005; Harel and Tarjan 1984), or when assuming that A = C_G(v) for a given vertex v (Nakhleh and Wang 2005). This paper is organized as follows. ...
Article
Full-text available
Rooted phylogenetic networks, or more generally, directed acyclic graphs (DAGs), are widely used to model species or gene relationships that traditional rooted trees cannot fully capture, especially in the presence of reticulate processes or horizontal gene transfers. Such networks or DAGs are typically inferred from observable data (e.g., genomic sequences of extant species), providing only an estimate of the true evolutionary history. However, these inferred DAGs are often complex and difficult to interpret. In particular, many contain vertices that do not serve as least common ancestors (LCAs) for any subset of the underlying genes or species, and thus may lack direct support from the observable data. In contrast, LCA vertices are witnessed by historical traces justifying their existence and thus represent ancestral states substantiated by the data. To reduce unnecessary complexity and eliminate unsupported vertices, we aim to simplify a DAG to retain only LCA vertices while preserving essential evolutionary information. In this paper, we characterize LCA-relevant and lca-relevant DAGs, defined as those in which every vertex serves as an LCA (or unique LCA) for some subset of taxa. We introduce methods to identify LCAs in DAGs and efficiently transform any DAG into an LCA-relevant or lca-relevant one while preserving key structural properties of the original DAG or network. This transformation is achieved using a simple operator “⊖” that mimics vertex suppression.
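To make the notion of an LCA in a DAG concrete, here is a minimal brute-force sketch. It assumes the common reading that a vertex is an LCA of a set of vertices if it is an ancestor of all of them and no proper descendant of it has that property. The function names, the `parents` adjacency layout, and the example DAG are hypothetical; this is not the method of the cited paper.

```python
def ancestors_of(parents, x):
    """Return all vertices from which x is reachable, including x itself."""
    seen = {x}
    stack = [x]
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def lcas(parents, targets):
    """Return the set of least common ancestors of a non-empty vertex set."""
    common = set.intersection(*(ancestors_of(parents, t) for t in targets))
    # Keep a common ancestor v only if no other common ancestor lies
    # strictly below it, i.e. none has v among its own ancestors.
    return {
        v for v in common
        if not any(c != v and v in ancestors_of(parents, c) for c in common)
    }


# Hypothetical DAG: x and y each have the two parents p and q,
# and p and q share the parent r.
parents = {"x": ["p", "q"], "y": ["p", "q"], "p": ["r"], "q": ["r"], "r": []}
print(sorted(lcas(parents, ["x", "y"])))
```

In this example the pair x, y has two LCAs (p and q), so it has no unique LCA. This is the distinction the abstract draws between a vertex serving as an LCA and serving as the unique LCA of a subset.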
... One easily verifies, however, that the existence of "unused colors" in M only increases the size of the species tree S (in particular, the number of leaves in S that are attached to ρ_S) but does not affect the existence of a relaxed scenario that explains G. We can employ the LCA data structure described by Bender et al. [4], which pre-processes S in O(|M|) = O(|L|) time to allow O(1)-query of the last common ancestor of pairs of vertices in S afterwards. In addition, we want to access the vertex w ∈ child_S(u) satisfying v ⪯_S w for two given vertices u, v ∈ V(T) with v ≺_S u. ...
Preprint
Full-text available
Evolutionary scenarios describing the evolution of a family of genes within a collection of species comprise the mapping of the vertices of a gene tree T to vertices and edges of a species tree S. The relative timing of the last common ancestors of two extant genes (leaves of T) and the last common ancestors of the two species (leaves of S) in which they reside is indicative of horizontal gene transfers (HGT) and ancient duplications. Orthologous gene pairs, on the other hand, require that their last common ancestors coincide with a corresponding speciation event. The relative timing information of gene and species divergences is captured by three colored graphs that have the extant genes as vertices and the species in which the genes are found as vertex colors: the equal-divergence-time (EDT) graph, the later-divergence-time (LDT) graph and the prior-divergence-time (PDT) graph, which together form an edge partition of the complete graph. Here we give a complete characterization in terms of informative and forbidden triples that can be read off the three graphs and provide a polynomial time algorithm for constructing an evolutionary scenario that explains the graphs, provided such a scenario exists. While both LDT and PDT graphs are cographs, this is not true for the EDT graph in general. We show that every EDT graph is perfect. While the information about LDT and PDT graphs is necessary to recognize EDT graphs in polynomial-time for general scenarios, this extra information can be dropped in the HGT-free case. However, recognition of EDT graphs without knowledge of putative LDT and PDT graphs is NP-complete for general scenarios. We finally connect the EDT graph to the alternative definitions of orthology that have been proposed for scenarios with horizontal gene transfer. With one exception, the corresponding graphs are shown to be colored cographs.
... A lowest common ancestor (LCA) query takes as input two nodes of a rooted tree and returns the deepest node of the tree that is an ancestor of both. Such queries can be answered in constant time after a linear-time preprocessing of the tree [9]. ...
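The reduction from LCA to range minimum queries (RMQ) behind such structures can be sketched as follows. This simplified version uses an Euler tour with a sparse table, so preprocessing takes O(n log n) time and space rather than the linear bound cited above; queries are still answered with a constant number of table lookups. The function `build_lca`, the `children` dictionary layout, and the example tree are assumptions for illustration.

```python
def build_lca(children, root):
    """Preprocess a rooted tree and return a query function lca(u, v)."""
    done = object()                      # sentinel for exhausted child iterators
    euler, depth, first = [root], [0], {root: 0}
    stack = [(root, 0, iter(children.get(root, ())))]
    while stack:                         # iterative DFS recording an Euler tour
        node, d, it = stack[-1]
        child = next(it, done)
        if child is done:
            stack.pop()
            if stack:                    # returning to the parent: record it again
                euler.append(stack[-1][0])
                depth.append(stack[-1][1])
        else:
            first[child] = len(euler)
            euler.append(child)
            depth.append(d + 1)
            stack.append((child, d + 1, iter(children.get(child, ()))))

    # Sparse table: table[k][i] is the tour index of minimum depth
    # within the window [i, i + 2**k).
    m = len(euler)
    table = [list(range(m))]
    k = 1
    while (1 << k) <= m:
        prev, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(m - (1 << k) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if depth[a] <= depth[b] else b)
        table.append(row)
        k += 1

    def lca(u, v):
        lo, hi = sorted((first[u], first[v]))
        k = (hi - lo + 1).bit_length() - 1
        a, b = table[k][lo], table[k][hi - (1 << k) + 1]
        return euler[a if depth[a] <= depth[b] else b]

    return lca


# Hypothetical example tree rooted at "a".
children = {"a": ["b", "c"], "b": ["d", "e"], "e": ["f"]}
lca = build_lca(children, "a")
print(lca("d", "f"))
```

The shallowest node visited between the first occurrences of u and v on the tour is their LCA, which is why a range minimum over depths answers the query.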
Preprint
We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary $\mathcal{D}$ in fragments of a given string T of length n. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of T. This way, $\mathcal{D}$ takes space proportional to the number of patterns $d=|\mathcal{D}|$ rather than their total length, which could be $\Theta(n\cdot d)$. In particular, we consider the following types of queries: reporting and counting all occurrences of patterns from $\mathcal{D}$ in a fragment T[i..j] and reporting distinct patterns from $\mathcal{D}$ that occur in T[i..j]. We show how to construct, in $\mathcal{O}((n+d) \log^{\mathcal{O}(1)} n)$ time, a data structure that answers each of these queries in time $\mathcal{O}(\log^{\mathcal{O}(1)} n+|\mathit{output}|)$. The case of counting patterns is much more involved and needs a combination of a locally consistent parsing with orthogonal range searching. Reporting distinct patterns, on the other hand, uses the structure of maximal repetitions in strings. Finally, we provide upper and lower bounds, tight up to subpolynomial factors, for the case of a dynamic dictionary.
... Standard data structures answer LCE queries in constant time and take linear space. The original construction algorithm [29,43,20] works in linear time for constant alphabets only, but it has been subsequently generalized to larger integer alphabets [12] and simplified substantially [24,6]. Thus, LCE queries are completely resolved in the classic setting where the text T is stored in O(n) space. ...
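For reference, a longest common extension (LCE) query on a text T asks for the length of the longest common prefix of the suffixes starting at two given positions. The sketch below is a naive baseline that scans character by character, not one of the constant-time data structures mentioned above; the function name `naive_lce` and the example text are assumptions for illustration.

```python
def naive_lce(text, i, j):
    """Return the length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    length = 0
    # Advance while both positions stay inside the text and the characters match.
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


text = "abracadabra"
print(naive_lce(text, 0, 7))
```

A single query of this kind can take time linear in the length of the text, which is the cost the preprocessing-based structures avoid.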
Preprint
Full-text available
The Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text T of length n, permutes its symbols according to the lexicographic order of suffixes of T. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length n occupying $O(n/\log n)$ machine words of space, the BWT construction algorithm due to Hon et al. (FOCS 2003) runs in $O(n)$ time and $O(n/\log n)$ space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require $\Omega(n)$ time. In this paper, we propose the first algorithm that breaks the $O(n)$-time barrier for BWT construction. Given a binary string of length n, our procedure builds the Burrows-Wheeler transform in $O(n/\sqrt{\log n})$ time and $O(n/\log n)$ space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art $O(m\sqrt{\log m})$-time solution by Chan and Pătrașcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets. As one of the applications, we show a data structure of the optimal size $O(n/\log n)$ that answers longest common extension queries in $O(1)$ time and, furthermore, can be deterministically constructed in the optimal $O(n/\log n)$ time. This significantly improves upon the previously best data structures and essentially closes the LCE problem.
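The definition of the BWT can be made concrete with a deliberately naive sketch that sorts all suffixes explicitly. It is far slower than any of the algorithms discussed in the abstract and only illustrates what the transform computes. The sentinel character "$" (assumed to be absent from the text and smaller than every text symbol) and the function names are assumptions.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort the suffixes, then take the symbol preceding each one."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    # Suffix array by direct comparison of suffixes (quadratic or worse).
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # t[i - 1] wraps around to the sentinel when i == 0.
    return "".join(t[i - 1] for i in suffix_array)


def inverse_bwt(transformed, sentinel="$"):
    """Naive inversion: repeatedly prepend the BWT column and re-sort."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row ending with the sentinel is the original text plus sentinel.
    return next(row for row in table if row.endswith(sentinel))[:-1]


print(bwt("banana"))
print(inverse_bwt(bwt("banana")))
```

The round trip through `inverse_bwt` illustrates the invertibility stated in the first sentence of the abstract.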
Preprint
Full-text available
Rooted phylogenetic networks, or more generally, directed acyclic graphs (DAGs), are widely used to model species or gene relationships that traditional rooted trees cannot fully capture, especially in the presence of reticulate processes or horizontal gene transfers. Such networks or DAGs are typically inferred from genomic data of extant taxa, providing only an estimate of the true evolutionary history. However, these inferred DAGs are often complex and difficult to interpret. In particular, many contain vertices that do not serve as least common ancestors (LCAs) for any subset of the underlying genes or species, thus lacking direct support from the observed data. In contrast, LCA vertices represent ancestral states substantiated by the data, offering important insights into evolutionary relationships among subsets of taxa. To reduce unnecessary complexity and eliminate unsupported vertices, we aim to simplify a DAG to retain only LCA vertices while preserving essential evolutionary information. In this paper, we characterize LCA-relevant and lca-relevant DAGs, defined as those in which every vertex serves as an LCA (or unique LCA) for some subset of taxa. We introduce methods to identify LCAs in DAGs and efficiently transform any DAG into an LCA-relevant or lca-relevant one while preserving key structural properties of the original DAG or network. This transformation is achieved using a simple operator “⊖” that mimics vertex suppression.
Preprint
Full-text available
We investigate the connections between clusters and least common ancestors (LCAs) in directed acyclic graphs (DAGs). We focus on the class of DAGs having unique least common ancestors for certain subsets of their minimal elements since these are of interest, particularly as models of phylogenetic networks. Here, we use the close connection between the canonical k-ary transit function and the closure function on a set system to show that pre-k-ary clustering systems are exactly those that derive from a class of DAGs with unique LCAs. We show that k-ary T-systems and k-weak hierarchies are associated with DAGs that satisfy stronger conditions on the existence of unique LCAs for sets of size at most k. Moreover, we introduce LCA-graphs as unique graphs derived from arbitrary DAGs that are sufficient to study their LCAs.
Article
We present a new and simple algorithm to reconstruct suffix links in suffix trees and suffix arrays. The algorithm is based on observations regarding suffix tree construction algorithms. With our algorithm we bring suffix arrays even closer to the ease of use and implementation of suffix trees.