Article

On the Theory and Computation of Evolutionary Distances

Authors:
Peter H. Sellers

Abstract

This paper gives a formal definition of the biological concept of evolutionary distance and an algorithm to compute it. For any set S of finite sequences of varying lengths this distance is a real-valued function on S × S, and it is shown to be a metric under conditions which are wide enough to include the biological application. The algorithm, introduced here, lends itself to computer programming and provides a method to compute evolutionary distance which is shorter than the other methods currently in use.


... where M(e_i, f_j) is the cost of changing the interval e_i to the interval f_j, namely q·|e_i − f_j|. Sellers (37) has shown that with this recursion rule, the distance between two subsequences (e_1, e_2, ..., e_i) and (f_1, f_2, ..., f_j) is given by ...
... time, and then consider each spike train to be a sequence of 0's and 1's, with 0's at times without spikes, and 1's at times with spikes. But, with this formalism, a shift in time of a spike corresponds to a transposition of sequence elements, an action which is not within the realm of possibilities considered in (37). ...
... The validity of the recursion (4) follows directly. The similarity of the algorithms for D_spike and D_interval suggests that they share a common fundamental basis in the theory of dynamic programming algorithms (38), which encompasses the validation of the algorithm for D_interval (37) and the validation of the algorithm for D_spike (Figure 2). ...
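To make the interval recursion above concrete, here is a minimal Python sketch of a Sellers-style dynamic program over inter-spike intervals, assuming a unit cost for inserting or deleting an interval (the cost parameterization is illustrative, not taken from the cited papers):

```python
def interval_distance(e, f, q=1.0, indel_cost=1.0):
    """D_interval-style DP: D[i][j] is the cheapest edit of the first i
    intervals of e into the first j intervals of f.  Changing interval e_i
    into f_j costs M(e_i, f_j) = q * |e_i - f_j|; inserting or deleting an
    interval costs indel_cost (an assumed parameter)."""
    n, m = len(e), len(f)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel_cost
    for j in range(1, m + 1):
        D[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + indel_cost,                        # delete e_i
                D[i][j - 1] + indel_cost,                        # insert f_j
                D[i - 1][j - 1] + q * abs(e[i - 1] - f[j - 1]),  # change e_i to f_j
            )
    return D[n][m]
```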
Preprint
We present the mathematical basis of a new approach to the analysis of temporal coding. The foundation of the approach is the construction of several families of novel distances (metrics) between neuronal impulse trains. In contrast to most previous approaches to the analysis of temporal coding, the present approach does not attempt to embed impulse trains in a vector space, and does not assume a Euclidean notion of distance. Rather, the proposed metrics formalize physiologically-based hypotheses for what aspects of the firing pattern might be stimulus-dependent, and make essential use of the point process nature of neural discharges. We show that these families of metrics endow the space of impulse trains with related but inequivalent topological structures. We show how these metrics can be used to determine whether a set of observed responses has stimulus-dependent temporal structure without a vector-space embedding. We show how multidimensional scaling can be used to assess the similarity of these metrics to Euclidean distances. For two of these families of metrics (one based on spike times and one based on spike intervals), we present highly efficient computational algorithms for calculating the distances. We illustrate these ideas by application to artificial datasets and to recordings from auditory and visual cortex.
... In the literature there exist a number of algorithms dealing with the calculation of the edit distance between two strings. The basic dynamic programming algorithm that solves the problem in O(mn) time and linear space has been invented and analyzed several times in different contexts [2][3][4][5][6][7], published between 1968 and 1975. Early on, there was an algorithm by Masek and Paterson [8], building on a technique called the "Four-Russians paradigm" [9], which computes the edit distance of two strings over a finite alphabet in time O(mn / log^2 n). ...
... The basic dynamic programming algorithm employed to solve the edit distance problem, invented in a number of different contexts [2][3][4][5][6][7], makes use of the edit graph, an (n + 1) × (m + 1) matrix (d_ij) that is computed from the recurrence: d_{0,0} = 0, and d_{i,j} = min(d_{i−1,j−1} + (0 if a_i = b_j, else 1), d_{i−1,j} + 1, d_{i,j−1} + 1) for i > 0 or j > 0. ...
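Rendered as code, the recurrence reads as follows; this is only the textbook O(mn) dynamic program sketched for reference, not any one cited implementation:

```python
def edit_distance(a, b):
    """Fill the (n+1) x (m+1) matrix d from the recurrence above:
    d[0][0] = 0, and each remaining cell is the cheapest of a
    match/substitution, a deletion, or an insertion."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete the first i symbols of a
    for j in range(1, m + 1):
        d[0][j] = j                      # insert the first j symbols of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1),
                d[i - 1][j] + 1,         # delete a_i
                d[i][j - 1] + 1,         # insert b_j
            )
    return d[n][m]
```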
Preprint
Full-text available
The problem of approximate string matching is important in many different areas such as computational biology, text processing and pattern recognition. A great effort has been made to design efficient algorithms addressing several variants of the problem, including comparison of two strings, approximate pattern identification in a string or calculation of the longest common subsequence that two strings share. We designed an output-sensitive algorithm solving the edit distance problem between two strings of lengths n and m respectively in time O((s − |n − m|)·min(m, n, s) + m + n) and linear space, where s is the edit distance between the two strings. This worst-case time bound sets the quadratic factor of the algorithm independent of the longest string length and improves existing theoretical bounds for this problem. The implementation of our algorithm excels also in practice, especially in cases where the two strings compared differ significantly in length. Source code of our algorithm is available at http://www.cs.miami.edu/~dimitris/edit_distance
... Typical approaches to biosequence comparison are either distance- [67,82] or similarity-based [55,69]. The distance-based approaches minimize the cost, while those based on similarity maximize the likelihood of transformation of one sequence into another. ...
... Example 3.14. The Sellers distance, introduced by Sellers in 1974 [67], is a metric obtained by extension of a metric d on the set Σ† = Σ ∪ {e}, the set of generators plus the identity element, to the free monoid Σ*. It is realized as ...
Preprint
We propose a general framework for converting global and local similarities between biological sequences to quasi-metrics. In contrast to previous works, our formulation allows asymmetric distances, originating from uneven weighting of strings, that may induce non-trivial partial orders on sets of biosequences. Furthermore, the ℓ^p-type distances considered are more general than traditional generalized string edit distances corresponding to the ℓ^1 case, and enable conversion of sequence similarities to distances for a much wider class of scoring schemes. Our constructions require much less restrictive gap penalties than the ones regularly used. Numerous examples are provided to illustrate the concepts introduced and their potential applications.
... For input strings of lengths n and m, this method creates an (n + 1) × (m + 1) table that is filled cell by cell using a recursive formula. Needleman and Wunsch (1970) gave the first O(n^2 m) algorithm, and Sellers (1974) and Wagner and Fischer (1974) improved this to what is now known as the O(nm) Needleman-Wunsch algorithm, building on the quadratic algorithm for longest common subsequence by Sankoff (1972). ...
... The shortest path from v_s := ⟨0, 0⟩ to v_t := ⟨n, m⟩ in the edit graph corresponds to an optimal alignment of A and B. The distance d(u, v) from u to v is the length of the shortest (minimal-cost) path from u to v, and we use distance, length, and cost interchangeably. We write g*(u) := d(v_s, u) for the distance from the start to u, h*(u) := d(u, v_t) for the distance from u to the end, and f*(u) := g*(u) + h*(u) for the minimal cost of a path through u. [Figure: the edit graph of Sellers (1974); solid edges indicate insertion/deletion/substitution edges of cost 1, while dashed edges indicate matches of cost 0.] ...
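The shortest-path formulation can be run directly; the sketch below computes the edit distance as g*(⟨n, m⟩) with Dijkstra's algorithm on the edit graph just described (a bare illustration of the definitions, not the A*PA implementation):

```python
import heapq

def edit_graph_distance(A, B):
    """Shortest path from vs = <0, 0> to vt = <n, m> in the edit graph:
    insertion/deletion/substitution edges cost 1, match edges cost 0.
    The returned value is g*(vt), i.e. the edit distance of A and B."""
    n, m = len(A), len(B)
    dist = {(0, 0): 0}
    heap = [(0, (0, 0))]
    while heap:
        g, (i, j) = heapq.heappop(heap)
        if (i, j) == (n, m):
            return g
        if g > dist[(i, j)]:
            continue                       # stale queue entry
        edges = []
        if i < n and j < m:                # diagonal: match (0) or substitution (1)
            edges.append(((i + 1, j + 1), 0 if A[i] == B[j] else 1))
        if i < n:
            edges.append(((i + 1, j), 1))  # deletion edge
        if j < m:
            edges.append(((i, j + 1), 1))  # insertion edge
        for v, w in edges:
            if g + w < dist.get(v, float("inf")):
                dist[v] = g + w
                heapq.heappush(heap, (g + w, v))
```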
Preprint
Full-text available
Methods: We introduce A*PA2, an exact global pairwise aligner with respect to edit distance. The goal of A*PA2 is to unify the near-linear runtime of A*PA on similar sequences with the efficiency of dynamic programming (DP) based methods. Like Edlib, A*PA2 uses Ukkonen's band doubling in combination with Myers' bitpacking. A*PA2 1) extends this with SIMD (single instruction, multiple data), 2) uses large block sizes inspired by Block Aligner, 3) avoids recomputation of states where possible as suggested before by Fickett, 4) introduces a new optimistic technique for traceback based on diagonal transition, and 5) applies the heuristics developed in A*PA and improves them using pre-pruning. Results: The average runtime of A*PA2 is 19× faster than the exact aligners BiWFA and Edlib on >500 kbp long ONT reads of a human genome having 6% divergence on average. On shorter ONT reads of 11% average divergence the speedup is 5.6× (avg. length 11 kbp) and 0.81× (avg. length 800 bp). On all tested datasets, A*PA2 is competitive with or faster than approximate methods. Availability: https://github.com/RagnarGrootKoerkamp/astar-pairwise-aligner Contact: ragnar.grootkoerkamp@inf.ethz.ch
... Observe that not all scoring matrices induce scoring functions that are distances, since the scoring function is not necessarily a metric. Sellers [19] described a sufficient condition for a weighted edit distance to be a metric on Σ*, i.e., proved that a scoring matrix γ satisfying it induces an optA_γ-metric on sequences. Araujo and Soares [3] presented necessary and sufficient conditions for a scoring matrix γ to induce a weighted edit distance that is a metric on Σ*. ...
... Sellers [19] showed that scoring matrices in M_C induce an optA_γ-metric on sequences. The class of scoring matrices M_A is such that, for each a, b, c ∈ Σ, ...
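The metric conditions being discussed can be checked mechanically. Below is a small Python sketch that tests a scoring matrix against the metric axioms; it assumes gamma is a dict giving a cost for every pair over Σ ∪ {−}, including a zero-cost (−, −) entry, which is a simplification of the cited formulations:

```python
from itertools import product

def induces_metric(gamma, alphabet):
    """Check the metric axioms for a scoring matrix gamma over the alphabet
    plus the gap symbol '-': non-negativity, zero exactly on identical
    symbols, symmetry, and the triangle inequality (cf. Sellers' sufficient
    condition; an illustrative check, not the M_A characterization)."""
    symbols = list(alphabet) + ["-"]
    for a, b in product(symbols, repeat=2):
        if gamma[(a, b)] < 0:
            return False                          # non-negativity
        if (gamma[(a, b)] == 0) != (a == b):
            return False                          # identity of indiscernibles
        if gamma[(a, b)] != gamma[(b, a)]:
            return False                          # symmetry
    for a, b, c in product(symbols, repeat=3):
        if gamma[(a, c)] > gamma[(a, b)] + gamma[(b, c)]:
            return False                          # triangle inequality
    return True
```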
Article
Sequence comparison is a basic task to capture similarities and differences between two or more sequences of symbols, with countless applications such as in computational biology. An alignment is a way to compare sequences, where a given scoring function determines the degree of similarity between them. Many scoring functions are obtained from scoring matrices. However, not all scoring matrices induce scoring functions which are distances, since the scoring function is not necessarily a metric. In this work we establish necessary and sufficient conditions for scoring matrices to induce each one of the properties of a metric in weighted edit distances. For a subset of scoring matrices that induce normalized edit distances, we also characterize each class of scoring matrices inducing normalized edit distances. Furthermore, we define an extended edit distance, which takes into account a set of editing operations that transforms one sequence into another regardless of the existence of a usual corresponding alignment to represent them, describing a criterion to find a sequence of edit operations whose weight is minimum. Similarly, we determine the class of scoring matrices that induces extended edit distances for each of the properties of a metric.
... Variants of the edit distance based on the idea of string alignment are prominently used in biology to compare the similarity of DNA or protein sequences (e.g. Needleman & Wunsch, 1970; Sellers, 1974). There are several types of edit distances, which differ in the operations that are allowed to edit the sequence and their weighting (for overviews, see Boytsov, 2011; Navarro, 2001; van der Loo, 2014). ...
... In contrast to partial-credit scoring, the edit distance represents an exhaustive solution to the general problem of quantifying the discrepancy between two series of items, which has driven its adoption by other fields such as biology and computer science (e.g. Damerau, 1964; Levenshtein, 1966; Needleman & Wunsch, 1970; Sellers, 1974). It is therefore unlikely that a better general solution for scoring can be found. ...
Article
Full-text available
For researchers and psychologists interested in estimating a subject's memory capacity, the current standard for scoring memory span tasks is the partial-credit method: subjects are credited with the number of stimuli that they manage to recall correctly in the correct serial position. A critical issue with this method, however, is that intrusions and omissions can radically change the scores depending on where they occur. For example, when recalling the sequence ABCDE, “ABCD” is worth 4 points but “BCDE” is worth 0 points. This paper presents an improved scoring method based on the edit distance, meaning the number of changes required to edit the recalled sequence into the target. Edit-distance scoring gives results close to partial-credit scoring, but without the corresponding vulnerability to positional shifts. A reanalysis of memory performance in two large datasets (N = 1093 and N = 758) confirms that in addition to being more logically consistent, edit-distance scoring demonstrates similar or better psychometric properties than partial-credit, with comparable validity, a small increase in reliability, and a substantial increase of test information (measurement precision in the context of item response theory). Test information was especially improved for harder items and for subjects with ability in the lower range, whose scores tend to be severely underestimated by partial-credit scoring. Code to compute edit-distance scores with various software is made available at https://osf.io/wdb83/.
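To illustrate the abstract's point in code, the sketch below scores a recall as the target length minus the edit distance (floored at zero). This is an assumed variant for illustration only; the paper's exact scoring rule is in its published code at https://osf.io/wdb83/.

```python
def span_score(recalled, target):
    """Single-row Levenshtein DP, then score = len(target) - edits, so a
    positional shift costs only the edits that caused it."""
    n, m = len(recalled), len(target)
    d = list(range(m + 1))               # distances for the empty recall prefix
    for i in range(1, n + 1):
        prev, d[0] = d[0], i
        for j in range(1, m + 1):
            prev, d[j] = d[j], min(
                prev + (recalled[i - 1] != target[j - 1]),  # substitute/match
                d[j] + 1,                                    # delete
                d[j - 1] + 1,                                # insert
            )
    return max(m - d[m], 0)

print(span_score("ABCD", "ABCDE"))  # 4, as under partial credit
print(span_score("BCDE", "ABCDE"))  # 4, rather than the partial-credit 0
```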
... To build such a picture, we typically want to know the minimum number of mutational steps between one sequence and another, and where the mutations are located. Although algorithms now exist for considering inversions and transpositions between one sequence and another [195], the methods of choice are generally based on dynamic programming [163,196,203]. These concentrate on finding the most biologically probable way of introducing substitutions and deletions to sequences; they are elegant, generally robust and provide the best approximations to protein evolution, given that we are prepared to grant certain assumptions. ...
... Some progress has been made towards the production of more accurate alignments without considering the surrounding context of where the gap is made. Discounting gap penalties at the N and C termini for the moment, one can divide conventional gap penalties into three types (each is sketched in code below): 1. Linear gap penalties, where the overall gap penalty at any position is of the form n × g, where g is the gap penalty and n is the number of times insertions or deletions are made [196]; 2. Affine gap penalties, where the penalty is of the form a + (n × g), where a is the cost of opening a gap [97, 10]; 3. Concave gap penalties, where the cost of gap extension decreases after the first gap is made [153]. ...
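The three penalty types can be written out directly; the parameter values below are illustrative only (none are taken from the thesis), and the logarithm is just one common concave choice:

```python
import math

def linear_gap(n, g=1.0):
    """Type 1: penalty n * g for a gap of n insertions/deletions."""
    return n * g

def affine_gap(n, a=10.0, g=0.5):
    """Type 2: opening cost a plus extension cost g per position."""
    return a + n * g

def concave_gap(n, a=10.0, g=2.0):
    """Type 3: extension grows ever more slowly, here a + g * ln(n)."""
    return a + g * math.log(n)   # n >= 1
```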
Thesis
The structure any particular part of a protein adopts is determined by the sequence itself and the surrounding context or chemical environment. The correct prediction of protein structure at any particular position in the sequence therefore requires a component relating the context of the position to the sequence at that position. In this thesis we formalize the embedding of context in protein grammars which describe the arrangement of structural features in protein families; we also develop an algorithm to recognize these grammars. This algorithm requires substantial extension of classical syntactic analysis to cope with the problem of overlapping tokens. Development of new methods to rank grammatical paths and evaluate semantic predicates was also necessary. The interaction of context and sequence is manifest in the type of mutations that occur at a particular point in a multiple sequence alignment; in particular, whether changes that occur there are related to changes elsewhere in the multiple alignment. Such changes are referred to as coordinated changes; a method was devised to recognise these changes at positions in protein multiple sequence alignments using different models of amino acid relatedness. The method had been designed to bypass the limitations of previous methods based on simple pattern matching. We discuss the implications of the ability to harness contextual information from evolutionary, folding and comparative structural analysis for topology prediction.
... In the running example, γ_SN(⟨a, c, b, d, e⟩) equals γ_2, which is an optimal alignment, and λ(σ_L) = ⟨a, c, b, e⟩ is the corresponding model trace for the optimal alignment. We can compute the distance of two traces (or two sequences) faster using the adapted version of Levenshtein distance [23]. Suppose that σ, σ′ ∈ A*; the Edit Distance function (σ, σ′) → ℕ returns the minimum number of edits that are needed to transform σ to σ′. ...
... If the user decides to use alignments for creating model behavior, she can select candidates based on their frequency, at random, or using the clustering algorithm. For finding the distance of a log trace and a model trace, we used the edit distance function, which is an adapted version of Levenshtein distance [23]. To cluster traces, we implement the K-Medoids algorithm that returns one trace as a candidate for each cluster [27] based on their edit distance. ...
Chapter
Conformance checking techniques let us find out to what degree a process model and real execution data correspond to each other. In recent years, alignments have proven extremely useful in calculating conformance statistics. Most techniques to compute alignments provide an exact solution. However, in many applications, it is enough to have an approximation of the conformance value. Specifically, for large event data, the computation time for alignments is considerably long using current techniques, which makes them inapplicable in reality. Also, it is no longer feasible to use standard hardware for complex process models. This paper proposes new approximation techniques to compute approximated conformance checking values close to exact solution values in less time. These methods also provide upper and lower bounds for the approximated alignment value. Our experiments on real event data show that it is possible to improve the performance of conformance checking by using the proposed methods compared to using the state-of-the-art alignment approximation technique. Results show that in most of the cases, we provide tight bounds, accurate approximated alignment values, and similar deviation statistics.
... Positions 1 and 53 (for instance) are unpaired, as indicated by a dot, while positions 2 and 52 are paired and form the outermost base pair (2, 52); positions 12 and 16 are paired, and base pair (12, 16) constitutes one of the two apical (hairpin) loops, while the other apical (hairpin) loop is closed by the base pair (31, 40), etc. ...
... In [39], Sellers' (distance-based) global pairwise alignment algorithm [40] was rigorously shown to be equivalent to Needleman and Wunsch's (similarity-based) global pairwise alignment algorithm [8]. Recall that Sellers' alignment distance is defined as the minimum, taken over all alignments, of the sum of distances d(x, y) between aligned nucleotides x, y plus the sum of (positive) weights w(k) for size-k gaps, while Needleman-Wunsch alignment similarity is defined as the maximum, taken over all alignments, of the sum of similarities s(x, y) between aligned nucleotides x, y plus the sum of (negative) gap weights g(k) for size-k gaps. Smith and Waterman [39] show that, by a suitable redefinition of the scores and by taking the minimum distance rather than the maximum similarity, the Needleman-Wunsch algorithm is transformed into Sellers' algorithm. ...
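The equivalence can be phrased as a mechanical transformation of scores. The sketch below gives one standard construction in the spirit of this argument (an illustration under stated assumptions, not the precise transformation of [39]): choose a constant c large enough that all resulting costs are non-negative.

```python
def similarity_to_distance(s, gap_sim, c):
    """Turn a similarity scheme into a distance scheme:
    d(x, y) = c - s(x, y) and w(k) = k * c / 2 - gap_sim(k).
    An alignment of sequences of lengths n and m with A aligned pairs and
    total gap length G satisfies n + m = 2A + G, so every alignment obeys
    D = c * (n + m) / 2 - S; minimizing D and maximizing S therefore select
    the same optimal alignments."""
    def d(x, y):
        return c - s(x, y)
    def w(k):
        return k * c / 2 - gap_sim(k)
    return d, w
```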
Article
Full-text available
Alignment of structural RNAs is an important problem with a wide range of applications. Since function is often determined by molecular structure, RNA alignment programs should take into account both sequence and base-pairing information for structural homology identification. This paper describes C++ software, RNAmountAlign, for RNA sequence/structure alignment that runs in O(n³) time and O(n²) space for two sequences of length n; moreover, our software returns a p-value (transformable to expect value E) based on Karlin-Altschul statistics for local alignment, as well as parameter fitting for local and global alignment. Using incremental mountain height, a representation of structural information computable in cubic time, RNAmountAlign implements quadratic time pairwise local, global and global/semiglobal (query search) alignment using a weighted combination of sequence and structural similarity. RNAmountAlign is capable of performing progressive multiple alignment as well. Benchmarking of RNAmountAlign against LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA, and MUSCLE shows that RNAmountAlign has reasonably good accuracy and faster run time supporting all alignment types. Additionally, our extension of RNAmountAlign, called RNAmountAlignScan, which scans a target genome sequence to find hits having high sequence and structural similarity to a given query sequence, outperforms RSEARCH and sequence-only query scans and runs faster than FOLDALIGN query scan.
... We are able to compute the distance of two traces (or two sequences) in a faster way using the adapted version of Levenshtein edit distance [22]. Suppose that σ, σ′ ∈ A*; the Edit Distance function (σ, σ′) → ℕ is the minimum number of edits that are needed to transform σ to σ′. ...
... If the user decides to use alignments for creating model behavior, she can select candidates based on their frequency, at random, or using the clustering algorithm. For finding the distance of a log trace and a model trace, we used the edit distance function, which is an adapted version of Levenshtein distance [22]. To cluster traces, we implement the K-Medoids algorithm that returns one trace as a candidate for each cluster [26] based on their edit distance. ...
Preprint
Full-text available
Conformance checking techniques let us find out to what degree a process model and real execution data correspond to each other. In recent years, alignments have proven extremely useful in calculating conformance statistics. Most techniques to compute alignments provide an exact solution. However, in many applications, it is enough to have an approximation of the conformance value. Specifically, for large event data, the computing time for alignments is considerably long using current techniques which makes them inapplicable in reality. Also, it is no longer feasible to use standard hardware for complex processes. Hence, we need techniques that enable us to obtain fast, and at the same time, accurate approximation of the conformance values. This paper proposes new approximation techniques to compute approximated conformance checking values close to exact solution values in a faster time. Those methods also provide upper and lower bounds for the approximated alignment value. Our experiments on real event data show that it is possible to improve the performance of conformance checking by using the proposed methods compared to using the state-of-the-art alignment approximation technique. Results show that in most of the cases, we provide tight bounds, accurate approximated alignment values, and similar deviation statistics.
... The Levenshtein distance between two strings is defined as the minimum number of edits required to transform one string into the other, where the allowed operations include insertion, deletion, and substitution of a single character. Under certain parameter settings, the Needleman-Wunsch algorithm is equivalent to computing the Levenshtein distance between two DNA sequences [26]. Therefore, Levenshtein distance can be applied as a metric to measure the similarity between DNA sequences. ...
Preprint
Computing the similarity between two DNA sequences is of vital importance in bioscience. However, traditional computational methods can be resource-intensive due to the enormous sequence length encountered in practice. Recently, applied quantum algorithms have been anticipated to provide potential advantages over classical approaches. In this paper, we propose a permutation-invariant variational quantum kernel method specifically designed for DNA comparison. To represent the four nucleotide bases in DNA sequences with quantum states, we introduce a novel, theoretically motivated encoding scheme: the four distinct bases are encoded using the states of symmetric, informationally complete, positive operator-valued measures (SIC-POVMs). This encoding ensures mutual equality: each pair of symbols is equidistant on the Bloch sphere. Also, since permutation invariance is inherent to common DNA similarity measures such as Levenshtein distance, we realize it by using a specially designed parameterized quantum layer. We show that our novel encoding method and parameterized layers used in the quantum kernel model can effectively capture the symmetric characteristics of the pairwise DNA sequence comparison task. We validate our model through numerical experiments, which yield promising results on length-8 DNA sequences.
... [7]. Furthermore, a similar matrix is used to compute several variations of the edit distance via usual dynamic programming (edit distance, Hamming distance, LCS length, to name a few) [10,11,13,14,18]. ...
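As one example of such a variation, the same (n + 1) × (m + 1) table with an adjusted recurrence yields the LCS length; a minimal Python sketch:

```python
def lcs_length(a, b):
    """Longest-common-subsequence length via the usual DP matrix:
    extend the diagonal on a match, otherwise carry the best neighbor."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]
```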
Chapter
Full-text available
The Continuous Interval Hamming distance (CIH) was introduced in 2010 in the context of detecting similarity for huge string data, such as genome sequences. Given two input strings, this metric provides a guarantee on the number of errors between each pair of aligned substrings of a given length k (called k-mers), while retaining a good definition of maximality. Indeed, the set of CIH-maximal substrings of two strings can be used to define maximal areas of similarity within a limited error ratio, which is hard to do with other widespread measures. Still, CIH has a major drawback: it has a low tolerance for insertion and deletion errors, which arise quite commonly in practical applications. With the aim of overcoming this issue, in this chapter we go a step beyond, introducing several novel similarity measures based on CIH-maximal substrings.
... Update 4.2 can readily be carried out by applying the Viterbi algorithm (Forney, 1973) ("dynamic programming") on a trellis with the pairs of coincident events (x_jk, x′_j′k′) as states, or equivalently, by applying the max-product algorithm on a cycle-free factor graph (Loeliger, 2004; Loeliger et al., 2007) of p(x, x′, j, j′, δ_t, s_t). The procedure is equivalent to dynamic time warping (Myers & Rabiner, 1981); it is, for example, used in the context of bio-informatics to compute the distance between genetic sequences (Sellers, 1974, 1979). It is also applied in neuroscience to compute various spike metrics (Victor & Purpura, 1997; Aronov, 2003; Victor et al., 2007). ...
Article
Full-text available
We present a novel approach to quantify the statistical interdependence of two time series, referred to as stochastic event synchrony (SES). The first step is to extract “events” from the two given time series. The next step is to try to align events from one time series with events from the other. The better the alignment, the more similar the two time series are considered to be. More precisely, the similarity is quantified by the following parameters: time delay, variance of the timing jitter, fraction of noncoincident events, and average similarity of the aligned events. The pairwise alignment and SES parameters are determined by statistical inference. In particular, the SES parameters are computed by maximum a posteriori (MAP) estimation, and the pairwise alignment is obtained by applying the max-product algorithm. This letter deals with one-dimensional point processes; the extension to multidimensional point processes is considered in a companion letter in this issue. By analyzing surrogate data, we demonstrate that SES is able to quantify both timing precision and event reliability more robustly than classical measures can. As an illustration, neuronal spike data generated by the Morris-Lecar neuron model are considered.
... Most often, similarity of spike trains is assumed to be computed by embedding them in a vector space and computing the Euclidean distance between them [16], or via a "spike time" metric. The latter is an "edit-length distance" [17] given by the minimum "cost" to morph one spike train into another by inserting or deleting spikes, or shifting them in time [19]. ...
Preprint
We introduce the notion of a "walk with jumps", which we conceive as an evolving process in which a point moves in a space (for us, typically ℍ²) over time, in a consistent direction and at a consistent speed, except that it is interrupted by a finite set of "jumps" in a fixed direction and distance from the walk direction. Our motivation is biological; specifically, to use walks with jumps to encode the activity of a neuron over time (a "spike train"). Because (in ℍ²) the walk is built out of a sequence of transformations that do not commute, the walk's endpoint encodes aspects of the sequence of jump times beyond their total number, but does so incompletely. The main results of the paper use the tools of hyperbolic geometry to give positive and negative answers to the following question: to what extent does the endpoint of a walk with jumps faithfully encode the walk's sequence of jump times?
... This classic approach to aligning two sequences computes a table where each cell contains the edit distance between a prefix of the first sequence and a prefix of the second by reusing the solutions for shorter prefixes. This quadratic DP was introduced for speech signals (Vintsyuk, 1968) and genetic sequences (Needleman and Wunsch, 1970; Sankoff, 1972; Sellers, 1974; Wagner and Fischer, 1974). The quadratic O(nm) runtime for sequences of lengths n and m allowed for the alignment of long sequences at the time, but speeding it up has been a central goal in later works. ...
Article
Full-text available
Motivation: Sequence alignment has been at the core of computational biology for half a century. Still, it is an open problem to design a practical algorithm for exact alignment of a pair of related sequences in linear-like time (Medvedev, 2023a). Methods: We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm. In order to efficiently align long sequences with high divergence, we extend the recently proposed seed heuristic (Ivanov et al., 2022) with match chaining, gap costs, and inexact matches. We additionally integrate the novel match pruning technique and diagonal transition (Ukkonen, 1985) to improve the A* search. We prove the correctness of our algorithm, implement it in the A*PA aligner, and justify our extensions intuitively and empirically. Results: On random sequences of divergence d = 4% and length n, the empirical runtime of A*PA scales near-linearly with length (best fit n^1.06, n ≤ 10^7 bp). A similar scaling remains up to d = 12% (best fit n^1.24, n ≤ 10^7 bp). For n = 10^7 bp and d = 4%, A*PA reaches >500× speedup compared to the leading exact aligners Edlib and BiWFA. The performance of A*PA is highly influenced by long gaps. On long (n > 500 kbp) ONT reads of a human sample it efficiently aligns sequences with d < 10%, leading to a 3× median speedup compared to Edlib and BiWFA. When the sequences come from different human samples, A*PA performs 1.7× faster than Edlib and BiWFA. Availability: github.com/RagnarGrootKoerkamp/astar-pairwise-aligner. Supplementary information: Supplementary data are available at Bioinformatics online.
... The edit distance computation is one of the cornerstone problems in computer science with wide applications. In the static setting, the complexity of the edit distance problem is well understood: under the Strong Exponential Time Hypothesis (SETH), there does not exist a truly subquadratic algorithm for computing edit distance [ABW15, BK15, AHWW16, BI18], whereas a textbook dynamic program [Vin68, NW70, WF74, Sel74] solves the problem trivially in O(n^2) time for strings of lengths at most n. This has naturally fueled the quest of designing fast approximation algorithms that run in subquadratic, near-linear, or even sublinear time, an area witnessing tremendous growth [LV88, BEK+03, BES06, AO09, AKO10, BEG+21, CDG+20, GKS19, GRS20, KS20b, BR20, AN20, KS20a, GKKS22, BCFN22a, BCFN22b]. ...
Preprint
The edit distance is a fundamental measure of sequence similarity, defined as the minimum number of character insertions, deletions, and substitutions needed to transform one string into the other. Given two strings of length at most n, simple dynamic programming computes their edit distance exactly in O(n^2) time, which is also the best possible (up to subpolynomial factors) assuming the Strong Exponential Time Hypothesis (SETH). The last few decades have seen tremendous progress in edit distance approximation, where the runtime has been brought down to subquadratic, near-linear, and even sublinear at the cost of approximation. In this paper, we study the dynamic edit distance problem, where the strings change dynamically as the characters are substituted, inserted, or deleted over time. Each change may happen at any location of either of the two strings. The goal is to maintain the (exact or approximate) edit distance of such dynamic strings while minimizing the update time. The exact edit distance can be maintained in Õ(n) time per update (Charalampopoulos, Kociumaka, Mozes; 2020), which is again tight assuming SETH. Unfortunately, even with the unprecedented progress in edit distance approximation in the static setting, strikingly little is known regarding dynamic edit distance approximation. Utilizing the off-the-shelf tools, it is possible to achieve an O(n^c)-approximation in n^{0.5−c+o(1)} update time for any constant c ∈ [0, 1/6]. Improving upon this trade-off remains open. The contribution of this work is a dynamic n^{o(1)}-approximation algorithm with amortized expected update time of n^{o(1)}. In other words, we bring the approximation-ratio and update-time product down to n^{o(1)}. Our solution utilizes an elegant framework of precision sampling tree for edit distance approximation (Andoni, Krauthgamer, Onak; 2010).
... Most sequence alignment methods are based on the shortest run time and the best score obtained by dynamic programming algorithms, such as those originally applied by Needleman and Wunsch [1] and formalized by Sellers [2] and Waterman et al. [3]. They provide different approaches based on the calculation of the minimum distance between two sequences. ...
Preprint
Full-text available
Protein sequence alignment has many applications in molecular biology. The existing algorithms produce imperfect alignments because of two factors: 1) the proposed gap penalty function and its parameter values, and 2) the gap problems. The present work is an attempt to obtain correct alignments using a novel alignment method based on the harmonic identical words present in the sequences to be aligned. To prove the accuracy of this approach, our method is applied to three pairs of sequences. The results are compared with the corresponding results obtained from three alignment programs: the EMBL-EBI EMBOSS Needle program with minimum number of gaps, and the EMBL-EBI MUSCLE and Clustal Omega programs. The comparison shows that, in the case of highly similar protein sequences, the identity and the maximum score value of our alignment method are higher than those of the EMBL-EBI EMBOSS Needle, MUSCLE and Clustal Omega programs. In the case of distantly related sequences, the maximum score value of our alignment is much higher than that of the EMBL-EBI EMBOSS Needle, MUSCLE and Clustal Omega programs, and the obtained percent identity is shifted from the twilight zone to the safe zone. Our results also show that the gap lengths distribution follows a power law, and this is not achieved by many current alignment programs.
... For a, b, c ∈ Σ ∪ {−}: (ii) γ_{a→b} = γ_{b→a}, and (iii) γ_{a→c} ≤ γ_{a→b} + γ_{b→c}. Sellers (1974) showed that scoring matrices in M_C induce an optA_γ-metric on sequences. The class of scoring matrices M_A is such that, for each a, b, c ∈ Σ, ...
Preprint
Full-text available
Sequence comparison is a basic task to capture similarities and differences between two or more sequences of symbols, with countless applications such as in computational biology. An alignment is a way to compare sequences, where a given scoring function determines the degree of similarity between them. Many scoring functions are obtained from scoring matrices. However, not all scoring matrices induce scoring functions which are distances, since the scoring function is not necessarily a metric. In this work we establish necessary and sufficient conditions for scoring matrices to induce each one of the properties of a metric in weighted edit distances. For a subset of scoring matrices that induce normalized edit distances, we also characterize each class of scoring matrices inducing normalized edit distances. Furthermore, we define an extended edit distance, which takes into account a set of editing operations that transforms one sequence into another regardless of the existence of a usual corresponding alignment to represent them, describing a criterion to find a sequence of edit operations whose weight is minimum. Similarly, we determine the class of scoring matrices that induces extended edit distances for each of the properties of a metric.
... The raw sequencing data were deposited in the NCBI SRA under bioproject ID PRJNA769063. The sequencing data were then checked for their quality and processed via the FROGS analysis pipeline developed by the GenoToul genomic platform in the Galaxy interface (Escudié et al., 2018), using Flash to merge the paired-end reads (Magoc and Salzberg, 2011), Swarm for sequence clustering based on Sellers' evolutionary distance (Sellers, 1974; Mahé et al., 2014), and VSEARCH with the de novo UCHIME method to eliminate chimeras (Edgar et al., 2011; Rognes et al., 2016). From the initial 518,289 reads, the pre-processing and filtration steps led to 409,781 reads. ...
Article
Full-text available
To be effective, microbiological studies of deep aquifers must be free from surface microbial contaminants and from infrastructures allowing access to formation water (wellheads, well completions). Many microbiological studies are based on water samples obtained after rinsing a well without guaranteeing the absence of contaminants from the biofilm development in the pipes. The protocol described in this paper presents the adaptation, preparation, sterilization and deployment of a commercial downhole sampler (PDSshort, Leutert, Germany) for the microbiological studying of deep aquifers. The ATEX sampler (i.e., explosive atmospheres) can be deployed for geological gas storage (methane, hydrogen). To validate our procedure and confirm the need to use such a device, cell counting and bacterial taxonomic diversity based on high-throughput sequencing for different water samples taken at the wellhead or at depth using the downhole sampler were compared and discussed. The results show that even after extensive rinsing (7 bore volumes), the water collected at the wellhead was not free of microbial contaminants, as shown by beta-diversity analysis. The downhole sampler procedure was the only way to ensure the purity of the formation water samples from the microbiological point of view. In addition, the downhole sampler allowed the formation water and the autochthonous microbial community to be maintained at in situ pressure for laboratory analysis. The prevention of the contamination of the sample and the preservation of its representativeness are key to guaranteeing the best interpretations and understanding of the functioning of the deep biosphere.
... Here, an event, or a phrase, refers to a sequence of consecutive segments that share the same predicted label. This method is suitable for tasks requiring precise start (and end) times. ... Therefore, we propose a new way to evaluate predicted against observed phrases with dynamic programming used for sequence alignment (Eddy, 2004; Sellers, 1974). Our third measure is simply how well we predict the total number of phrases in an audio file (the encounter rate accuracy). ...
Article
Full-text available
When recorders are used to survey acoustically conspicuous species, identification of calls of the target species in recordings is essential for estimating density and abundance. We investigate how well deep neural networks identify vocalisations consisting of phrases of varying lengths, each containing a variable number of syllables. We use recordings of Hainan gibbon Nomascus hainanus vocalisations to develop and test the methods. We propose two methods for exploiting the two-level structure of such data. The first combines convolutional neural network (CNN) models with a hidden Markov model (HMM) and the second uses a convolutional recurrent neural network (CRNN). Both models learn acoustic features of syllables via a CNN and temporal correlations of syllables into phrases either via an HMM or a recurrent network. We compare their performance to commonly used CNNs LeNet and VGGNet, and a support vector machine (SVM). We also propose a dynamic programming method to evaluate how well phrases are predicted. This is useful for evaluating performance when vocalisations are labelled by phrases, not syllables. Our methods perform substantially better than the commonly used methods when applied to the gibbon acoustic recordings. The CRNN has an F-score of 90% on phrase prediction, which is 18% higher than the best of the SVM or LeNet and VGGNet methods. HMM post-processing raised the F-score of these last three methods to as much as 87%. The number of phrases is overestimated by CNNs and SVM, leading to error rates between 49% and 54%. With HMM, these error rates can be reduced to 0.4% at the lowest. Similarly, the error rate of CRNN's prediction is no more than 0.5%. CRNNs are better at identifying phrases of varying lengths composed of a varying number of syllables than simpler CNN or SVM models. We find a CRNN model to be best at this task, with a CNN combined with an HMM performing almost as well. We recommend that these kinds of models are used for species whose vocalisations are structured into phrases of varying lengths.
... (Miller et al. 2009). It was shown by Sellers (1974) that edit distance and optimal global alignment are equivalent problems. Bartz et al. (2008) designed a logistic regression to calculate the probability of crash reports being duplicates. ...
Article
Full-text available
Software systems can automatically submit crash reports to a repository for investigation when program failures occur. A significant portion of these crash reports are duplicate, i.e., they are caused by the same software issue. Therefore, if the volume of submitted reports is very large, automatic grouping of duplicate crash reports can significantly ease and speed up analysis of software failures. This task is known as crash report deduplication. Given a huge volume of incoming reports, increasing quality of deduplication is an important task. The majority of studies address it via information retrieval or sequence matching methods based on the similarity of stack traces from two crash reports. While information retrieval methods disregard the position of a frame in a stack trace, the existing works based on sequence matching algorithms do not fully consider subroutine global frequency and unmatched frames. Besides, due to data distribution differences among software projects, parameters that are learned using machine learning algorithms are necessary to provide more flexibility to the methods. In this paper, we propose TraceSim – an approach for crash report deduplication which combines TF-IDF, optimum global alignment, and machine learning (ML) in a novel way. Moreover, we propose a new evaluation methodology for this task that is more comprehensive and robust than previously used evaluation approaches. TraceSim significantly outperforms seven baselines and state-of-the-art methods in the majority of the scenarios. It is the only approach that achieves competitive results on all datasets regarding all considered metrics. Moreover, we conduct an extensive ablation study that demonstrates the importance of each TraceSim’s element to its final performance and robustness. Finally, we provide the source code for all considered methods and evaluation methodology as well as the created datasets.
... If a given criterion v depends on a scoring matrix γ and it is a metric on Σ*, we say that the scoring matrix γ induces a v-distance on Σ*. Sellers [22] showed that matrices in M_C induce an optA_γ-distance on Σ* and Araujo and Soares [5] showed that γ ∈ M_A if and only if γ induces an optA_γ-distance on Σ*. Furthermore, γ ∈ M_N if and only if γ induces an optN_γ-distance on Σ*. ...
Conference Paper
Full-text available
Sequence alignment supports numerous tasks in bioinformatics, natural language processing, pattern recognition, social sciences, and other fields. While the alignment of two sequences may be performed swiftly in many applications, the simultaneous alignment of multiple sequences proved to be naturally more intricate. Although most multiple sequence alignment (MSA) formulations are NP-hard, several approaches have been developed, as they can outperform pairwise alignment methods or are necessary for some applications. Taking into account not only similarities but also the lengths of the compared sequences (i.e. normalization) can provide better alignment results than both unnormalized or post-normalized approaches. While some normalized methods have been developed for pairwise sequence alignment, none have been proposed for MSA. This work is a first effort towards the development of normalized methods for MSA. We discuss multiple aspects of normalized multiple sequence alignment (NMSA). We define three new criteria for computing normalized scores when aligning multiple sequences, showing the NP-hardness and exact algorithms for solving the NMSA using those criteria. In addition, we provide approximation algorithms for MSA and NMSA for some classes of scoring matrices.
... If a given criterion v, depending on a scoring matrix γ, is a metric on Σ*, we say that the scoring matrix γ induces a v-distance. Sellers [Sel74] showed that matrices in M_C induce an optA_γ-metric on Σ* and Araujo and Soares [AS06] showed that γ ∈ M_A if and only if γ induces an optA_γ-metric on Σ*. Moreover, γ ∈ M_N if and only if γ induces an optN_γ-metric on Σ*. ...
Preprint
Full-text available
Sequence alignment supports numerous tasks in bioinformatics, natural language processing, pattern recognition, social sciences, and other fields. While the alignment of two sequences may be performed swiftly in many applications, the simultaneous alignment of multiple sequences proved to be naturally more intricate. Although most multiple sequence alignment (MSA) formulations are NP-hard, several approaches have been developed, as they can outperform pairwise alignment methods or are necessary for some applications. Taking into account not only similarities but also the lengths of the compared sequences (i.e. normalization) can provide better alignment results than both unnormalized or post-normalized approaches. While some normalized methods have been developed for pairwise sequence alignment, none have been proposed for MSA. This work is a first effort towards the development of normalized methods for MSA. We discuss multiple aspects of normalized multiple sequence alignment (NMSA). We define three new criteria for computing normalized scores when aligning multiple sequences, showing the NP-hardness and exact algorithms for solving the NMSA using those criteria. In addition, we provide approximation algorithms for MSA and NMSA for some classes of scoring matrices.
... In this thesis, we propose a method based on a guided process model simulation to generate model behaviors similar to the behaviors recorded in the event log. Since the proposed method is based on a simulation and we use the Edit distance metric [12], it is notation-free and does not require computing alignments. By using both simulated and recorded behaviors, we are able to compute alignment approximations. ...
Thesis
Full-text available
Many organizations are relying on information systems that support the execution of their business processes. Process mining exploits information from these systems to obtain valuable insights. In the last few years, conformance checking has been gaining more importance within the process mining branches. Conformance checking measures the degree to which business process execution is in accordance with a process model. Alignments have been widely used due to their deterministic nature, which ensures exact conformance statistics. One of the major drawbacks of alignments is that they are computationally expensive. In addition, with the recent growth in data volume, alignment computation is sometimes practically infeasible. In many applications, knowing the exact cost of the alignment is unnecessary, and an approximation is sufficient. In this thesis, we present a method for computing the cost of approximate alignments through a guided process model simulation. The proposed method is independent of the process model notation, and it also provides upper and lower bounds to ensure the accuracy of the approximation value. Extensive experiments were performed using real-life event logs obtaining significant execution time improvements compared to state-of-the-art alignment techniques. Moreover, the results show in some experiments that the approximation error is negligible.
... The algorithm depended mainly on the heuristic homology algorithm of Needleman & Wunsch [9]. The calculation matrix was then developed by Sellers [12] to include true measures of the actual distance, which was further generalized by Waterman [13]. The latter included insertions and deletions of arbitrary length, which yielded the minimum number of mutational events needed for conversion between two sequences. ...
... Sellers only allowed insertions/deletions of length one, but the generalization to insertions/deletions of length k was later made by Waterman et al. [6]. Deletions of length k are assigned weight x_k ≥ 0. The distance measure between a and b is then ... The next theorem was given in [6] and generalizes the work of Sellers [2]. The proof follows the general lines of Theorem 1, but the inclusion of longer insertions/deletions is more difficult. ...
Article
Homology and distance measures have been routinely used to compare two biological sequences, such as proteins or nucleic acids. The homology measure of Needleman and Wunsch is shown, under general conditions, to be equivalent to the distance measure of Sellers. A new algorithm is given to find similar pairs of segments, one segment from each sequence. The new algorithm, based on homology measures, is compared to an earlier one due to Sellers.
... In order to obtain comparable results, we evaluate both NWA and EDR applying the above equations. Sellers showed that approaches formulated in terms of maximizing similarity (NWA) and minimizing edit distance (EDR) are equivalent (Sellers, 1974). For that reason, we assume that they can be evaluated in the same way. ...
Article
Full-text available
We evaluate whether the Needleman-Wunsch algorithm is suitable for user trajectory comparison. The problem that we aim to solve is pair-wise user trajectory comparison. Similar user trajectories are then clustered with respect to their similarity, where clusters emerge in a non-supervised way. We assume that user position, provided by GPS (Global Positioning System), is normally distributed around the user's actual position. This assumption allows us to derive a model for setting the score for a match, the penalty for a gap and the penalty for a mismatch, which are an input to the Needleman-Wunsch algorithm. Our model implies that, in scenarios where actual user position is unknown and must thus be estimated from measured positions, the Needleman-Wunsch algorithm may be prevented from applying mismatches. In an experimental evaluation, we apply two data sets that contain recorded user positions and we show that our approach based on the Needleman-Wunsch algorithm is capable of correct classification of user trajectories into groups. Unlike in existing literature, we show that in GPS-based user trajectory comparison, it is indeed not necessary to consider mismatches when applying the Needleman-Wunsch algorithm. This leads to a simplified string editing problem known as Longest Common Subsequence (LCS). We compare our approach with Edit Distance on Real sequence (EDR) in order to provide an insight into the performance of our approach. Applying the Needleman-Wunsch algorithm has helped to solve several problems that emerge in GPS-based user trajectory comparison, such as interrupted GPS service due to satellite occlusion and various signal propagation phenomena such as signal reflection, fading, etc. In order to improve the efficiency of the Needleman-Wunsch algorithm, we apply Move ability to identify when such detrimental conditions could occur. We also apply linear approximation in order to enhance user GPS trajectories with missing points, which further improves the efficiency of user trajectory comparison.
... Victor and Purpura [33] proposed a single-unit algorithm to compute the spike train distance between two spike trains based on the dynamic programming algorithm introduced by Sellers [34]. The algorithm is briefly reviewed here. ...
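In outline, the single-unit algorithm fills a Sellers-style table over the two spike trains: deleting or inserting a spike costs 1, and shifting a spike by Δt costs q·|Δt|. A minimal Python sketch (parameter names are illustrative):

```python
def spike_time_distance(t, s, q=1.0):
    """Victor-Purpura-style single-unit distance between spike-time lists
    t and s: D[i][j] is the cheapest morph of the first i spikes of t into
    the first j spikes of s."""
    n, m = len(t), len(s)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = float(i)               # delete all i spikes
    for j in range(1, m + 1):
        D[0][j] = float(j)               # insert all j spikes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + 1.0,                               # delete t_i
                D[i][j - 1] + 1.0,                               # insert s_j
                D[i - 1][j - 1] + q * abs(t[i - 1] - s[j - 1]),  # shift
            )
    return D[n][m]
```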
Article
Tactile sensing with spiking neural networks (SNNs) has attracted increasing attention in the past decades. In this article, a novel SNN framework is proposed for the tactile surface roughness categorization task. In contrast to supervised SNN methods such as ReSuMe and Tempotron that require prespecifying target spike trains, the presented method performs the classification through directly comparing the distance between multineuron spike trains. Unlike simple spike train fusion methods using average pairwise spike train distance or pooled spike train distance, the proposed method merges spike trains from different neurons with the multineuron spike train distance, which can capture the complex correlation of multiple spike trains. Specifically, the spike trains are generated via the Izhikevich neurons from tactile signals. The similarity of the multineuron spike trains is computed using the multineuron Victor-Purpura spike train distance, which can be efficiently implemented in an inductive manner. The classification can be performed by incorporating k-nearest neighbors and the multineuron spike train distance as a similarity metric. The proposed framework is quite general, i.e., other multineuron spike train distances and spike train kernel-based methods can be readily incorporated. The effectiveness of the proposed method has been demonstrated on a tactile data set by comparing it with various feature- and spike-based methods.
... To achieve the most efficient parallelization of the algorithm, using as many parallel operations as possible (i.e., 16 simultaneous pairwise comparisons per CPU core), the magnitude of the score values of the highest scoring alignments to be calculated should be small, preferably fitting in one byte of memory (a byte can store one integer value ranging from 0 to 255). Swarm calculates the alignment score as follows: instead of computing the optimal global alignment similarity score as in the Needleman-Wunsch algorithm, Swarm identifies the alignments with the minimum edit distance, as described by Sellers (1974), by transforming the given similarity scoring system into an equivalent edit distance system. In the default scoring system a match is given a score of +5, a mismatch −4, gap opening −12, and gap extension −4. ...
Preprint
Full-text available
Popular de novo amplicon clustering methods suffer from two fundamental flaws: arbitrary global clustering thresholds, and input-order dependency induced by centroid selection. Swarm was developed to address these issues by first clustering nearly identical amplicons iteratively using a local threshold, and then by using clusters' internal structure and amplicon abundances to refine its results. This fast, scalable, and input-order independent approach reduces the influence of clustering parameters and produces robust operational taxonomic units, improving the amount of meaningful biological information that can be extracted from amplicon-based studies.
... We are going to introduce Sellers' algorithm [298]. Let Σ be a finite set of symbols, and let Σ* denote the set of finitely long sequences over Σ. ...
... So generally there are many possible combinations, and for large sequences and arbitrary gap sizes we have a combinatorial explosion. Fortunately, to cope with this problem, a very simple and effective computer algorithm has been developed, called Dynamic Programming (DP) (Needleman and Wunsch, 1970; Sellers, 1974; Smith and Waterman, 1981b). DP is used (in one form or another) in many methods that align sequences (and even structures (Taylor and Orengo, 1989; Orengo and Taylor, 1996)). ...
Thesis
The prediction of protein structures from their amino acid sequence alone is a very challenging problem. Using the variety of methods available, it is often possible to achieve good models or at least to gain some more information to aid scientists in their research. This thesis uses many of the widely available methods for the prediction and modelling of protein structures and proposes some new ideas for aiding the process. A new method for measuring the buriedness (or exposure) of residues is discussed, which may lead to a potential way of assessing proteins' individual amino acid placement and whether they have a standard profile. This may become useful in assessing predicted models. Threading analysis and modelling of structures for the Critical Assessment of Techniques for Protein Structure Prediction (CASP2) highlight inaccuracies in the current state of protein prediction, particularly with the alignment predictions of sequence on structure. An in-depth analysis of the placement of gaps within a multiple sequence threading method is discussed, with ideas for improving threading predictions through the construction of an improved gap penalty. A threading-based homology model was constructed with an RMSD of 6.2 Å, showing how combinations of methods can give usable results. Using a distance geometry method, DRAGON, the ab initio prediction of a protein (NK Lysin) for the CASP2 assessment was achieved with an accuracy of 4.6 Å. This highlighted several ideas in disulphide prediction and a novel method for predicting which cysteine residues might form disulphide bonds in proteins. Using a combination of all the methods, with some, like threading and homology modelling, proving inadequate, an ab initio model of the N-terminal domain of a GPCR was built based on secondary structure and predictions of disulphide bonds. The use of multiple sequences in comparing sequences to structures in threading should give enough information to enable the improvements required before threading can become a major way of building homology models. Furthermore, with the ability to predict disulphide bonds, restraints can be placed when building models, ab initio or otherwise.
... In an influential 1983 book about this topic, Sankoff and Kruskal [7] present the following list of independent discoveries of dynamic programming algorithms for sequence comparison ranging from 1968 to 1975 ([8]- [16]). We add to this list Sellers' 1974 paper [17]. Additionally, in [18] it was shown that edit distance and alignment similarity are dual; that is, given the weights for minimum distance alignment, there are weights for a corresponding similarity alignment and they have identical optimal alignments. ...
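A simple concrete instance of this duality, added here for illustration: when only insertions and deletions are allowed, each at unit cost, minimizing distance is the same as maximizing the length of a longest common subsequence, since every position is either matched (cost 0) or must be deleted from one sequence or inserted into the other (cost 1). For sequences A and B of lengths m and n,

\[
d_{\mathrm{indel}}(A, B) \;=\; m + n \;-\; 2\,\lvert \mathrm{LCS}(A, B) \rvert .
\]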
Article
Full-text available
Levenshtein edit distance has played a central role, both past and present, in sequence alignment in particular and biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. We then describe how those algorithms led to the heuristics employed in the most widely used software in bioinformatics, BLAST, a program to search DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates have resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders-of-magnitude acceleration of biological similarity search.
... In its original implementation (Needleman and Wunsch, 1970) this gap penalty was defined as a fixed value for the inclusion of a gap of any length. This idea was soon replaced by the concept of length-proportional gap costs (Sellers, 1974), where a gap of length x has the penalty P(x) = Gx, G being the cost of a gap of length 1. ...
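For comparison, the three gap-cost models mentioned across these excerpts can be summarized as follows (an illustrative summary; conventions for the affine form vary, and some authors write G_open + G_ext·x instead):

\[
P_{\mathrm{fixed}}(x) = G, \qquad
P_{\mathrm{linear}}(x) = Gx, \qquad
P_{\mathrm{affine}}(x) = G_{\mathrm{open}} + G_{\mathrm{ext}}\,(x - 1).
\]

The affine form, with separate opening and extension costs, is the one used in, for example, the Swarm scoring system described earlier.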
Thesis
Sequence database searching is a key tool in current bioinformatics. To improve accuracy, sequence database searches are often performed iteratively, taking the results of one search as input for the next; the object of this approach is to progressively isolate increasingly distant relations of the original query sequence. In practice this method works well when it is supervised by an 'expert eye' which can determine when an alignment is good and when sequences should be excluded from it, but attempts to automate this process have proven difficult. At present PSI-BLAST is one of the few effective attempts, but a misalignment of sequences or the wrongful inclusion of a sequence will still rapidly destroy the specificity of the probe, making incorrect matches more likely. By combining the search program Quest, which is capable of searching a database using full-length multiple sequence alignments, with independent sequence alignment and assessment programs, we have been able to reduce the occurrence of this problem. We use a multiple alignment package to generate an accurate alignment of all hits generated by the Quest program. Sequences that do not appear to 'fit' with the rest of the alignment are automatically removed by the separate alignment assessment program Mulfil. The resulting alignment is fed back to Quest for the next iteration. This scheme has been shown to generate results significantly better than those of PSI-BLAST. Whilst the total number of correct homologues identified was not increased, the number of incorrect ones dropped significantly. In addition, further work demonstrated that equally good results are possible without the use of multiple alignment or profile searching. The Cascade-and-Cluster scheme uses intermediate sequences and a simple clustering procedure and is able to produce results almost as sensitive and selective as our previous scheme, whilst running up to ten-fold faster.
Chapter
In search by pattern over GPS trajectories, the user draws a trajectory (the pattern query) and then receives a set of trajectories ranked by their similarity to it. We argue that when the user draws a pattern query, an initial part of this query (a prefix of chosen length) should carry more weight than the rest of the query. We assume that after receiving a set of similar trajectories, the user can refine the pattern query in order to receive more relevant results. We explain our approach by analogy with web search, where a user searches, for example, for “bratislava castle” and then adds a refinement to this query, “opening hours”; removing the initial part of the query does not make sense, as a search for “opening hours” alone would return irrelevant results. This idea has led us to consider pattern search that is weighted toward the query prefix. We evaluate this approach experimentally using the Geolife data set (Microsoft Research Asia). Keywords: GPS trajectory, Geolife data set, pattern search, Needleman-Wunsch algorithm, Smith-Waterman algorithm, Geohash, Hausdorff distance
Preprint
Full-text available
Motivation: Sequence alignment has been a core problem in computational biology for the last half-century. It is an open problem whether exact pairwise alignment is possible in linear time for related sequences (Medvedev, 2022b). Methods: We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm on the edit graph. In order to efficiently align long sequences with a high error rate, we extend the seed heuristic for A* (Ivanov et al., 2022) with match chaining, inexact matches, and the novel match pruning optimization. We prove the correctness of our algorithm and provide an efficient implementation in A*PA. Results: We evaluate A*PA on synthetic data (random sequences of length n with uniform mutations at error rate e) and on real long ONT reads of human data. On the synthetic data with e = 5% and n ≤ 10⁷ bp, A*PA exhibits a near-linear empirical runtime scaling of n^1.08 and achieves a >250× speedup compared to the leading exact aligners EDLIB and BIWFA. Even for a high error rate of e = 15%, the empirical scaling is n^1.28 for n ≤ 10⁷ bp. On two real datasets, A*PA is the fastest aligner for 58% of the alignments when the reads contain only sequencing errors, and for 17% of the alignments when the reads also include biological variation. Availability: github.com/RagnarGrootKoerkamp/astar-pairwise-aligner Contact: ragnar.grootkoerkamp@inf.ethz.ch, pesho@inf.ethz.ch
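The edit-graph formulation this abstract refers to is easy to state in code: edit distance is the cost of a shortest path from (0, 0) to (m, n) in a grid graph whose diagonal edges are free on matches. The sketch below uses plain Dijkstra, i.e., A* with a zero heuristic, and deliberately omits A*PA's seed heuristic, match chaining, and pruning; names are ours.

import heapq

def edit_distance_dijkstra(s, t):
    m, n = len(s), len(t)
    start, goal = (0, 0), (m, n)
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            return d
        if d > dist.get((i, j), float("inf")):
            continue  # stale heap entry
        # Edges of the edit graph: match/substitute, delete, insert.
        steps = []
        if i < m and j < n:
            steps.append(((i + 1, j + 1), 0 if s[i] == t[j] else 1))
        if i < m:
            steps.append(((i + 1, j), 1))   # delete s[i]
        if j < n:
            steps.append(((i, j + 1), 1))   # insert t[j]
        for v, w in steps:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist.get(goal)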
Chapter
Modern techniques for drug discovery rely on multidisciplinary approaches that combine advances in artificial intelligence, combinatorial methods, mathematical techniques, computational quantum and molecular dynamics techniques, molecular docking, hybrid techniques, and target-based therapeutic techniques in order to assist medicinal chemists and synthetic chemists in their efforts to discover new drugs. In the current overview, we outline primarily in silico and mathematical techniques for computer-assisted drug discovery. In particular we focus on combinatorial and topological methods; quantum chemical, docking, and molecular dynamics methods; nanomaterials; and artificial intelligence methods, including shape perception, radiomics, proteomics, and genomics, for aiding drug discovery. We consider specific cases of ovarian cancer, toxicological studies, hepatitis type C viral infections, and neurodegenerative diseases. We also consider holistic approaches that include bioactives and natural products in drug discovery. We provide an overview of mathematical modelling methods and artificial intelligence tools to facilitate detection and to aid the progression, administration, and design of tailor-made targeted therapies based on AI. Furthermore, the use of topological indices for structure-activity relations, together with combinatorial, graph-theoretic, and group-theoretical tools, including for phylogenetic trees, is also emphasized. The current overview encompasses multidisciplinary and multidirectional content, including computer-assisted artificial intelligence techniques for target-based drug discovery and delivery.
Article
Full-text available
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused the global pandemic coronavirus disease-2019 (COVID-19), which has resulted in 60.4 million infections and 1.42 million deaths worldwide. Mathematical models, as an integral part of artificial intelligence, are designed for contact tracing, genetic network analysis for uncovering the biological evolution of the virus, understanding the underlying mechanisms of the observed disease dynamics, evaluating mitigation strategies, and predicting the COVID-19 pandemic dynamics. This paper describes mathematical techniques to exploit and understand the progression of the pandemic through a topological characterization of underlying graphs. We have obtained several topological indices for various graphs of biological interest such as pandemic trees, Cayley trees, Christmas trees, and the corona product of Christmas trees and paths. We have also obtained an analytical expression for the thermodynamic entropies of pandemic trees as a function of R₀, the reproduction number, and the level of spread, using the nested wreath product groups. Our plots of entropy and logarithms of topological indices of pandemic trees accentuate the underlying severity of COVID-19 over the 1918 Spanish flu pandemic.
Article
Ovarian cancer is one of the leading gynecologic diseases, with a high mortality rate worldwide. Current statistical studies on cancer reveal that, over the past two decades, ovarian cancer has been the fifth most common cause of cancer-related death in females in the Western world. In spite of significant strides made in genomics, proteomics, and radiomics, there has been little progress in transitioning these research advances into effective clinical management of ovarian cancer. Consequently, researchers have diverted their attention to finding the various molecular processes involved in the development of this cancer and how these processes can be exploited to develop potential chemotherapeutics to treat it. The present review gives an overview of these studies, which may show researchers where we stand and where to go next. An unfortunate situation that still exists with ovarian cancer is that most patients show no symptoms until the disease has reached an advanced stage. Undoubtedly, several target-based drugs have been developed to treat it, but drug resistance and the recurrence of the disease remain problems. Some theoretical approaches have also been applied to the development of potential chemotherapeutics for ovarian cancer; a description of such methods and their success in this direction is also covered in this review.
Article
Full-text available
Given two finite sequences, we wish to find the longest common subsequences satisfying certain deletion/insertion constraints. Consider two successive terms in the desired subsequence. The distance between their positions must be the same in the two original sequences for all but a limited number of such pairs of successive terms. Needleman and Wunsch gave an algorithm for finding longest common subsequences without constraints. This is improved from the viewpoint of computational economy. An economical algorithm is then elaborated for finding subsequences satisfying deletion/insertion constraints. This result is useful in the study of genetic homology based on nucleotide or amino-acid sequences.
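For reference, the unconstrained recursion that this paper takes as its starting point, in its textbook dynamic-programming form (an illustrative sketch; the function name is ours):

def lcs_length(s, t):
    m, n = len(s), len(t)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1   # extend a common subsequence
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]

# e.g. lcs_length("AGCAT", "GAC") == 2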
Article
This chapter discusses a few combinatorial problems studied experimentally on computing machines. Through sampling experiments with examples on computing machines, hints and suggestions can be obtained about more general solutions. Combinatorial in nature, these problems range from questions of a number-theoretic character to problems suggested by situations encountered in certain schemata in biology. The chapter presents two classes of problems. In the first group, the main questions concern the behavior of sequences of symbols coding physical or biological properties. The second group of problems concerns the behavior of sequences of points distributed on an infinite line or on a plane. The chapter describes the properties of randomly distributed sequences of points in a Euclidean plane and presents a few problems for 3-space.
Article
A method based on mutation distances as estimated from cytochrome c sequences is of general applicability.
Article
A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed. From these findings it is possible to determine whether significant homology exists between the proteins. This information is used to trace their possible evolutionary development. The maximum match is a number dependent upon the similarity of the sequences. One of its definitions is the largest number of amino acids of one protein that can be matched with those of a second protein allowing for all possible interruptions in either of the sequences. While the interruptions give rise to a very large number of comparisons, the method efficiently excludes from consideration those comparisons that cannot contribute to the maximum match. Comparisons are made from the smallest unit of significance, a pair of amino acids, one from each protein. All possible pairs are represented by a two-dimensional array, and all possible comparisons are represented by pathways through the array. For this maximum match only certain of the possible pathways must be evaluated. A numerical value, one in this case, is assigned to every cell in the array representing like amino acids. The maximum match is the largest number that would result from summing the cell values of every pathway.