Conference Paper

A survey of longest common subsequence algorithms

Dept. of Comput. Sci., Turku Univ.
DOI: 10.1109/SPIRE.2000.878178
Conference: String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh International Symposium on
Source: IEEE Xplore

ABSTRACT The aim of this paper is to give a comprehensive comparison of well-known longest common subsequence (LCS) algorithms for two input strings and to study their behaviour in various application environments. The performance of the methods depends heavily on the properties of the problem instance as well as on the supporting data structures used in the implementation. We also want to make a clear distinction between methods that determine the actual LCS and those that calculate only its length, since the execution time and, more importantly, the space demand depend crucially on the type of task. To our knowledge, this is the first time this kind of survey has been done. Due to page limits, the paper gives only a coarse overview of the performance of the algorithms; more detailed studies are reported elsewhere.
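The distinction the abstract draws between computing only the LCS length and recovering an actual LCS can be illustrated with the textbook dynamic-programming approach. The sketch below is not taken from the paper; the function names `lcs_length` and `lcs_string` are hypothetical and chosen for illustration. It assumes plain Python strings as input and shows that the length alone needs only two table rows, whereas simple backtracking to recover a subsequence keeps the whole table.

```python
def lcs_length(a: str, b: str) -> int:
    """Return the LCS length using two rows of the DP table.

    Time is O(len(a) * len(b)); extra space is O(min(len(a), len(b))).
    """
    if len(b) > len(a):
        a, b = b, a  # keep the shorter string along the row
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_string(a: str, b: str) -> str:
    """Return one LCS by backtracking through the full DP table.

    Time and space are both O(len(a) * len(b)).
    """
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner, collecting matched characters.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

Calling `lcs_length("AGGTAB", "GXTXAYB")` gives the length without ever holding more than two rows, while `lcs_string` on the same inputs returns a concrete common subsequence at the cost of the full table. Linear-space recovery is also possible (for example, Hirschberg's divide-and-conquer technique), but it is not shown here.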

  • ABSTRACT: Recently, the amount of string data generated has increased dramatically. Consequently, statistical methods for analysing string data are required in many fields. However, few studies of statistical methods for string data based on probability theory have been conducted. In this study, by developing a theory of parametric statistical inference for string data on the basis of the probability theory on a metric space of strings developed in Koyano [2010], we address the problem of clustering string data in an unsupervised manner. First, we introduce a Laplace-like distribution on a metric space of strings and show its basic properties. We then construct maximum likelihood estimators of the location and dispersion parameters of the introduced distribution and examine their asymptotic behavior by applying limit theorems demonstrated in Koyano [2014]. After that, we derive an EM algorithm for the mixture model of the distributions and investigate its accuracy in the framework of statistical asymptotic theory.
  • ABSTRACT: We study the longest common subsequence of two given strings and the dynamic time warping distance of time series. Both are classic similarity measures of sequences that have a wealth of applications and can be computed in time $O(n^2)$. We prove that neither measure has a strongly subquadratic time algorithm, i.e., an algorithm with running time $O(n^{2-\varepsilon})$ for some $\varepsilon > 0$, unless the Strong Exponential Time Hypothesis fails. This adds two important problems to a recent line of research showing conditional lower bounds for a growing number of quadratic time problems. As our main technical contribution, we introduce a framework for proving quadratic lower bounds for similarity measures. To apply the framework it suffices to construct a single gadget, which encapsulates all the expressive power necessary to emulate a reduction from satisfiability that is similar to a recent reduction for the edit distance. We prove a quadratic lower bound for any similarity measure admitting such a gadget, and then design such gadgets for the problems under consideration.
  • ABSTRACT: Given a set $S=\{S_1,S_2,\ldots,S_l\}$ of $l$ strings, a text $T$, and a natural number $k$, find a string $M$ that is a concatenation of $k$ strings from $S$ (not necessarily distinct, i.e., a string in $S$ may occur more than once in $M$) and whose longest common subsequence with $T$ is largest. Such a string is called a $k$-inlay. The resequencing longest common subsequence problem (resequencing LCS problem for short) is to find a $k$-inlay for each query with parameter $k$ after $T$ and $S$ are given. In this paper, we propose an algorithm for solving this problem which takes $O(nml)$ preprocessing time and $O(\vartheta_k k)$ query time for each query with parameter $k$, where $n$ is the length of $T$, $m$ is the maximal length of strings in $S$, and $\vartheta_k$ is the length of the longest common subsequence between a $k$-inlay and $T$.
    Algorithmica 01/2013 · 0.57 Impact Factor

