Conference Paper

A survey of longest common subsequence algorithms

Dept. of Comput. Sci., Turku Univ.
DOI: 10.1109/SPIRE.2000.878178 Conference: String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh International Symposium on
Source: IEEE Xplore


The aim of this paper is to give a comprehensive comparison of well-known longest common subsequence algorithms (for two input strings) and to study their behaviour in various application environments. The performance of the methods depends heavily on the properties of the problem instance as well as on the supporting data structures used in the implementation. We also want to make a clear distinction between methods that determine the actual lcs and those that calculate only its length, since the execution time and, more importantly, the space demand depend crucially on the type of the task. To our knowledge, this is the first time this kind of survey has been done. Due to page limits, the paper gives only a coarse overview of the performance of the algorithms; more detailed studies are reported elsewhere.
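
The distinction between computing only the length and recovering the actual lcs can be illustrated with the textbook dynamic programming approach. The following is a minimal sketch, not one of the surveyed implementations; the function names `lcs_length` and `lcs` are hypothetical. The length-only version keeps two rows of the table, while this simple recovery version keeps the whole table so that it can trace back through it.

```python
def lcs_length(a, b):
    """Length only: keeps two DP rows, so extra space is O(min(m, n))."""
    if len(b) > len(a):
        a, b = b, a  # make b the shorter input so the rows stay short
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs(a, b):
    """One actual LCS: this simple version stores the full (m+1) x (n+1) table."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Trace back from the bottom-right corner to collect one LCS.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4
print(lcs("AGGTAB", "GXTXAYB"))         # GTAB
```

Both functions take time proportional to the product of the input lengths; they differ only in how much of the table they retain.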

    • "In these applications, the similarity between two time series (or trajectories) is expressed as the number of matched elements between two sequences. Both the size of the longest common subsequence and the subsequence itself can be obtained using a dynamic programming algorithm in O(n^2) time (Bergroth et al., 2000). "
    ABSTRACT: This paper presents a trajectory clustering method to discover spatial and temporal travel patterns in a traffic network. The study focuses on identifying spatially distinct traffic flow groups using trajectory clustering and investigating temporal traffic patterns of each spatial group. The main contribution of this paper is the development of a systematic framework for clustering and classifying vehicle trajectory data, which does not require a pre-processing step known as map-matching and directly applies to trajectory data without requiring the information on the underlying road network. The framework consists of four steps: similarity measurement, trajectory clustering, generation of cluster representative subsequences, and trajectory classification. First, we propose the use of the Longest Common Subsequence (LCS) between two vehicle trajectories as their similarity measure, assuming that the extent to which vehicles’ routes overlap indicates the level of closeness and relatedness as well as potential interactions between these vehicles. We then extend a density-based clustering algorithm, DBSCAN, to incorporate the LCS-based distance in our trajectory clustering problem. The output of the proposed clustering approach is a few spatially distinct traffic stream clusters, which together provide an informative and succinct representation of major network traffic streams. Next, we introduce the notion of cluster representative subsequence (CRS), which reflects dense road segments shared by trajectories belonging to a given traffic stream cluster, and present the procedure of generating a set of CRSs by merging the pairwise LCSs via hierarchical agglomerative clustering. The CRSs are then used in the trajectory classification step to measure the similarity between a new trajectory and a cluster. The proposed framework is demonstrated using actual vehicle trajectory data collected from New York City, USA. 
    A simple experiment was performed to illustrate the use of the proposed spatial traffic stream clustering in application areas such as network-level traffic flow pattern analysis and travel time reliability analysis.
    Transportation Research Part C Emerging Technologies 12/2015; 9:164-184. DOI:10.1016/j.trpro.2015.07.010 · 2.82 Impact Factor
    • "The most intuitive way of measuring the differences among these sequences is to calculate the number of (mis)matched symbols in terms of symbol positions between two sequences. For example, the longest common subsequence [10] measures the similarity between sequences by finding the common subsequence, and the edit distance [11] evaluates the difference between sequences by counting the required operations to match two sequences. On the other hand, those distribution-based approaches, such as Kullback-Leibler Divergence (DKL) [12], Bhattacharyya distance[13] "
    ABSTRACT: Process monitoring involves tracking a system's behaviors, evaluating the current state of the system, and discovering interesting events that require immediate action. In this paper, we propose a process monitoring approach that helps detect changes in dynamic systems, monitor the divergence of the system development, and evaluate the significance of the deviation. We begin with a discussion of data reduction and symbolic data representation. Time series representation methods are also discussed and used as examples in the proposed approach to discretize the raw data into sequences of system states. Markov chains and stationary state distributions are continuously generated for sequences to represent snapshots of the system dynamics in different time frames. We use the Generalized Jensen-Shannon Divergence as a measure to monitor the changes of the stationary symbol probability distributions and evaluate the significance of the system deviation. We prove that the proposed approach is able to detect the deviation of the systems we monitor and assess the deviation significance in a probabilistic manner.
    Knowledge and Information Systems 06/2015; DOI:10.1007/s10115-015-0858-z · 1.78 Impact Factor
    • "Longest common subsequence distance (see, for example, [2]): Insertions and deletions are allowed. (4) Levenshtein distance [25]: Substitutions, insertions, and deletions are allowed. "
    ABSTRACT: Recently, the amount of string data generated has increased dramatically. Consequently, statistical methods of analysing string data are required in many fields. However, few studies have been conducted of statistical methods for string data based on probability theory. In this study, by developing a theory of parametric statistical inference for string data on the basis of a probability theory on a metric space of strings developed in Koyano [2010], we address the problem of clustering string data in an unsupervised manner. First, we introduce a Laplace-like distribution on a metric space of strings and show its basic properties. We then construct maximum likelihood estimators of location and dispersion parameters of the introduced distribution and examine their asymptotic behavior by applying limit theorems demonstrated in Koyano [2014]. After that, we derive an EM algorithm for the mixture model of the distributions and investigate its accuracy in the framework of statistical asymptotic theory.
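
The two distances contrasted in the excerpt above can be sketched as follows. This is an illustrative sketch only, not code from any of the cited papers; the function names are hypothetical. It relies on the standard identity that, when only insertions and deletions are allowed, the distance between strings of lengths m and n equals m + n minus twice the length of their longest common subsequence.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence, using two DP rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a, b):
    """Insertions and deletions only: every unmatched symbol costs one edit."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def levenshtein(a, b):
    """Substitutions, insertions, and deletions, each at unit cost."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]


print(lcs_distance("kitten", "sitting"))  # 5
print(levenshtein("kitten", "sitting"))   # 3
```

The Levenshtein distance is never larger than the insertion/deletion distance, because a substitution replaces one deletion plus one insertion.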