A Survey of Longest Common Subsequence Algorithms
L. Bergroth
bergroth@cs.utu.fi
H. Hakonen
hat@cs.utu.fi
T. Raita
raita@cs.utu.fi
Department of Computer Science
University of Turku
20520 Turku
Finland
Abstract
The aim of this paper is to give a comprehensive comparison of well-known longest common subsequence algorithms (for two input strings) and study their behaviour in various application environments. The performance of the methods depends heavily on the properties of the problem instance as well as on the supporting data structures used in the implementation. We also want to make a clear distinction between methods that determine the actual lcs and those calculating only its length, since the execution time and, more importantly, the space demand depend crucially on the type of the task. To our knowledge, this is the first time this kind of survey has been done. Due to the page limits, we can give here only a coarse overview of the performance of the algorithms; more detailed studies are reported elsewhere [11].
1. Introduction
String comparison is a central operation in various environments: a spelling error correction program tries to find the dictionary entry which most resembles a given word; in molecular biology [2, 34] we want to compare two DNA or protein sequences to learn how homologous they are; in a file archive we want to store several versions of a source program compactly by storing only the original version as such, all other versions being constructed from the differences to the previous one; etc. An obvious measure for the closeness of two strings is to find the maximum number of identical symbols in them (preserving the symbol order). This, by definition, is the longest common subsequence of the strings. Formally, we are comparing two input strings, X[1..m] and Y[1..n], which are elements of the set Σ*; here Σ denotes the input alphabet containing σ symbols. A subsequence S[1..s] of X[1..m] is obtained by deleting m − s
symbols from X. A common subsequence (cs) of X[1..m] and Y[1..n], designated cs(X, Y), is a subsequence which occurs in both strings (w.l.o.g. we will assume that m ≤ n in what follows). The longest common subsequence (lcs) of strings X and Y, lcs(X, Y), is a common subsequence of maximal length. The length of lcs(X, Y) is denoted by r(X, Y), or, when the input strings are known, by r. An introduction to the most important lcs methods can be found in books on text algorithms (see e.g. [6, 14, 37]) and on molecular biology (see e.g. [17, 32, 36]).
The lcs problem is a special case of the edit distance problem. The distance between X and Y, Dist(X, Y), is defined as the minimal number of elementary operations needed to transform the source string X into the target string Y. In practical applications, the operations are restricted to insertions, deletions and substitutions applied to single input symbols. An application dependent cost is assigned to each operation, whence the distance is defined as the minimum sum of transformation costs. Normalizing the costs of insertion and deletion to 1 and excluding the substitution operation (which can be accomplished by defining its cost to be ≥ 2), we have in fact the lcs problem: in order to transform X into Y, we first delete m − r symbols from the source to obtain lcs(X, Y). Into this string we insert n − r symbols to form the target. Therefore,

    Dist(X, Y) = n + m − 2·r(X, Y)

For a more detailed discussion of the edit distance, its properties and applications, see [17, 36].
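The relation above is easy to check computationally. The following sketch (not from the paper; the function names are ours) computes r(X, Y) with the standard dynamic programming recurrence and the insertion/deletion-only edit distance with unit costs, then verifies Dist(X, Y) = n + m − 2·r(X, Y) on the example strings used later in this paper:

```python
def lcs_length(x, y):
    """Classic dynamic-programming lcs length, r(X, Y)."""
    m, n = len(x), len(y)
    R = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                R[i][j] = R[i - 1][j - 1] + 1
            else:
                R[i][j] = max(R[i - 1][j], R[i][j - 1])
    return R[m][n]

def indel_distance(x, y):
    """Edit distance restricted to insertions and deletions, each of cost 1."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of x[0..i)
    for j in range(n + 1):
        D[0][j] = j          # insert all of y[0..j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                D[i][j] = 1 + min(D[i - 1][j], D[i][j - 1])
    return D[m][n]

X, Y = "abcdbb", "cbacbaaba"
r = lcs_length(X, Y)
assert indel_distance(X, Y) == len(X) + len(Y) - 2 * r
```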
Another closely related problem is the shortest common supersequence problem (scs), in which we want to find the shortest string containing both X and Y as subsequences. In this case,

    r(X, Y) = m + n − r(scs(X, Y))

where r(scs(X, Y)) denotes the length of the shortest common supersequence. After constructing the supersequence,
0-7695-0746-8/00 $10.00 © 2000 IEEE
we can extract the lcs in a linear scan through the scs. However, if the input consists of more than two strings (N-lcs and N-scs), there is no obvious symmetry between the problems anymore [15].
The lcs problem can also be reduced to two other well-known problems. Lcs(X, Y) is typically solved with the dynamic programming technique by filling an m × n table. The table elements can be regarded as vertices in a graph, and the simple dependencies between the table values define the edges. The task is to find the longest path between the vertices in the upper left and lower right corners of the table. Another reduction can be done to the longest increasing subsequence problem (lis). In the lis problem, we are given a sequence Z of z integers, and our task is to find the longest subsequence of Z which is strictly increasing. The input to the lis problem is obtained from the lcs input as follows: take X[1] and find all positions j in Y for which X[1] = Y[j]. Sort these indices into decreasing order. Build a sorted list for X[2] in a similar way and append it to the one we got for X[1]. Repeat this for all symbols in X. As a result, we have a sequence of integers consisting of decreasing runs ('saw-tooth' form). Solving the lis of this sequence gives the Y-indices of those symbols which belong to the lcs.
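The reduction above can be sketched as follows (our illustration; the helper names are ours, and the lis is computed with the standard patience-style method):

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence of seq."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k+1
    for v in seq:
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
        else:
            tails[k] = v
    return len(tails)

def lcs_via_lis(x, y):
    """Reduce lcs(x, y) to lis as described in the text."""
    positions = {}                      # symbol -> its positions in y, decreasing
    for j, c in enumerate(y, start=1):  # 1-based Y-indices
        positions.setdefault(c, []).insert(0, j)
    seq = []
    for c in x:                         # concatenate one decreasing run per X-symbol
        seq.extend(positions.get(c, []))
    return lis_length(seq)              # saw-tooth sequence; its lis length is r

assert lcs_via_lis("abcdbb", "cbacbaaba") == 4
```

Within one decreasing run at most one element can appear in an increasing subsequence, so each X-symbol contributes at most one match, and the increasing Y-indices guarantee the matched symbols appear in order in both strings.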
The studies on the theoretical complexity of the lcs problem [1, 16, 20, 22, 27, 39] give the lower bound Ω(n²), if the elementary comparison operation is of type 'equal/unequal' and the alphabet size is unrestricted. However, if the input alphabet is fixed, we reach the lower bound Ω(σn). In practice, the underlying encoding scheme for the symbols of the input alphabet implies a total ordering between them, and the <-comparison gives us more information, reducing the lower bound to Ω(n log m). Of the current methods, the O(n²/log n) algorithm of Masek and Paterson [28] is theoretically the fastest known.
Although the time and space complexity of the dynamic programming approach is 'only' quadratic, it tends to be too large for many applications. Due to this, several heuristics [2, 9, 10, 12, 15, 23] as well as exact algorithms handling special inputs [5, 13, 19, 31, 33, 40] have been devised. Often, it is not the time complexity that is crucial; rather, space becomes the limiting resource [17]. The basic dynamic programming method is easily implemented using O(m) space [19] if only the length of the lcs is needed. The same bound can also be reached when the actual sequence has to be found [5, 7, 18, 25], at the cost of roughly doubling the time of the basic algorithm from which the variant was derived (but preserving the same asymptotic bounds). In this survey, the speed of the algorithms is the most important quality, and we have excluded the linear-space variants. Still, it must be emphasized that the performance figures are very different for the algorithms included, depending on whether they search for the lcs or only its length.
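To illustrate the O(m)-space length-only computation mentioned above, here is a minimal two-row sketch (ours, not a variant surveyed in the paper): only the previous row of the R-table is kept, and the rows run over the shorter string so the extra space is proportional to m ≤ n.

```python
def lcs_length_linear_space(x, y):
    """Length-only lcs using O(m) extra space, m = min(len(x), len(y))."""
    if len(x) > len(y):
        x, y = y, x                     # keep the rows proportional to the shorter string
    prev = [0] * (len(x) + 1)           # previous row of the R-table
    for cy in y:
        curr = [0]                      # R[i, 0] = 0
        for i, cx in enumerate(x, start=1):
            curr.append(prev[i - 1] + 1 if cx == cy
                        else max(prev[i], curr[i - 1]))
        prev = curr                     # discard everything but the last row
    return prev[-1]

assert lcs_length_linear_space("abcdbb", "cbacbaaba") == 4
```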
The paper is organized as follows. First, we define some basic concepts in Section 2 and then describe the various approaches to solving the lcs problem in Section 3. After that, we discuss the influence of the data structures chosen for the core of the algorithm, as well as the influence of lower-level decisions such as memory management. The actual comparison is given in Section 5 and the paper is concluded with a discussion.
2. Basic Concepts
The traditional technique for solving lcs(X[1..m], Y[1..n]) is to determine the longest common subsequence for all possible prefix combinations of the input strings. The recurrence relation for extending the length of the lcs for each prefix pair (X[1..i], Y[1..j]) is the following [38]:

    R[i, j] = 0                                if i = 0 or j = 0
    R[i, j] = R[i−1, j−1] + 1                  if X[i] = Y[j]
    R[i, j] = max{R[i−1, j], R[i, j−1]}        if X[i] ≠ Y[j]

where we have used the tabular notation R[i, j] = r(X[1..i], Y[1..j]). Intrinsic to this recurrence is that a single R-value depends only on three neighbouring values in the table, i.e. the calculation is very localized. After the table has been filled, the length is found in R[m, n] = r(X[1..m], Y[1..n]). The common subsequence is found by
backtracking from R[m, n]: at each step, either (a) follow pointers which were set during the calculation of the values, or (b) recalculate the predecessor which yielded the value of the current table entry. Each time a match is found (the middle rule applies), we have found a symbol of the lcs. In this way, we can traverse a path through the table until a length of zero is found. In general, there may be several such paths, because the lcs is not necessarily unique. To find all of them, we traverse systematically through the graph induced by the pointers (using e.g. breadth-first search). As an example, the table corresponding to r(abcdbb, cbacbaaba) is shown on the next page. In this case, r(X, Y) = 4 and the longest common subsequence corresponding to the depicted path is bcbb (acbb is another).
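The pointer-free variant (b) of the backtracking can be sketched as follows (our illustration; which of the equally long subsequences is returned depends on the tie-breaking order):

```python
def lcs_with_backtracking(x, y):
    """Fill the R-table, then backtrack from R[m][n] to recover one lcs."""
    m, n = len(x), len(y)
    R = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            R[i][j] = (R[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                       else max(R[i - 1][j], R[i][j - 1]))
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:          # a match: the middle rule applied here
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif R[i - 1][j] >= R[i][j - 1]:  # recalculate which neighbour gave the value
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))         # symbols were collected back to front

s = lcs_with_backtracking("abcdbb", "cbacbaaba")
assert len(s) == 4                        # r(abcdbb, cbacbaaba) = 4
```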
Let us then introduce some concepts which are needed for the algorithmic descriptions to follow. A pair (i, j) defines a match if X[i] = Y[j]. The set of all matches is

    M = {(i, j) | X[i] = Y[j], 1 ≤ i ≤ m, 1 ≤ j ≤ n}.

Each match belongs to a class C_k = {(i, j) | (i, j) ∈ M and R[i, j] = k}, 1 ≤ k ≤ r. Often, it is also convenient to define a pseudoclass C_0 = {(0, 0)}. A match which belongs to C_k is called a k-match. In the preceding figure, the encircled and the boxed matches having value k define the class C_k. Evidently, each match belongs to exactly one class. Thus, the classes partition all matches of M. Some of the
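The match set M and its partition into classes C_1, ..., C_r can be computed directly from the R-table; a minimal sketch (ours, with 1-based (i, j) pairs as in the definitions above):

```python
def match_classes(x, y):
    """Partition the match set M into classes C_k = {(i, j) in M : R[i, j] = k}."""
    m, n = len(x), len(y)
    R = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            R[i][j] = (R[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                       else max(R[i - 1][j], R[i][j - 1]))
    classes = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:                   # (i, j) is a match
                classes.setdefault(R[i][j], set()).add((i, j))
    return classes

C = match_classes("abcdbb", "cbacbaaba")
assert sorted(C) == [1, 2, 3, 4]    # one nonempty class per k = 1..r, r = 4
```

Since each match (i, j) has exactly one value R[i, j], the classes are disjoint and together cover all of M, as stated above.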
[8] Apostolico, A. & Guerra, C.: The Longest Common Subsequence Problem Revisited, Algorithmica, No. 2, 1987, pp. 315-336
[9] Argos, P., Vingron, M. & Vogt, G.: Protein Sequence Comparison: Methods and Significance, Protein Engineering, Vol. 4, No. 4, 1991, pp. 375-383
[10] Bergroth, L., Hakonen, H. & Raita, T.: New Approximation Algorithms for Longest Common Subsequences, Proc. of SPIRE'98, String Processing and Information Retrieval: A South American Symposium, Santa Cruz de la Sierra, Bolivia, September 9-11, 1998, pp. 32-40
[11] Bergroth, L., Hakonen, H. & Raita, T.: Performance Evaluation of the Longest Common Subsequence Algorithms, in preparation
[12] Chin, F. & Poon, C.K.: Performance Analysis of Some Simple Heuristics for Computing Longest Common Subsequences, Algorithmica, Vol. 12, 1994, pp. 293-311
[13] Chin, F.Y.L. & Poon, C.K.: A Fast Algorithm for Computing Longest Common Subsequences of Small Alphabet Size, J. of Inf. Proc., Vol. 13, No. 4, 1990, pp. 463-469
[14] Crochemore, M. & Rytter, W.: Text Algorithms, Oxford University Press, 1994
[15] Fraser, C.B.: Subsequences and Supersequences of Strings, Ph.D. Thesis, University of Glasgow, 1995
[16] Fredman, M.L.: On Computing the Length of Longest Increasing Subsequences, Disc. Math., Vol. 11, 1975, pp. 29-35
[17] Gusfield, D.: Algorithms on Strings, Trees and Sequences, Cambridge University Press, New York, 1997
[18] Hirschberg, D.S.: A Linear Space Algorithm for Computing Maximal Common Subsequences, Comm. of the ACM, Vol. 18, No. 6, 1975, pp. 341-343
[19] Hirschberg, D.S.: Algorithms for the Longest Common Subsequence Problem, Journal of the ACM, Vol. 24, No. 4, 1977, pp. 664-675
[20] Hirschberg, D.S.: An Information Theoretic Lower Bound for the Longest Common Subsequence Problem, Inf. Proc. Letters, Vol. 7, No. 1, 1978, pp. 40-41
[21] Hsu, W.J. & Du, M.W.: New Algorithms for the LCS Problem, J. Comp. and System Sc., Vol. 29, 1984, pp. 133-152
[22] Hunt, J.W. & Szymanski, T.G.: A Fast Algorithm for Computing Longest Common Subsequences, Comm. of the ACM, Vol. 20, No. 5, 1977, pp. 350-353
[23] Jiang, T. & Li, M.: On the Approximation of Shortest Common Supersequences and Longest Common Subsequences, SIAM J. of Comp., Vol. 24, No. 5, 1995, pp. 1122-1139
[24] Johtela, T., Smed, J., Hakonen, H. & Raita, T.: An Efficient Heuristic for the LCS Problem, Proc. of the Third South American Workshop on String Processing, WSP'96, Recife, Brazil, August 1996, pp. 126-140
[25] Kumar, S.K. & Rangan, C.P.: A Linear-Space Algorithm for the LCS Problem, Acta Informatica, Vol. 24, 1987, pp. 353-362
[26] Kuo, S. & Cross, G.R.: An Improved Algorithm to Find the Length of the Longest Common Subsequence of Two Strings, ACM SIGIR Forum, Vol. 23, No. 3-4, 1989, pp. 89-99
[27] Maier, D.: The Complexity of Some Problems on Subsequences and Supersequences, JACM, Vol. 25, No. 2, 1978, pp. 322-336
[28] Masek, W.J. & Paterson, M.S.: A Faster Algorithm for Computing String-Edit Distances, Journal of Computer and System Sciences, Vol. 20, No. 1, 1980, pp. 18-31
[29] Miller, W. & Myers, E.W.: A File Comparison Program, Softw. Pract. Exp., Vol. 15, No. 11, November 1985, pp. 1025-1040
[30] Mukhopadhyay, A.: A Fast Algorithm for the Longest-Common-Subsequence Problem, Information Sciences, Vol. 20, 1980, pp. 69-82
[31] Myers, E.W.: An O(ND) Difference Algorithm and Its Variations, Algorithmica, Vol. 1, 1986, pp. 251-266
[32] Myers, E.W.: An Overview of Sequence Comparison Algorithms in Molecular Biology, Technical Report TR 91-29, Dept. of CS, The University of Arizona, Tucson, Arizona
[33] Nakatsu, N., Kambayashi, Y. & Yajima, S.: A Longest Common Subsequence Algorithm Suitable for Similar Texts, Acta Informatica, Vol. 18, 1982, pp. 171-179
[34] Pearson, W.R. & Lipman, D.J.: Improved Tools for Biological Sequence Comparison, Proc. Natl. Acad. Sci. USA, Vol. 85, April 1988, pp. 2444-2448
[35] Rick, C.: A New Flexible Algorithm for the Longest Common Subsequence Problem, in Galil, Z. & Ukkonen, E. (eds.): Proc. of Combinatorial Pattern Matching, 6th Annual Symposium, Espoo, Finland, July 1995, pp. 340-351. Appeared also as Lecture Notes in Computer Science, Vol. 937
[36] Sankoff, D. & Kruskal, J.B. (eds.): Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, 1983
[37] Stephen, G.A.: String Searching Algorithms, World Scientific, 1994
[38] Wagner, R.A. & Fischer, M.J.: The String-to-String Correction Problem, Journal of the ACM, Vol. 21, No. 1, January 1974, pp. 168-173
[39] Wong, C.K. & Chandra, A.K.: Bounds for the String Editing Problem, Journal of the ACM, Vol. 23, No. 1, 1976, pp. 13-16
[40] Wu, S., Manber, U., Myers, G. & Miller, W.: An O(NP) Sequence Comparison Algorithm, Inf. Proc. Lett., Vol. 35, September 1990, pp. 317-323