Abstract

In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider three fundamental string properties: square-free factors, periodic factors, and palindromic factors under three different settings, one per property. In the first setting, we are given a string $x$ and we are asked to construct a data structure over $x$ answering the following type of on-line queries: given string $y$, find a longest square-free factor common to $x$ and $y$. In the second setting, we are given $k$ strings and an integer $1 < k'\leq k$ and we are asked to find a longest periodic factor common to at least $k'$ strings. In the third setting, we are given two strings and we are asked to find a longest palindromic factor common to the two strings. We present linear-time solutions for all settings. We anticipate that our paradigm can be extended to other string properties or settings.
Longest Property-Preserved Common Factor
Lorraine A. K. Ayad$^1$, Giulia Bernardini$^2$, Roberto Grossi$^3$, Costas S. Iliopoulos$^4$, Nadia Pisanti$^5$, Solon P. Pissis$^6$, and Giovanna Rosone$^7$
$^1$ Department of Informatics, King's College London, London, UK, lorraine.ayad@kcl.ac.uk
$^2$ Department of Informatics, Systems and Communication (DISCo), University of Milan-Bicocca, Italy, giulia.bernardini@unimib.it
$^3$ Department of Computer Science, University of Pisa, Italy and ERABLE Team, INRIA, France, grossi@di.unipi.it
$^4$ Department of Informatics, King's College London, London, UK, c.iliopoulos@kcl.ac.uk
$^5$ Department of Computer Science, University of Pisa, Italy and ERABLE Team, INRIA, France, pisanti@di.unipi.it
$^6$ Department of Informatics, King's College London, London, UK, solon.pissis@kcl.ac.uk
$^7$ Department of Computer Science, University of Pisa, Italy, giovanna.rosone@unipi.it
1 Introduction
In the longest common factor problem, also known as the longest common substring problem, we are given two strings $x$ and $y$, each of length at most $n$, and we are asked to find a maximal-length string occurring in both $x$ and $y$. This is a classical and well-studied problem in computer science arising out of different practical scenarios. It can be solved in $O(n)$ time and space [10, 18] (see also [21, 26]). Recently, the same problem has been extensively studied under distance metrics; that is, the sought factors (one from $x$ and one from $y$) must be at distance at most $k$ and have maximal length [8, 28, 27, 2, 25, 24] (and references therein).
In this paper we initiate a new related line of research. We are given two or more strings and our goal is to compute a factor common to all strings that preserves a specific property and has maximal length. An analogous line of research was introduced in [11]. It focuses on computing a subsequence (rather than a factor) common to all strings that preserves a specific property and has maximal length. Specifically, in [11, 3, 19], the authors considered computing a longest common palindromic subsequence and in [20] computing a longest common square subsequence.
arXiv:1810.02099v1 [cs.DS] 4 Oct 2018
We consider three fundamental string properties: square-free factors, periodic factors, and palindromic factors [23] under three different settings, one per property. In the first setting, we are given a string $x$ and we are asked to construct a data structure over $x$ answering the following type of on-line queries: given string $y$, find a longest square-free factor common to $x$ and $y$. In the second setting, we are given $k$ strings and an integer $1 < k' \le k$ and we are asked to find a longest periodic factor common to at least $k'$ strings. In the third setting, we are given two strings and we are asked to find a longest palindromic factor common to the two strings. We present linear-time solutions for all settings. We anticipate that our paradigm can be extended to other string properties or settings.
1.1 Definitions and Notation
An alphabet $\Sigma$ is a non-empty finite ordered set of letters of size $\sigma = |\Sigma|$. In this work we consider that $\sigma = O(1)$ or that $\Sigma$ is a linearly-sortable integer alphabet. A string $x$ on an alphabet $\Sigma$ is a sequence of elements of $\Sigma$. The set of all strings on an alphabet $\Sigma$, including the empty string $\varepsilon$ of length 0, is denoted by $\Sigma^*$. For any string $x$, we denote by $x[i..j]$ the substring (sometimes called factor) of $x$ that starts at position $i$ and ends at position $j$. In particular, $x[0..j]$ is the prefix of $x$ that ends at position $j$, and $x[i..|x|-1]$ is the suffix of $x$ that starts at position $i$, where $|x|$ denotes the length of $x$. A string $uu$, $u \in \Sigma^+$, is called a square. A square-free string is a string that does not contain a square as a factor.
A period of $x[0..|x|-1]$ is a positive integer $p$ such that $x[i] = x[i+p]$ holds for all $0 \le i < |x| - p$. The smallest period of $x$ is denoted by $\mathrm{per}(x)$. String $u$ is called periodic if and only if $\mathrm{per}(u) \le |u|/2$. A run of string $x$ is an interval $[i, j]$ such that for the smallest period $p = \mathrm{per}(x[i..j])$ it holds that $2p \le j - i + 1$ and the periodicity cannot be extended to the left or right, i.e., $i = 0$ or $x[i-1] \ne x[i+p-1]$, and, $j = |x| - 1$ or $x[j-p+1] \ne x[j+1]$.
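To make the definitions of period, periodic string, and run concrete, the following brute-force Python sketch evaluates them directly. It is an illustration under our own naming (`smallest_period`, `is_periodic`, and `runs` are hypothetical helper names, not part of the paper), and its running time is far from the linear-time algorithms of [22, 4].

```python
def smallest_period(s):
    """Smallest p >= 1 such that s[i] == s[i + p] for all 0 <= i < len(s) - p."""
    n = len(s)
    for p in range(1, n + 1):
        if all(s[i] == s[i + p] for i in range(n - p)):
            return p
    return 0  # empty string: no positive period exists


def is_periodic(u):
    """u is periodic iff per(u) <= |u| / 2."""
    return len(u) > 0 and 2 * smallest_period(u) <= len(u)


def runs(x):
    """All runs of x as sorted triples (i, j, p), found by brute force."""
    n = len(x)
    found = set()
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if x[i] != x[i + p]:
                i += 1
                continue
            # Extend to the maximal stretch of positions t with x[t] == x[t + p].
            j = i
            while j + p < n and x[j] == x[j + p]:
                j += 1
            lo, hi = i, j + p - 1  # x[lo..hi] has period p and cannot be extended
            # Keep it only if it is long enough and p is its smallest period.
            if hi - lo + 1 >= 2 * p and smallest_period(x[lo:hi + 1]) == p:
                found.add((lo, hi, p))
            i = j + 1
    return sorted(found)


print(runs("ababbabba"))
```

Each triple `(i, j, p)` corresponds to a run $[i, j]$ with smallest period $p$; filtering on the smallest period avoids reporting the same interval under a multiple of its period.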
We denote the reversal of $x$ by string $x^R$, i.e. $x^R = x[|x|-1]\,x[|x|-2] \ldots x[0]$. A string $p$ is said to be a palindrome if and only if $p = p^R$. If factor $x[i..j]$, $0 \le i \le j \le n-1$, of string $x$ of length $n$ is a palindrome, then $\frac{i+j}{2}$ is the center of $x[i..j]$ in $x$ and $\frac{j-i+1}{2}$ is the radius of $x[i..j]$. In other words, a palindrome is a string that reads the same forward and backward, i.e. a string $p$ is a palindrome if $p = y a y^R$ where $y$ is a string, $y^R$ is the reversal of $y$ and $a$ is either a single letter or the empty string. Moreover, $x[i..j]$ is called a palindromic factor of $x$. It is said to be a maximal palindrome if there is no other palindrome in $x$ with center $\frac{i+j}{2}$ and larger radius. Hence $x$ has exactly $2n-1$ maximal palindromes. A maximal palindrome $p$ of $x$ can be encoded as a pair $(c, r)$, where $c$ is the center of $p$ in $x$ and $r$ is the radius of $p$.
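The $(c, r)$ encoding and the count of $2n-1$ maximal palindromes can be checked with a small Python sketch that expands around every center. This is a quadratic illustration, not a linear-time method; the function name `maximal_palindromes` is our own, and we assume that a center between two distinct letters contributes an empty palindrome of radius 0.

```python
def maximal_palindromes(x):
    """Return the 2n - 1 maximal palindromes of x as (center, radius) pairs."""
    n = len(x)
    out = []
    for c2 in range(2 * n - 1):  # c2 = i + j is twice the center
        # Start from a single letter (even c2) or an empty factor (odd c2).
        lo, hi = (c2 + 1) // 2, c2 // 2
        while lo - 1 >= 0 and hi + 1 < n and x[lo - 1] == x[hi + 1]:
            lo -= 1
            hi += 1
        out.append((c2 / 2, (hi - lo + 1) / 2))
    return out


for center, radius in maximal_palindromes("ababaa"):
    print(center, radius)
```

Integer centers correspond to odd-length palindromes and half-integer centers to even-length ones.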
1.2 Algorithmic Toolbox
The maximum number of runs in a string of length $n$ is less than $n$ [4], and, moreover, all runs can be computed in $O(n)$ time [22, 4].
The suffix tree $\mathsf{ST}(x)$ of a non-empty string $x$ of length $n$ is a compact trie representing all suffixes of $x$. $\mathsf{ST}(x)$ can be constructed in $O(n)$ time [14]. We can analogously define and construct the generalised suffix tree $\mathsf{GST}(x_0, x_1, \ldots, x_{k-1})$ for a set of $k$ strings. We assume the reader is familiar with these data structures.
The matching statistics capture all matches between two strings $x$ and $y$ [7]. More formally, the matching statistics of a string $y[0..|y|-1]$ with respect to a string $x$ is an array $\mathsf{MS}_y[0..|y|-1]$, where $\mathsf{MS}_y[i]$ is a pair $(\ell_i, p_i)$ such that (i) $y[i..i+\ell_i-1]$ is the longest prefix of $y[i..|y|-1]$ that is a factor of $x$; and (ii) $x[p_i..p_i+\ell_i-1] = y[i..i+\ell_i-1]$. Matching statistics can be computed in $O(|y|)$ time for $\sigma = O(1)$ by using $\mathsf{ST}(x)$ [18, 6, 16].
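As a reference point for the definition, the matching statistics can be computed naively by trying every starting position of $x$. The sketch below does not use a suffix tree and takes $O(|x| \cdot |y| \cdot \ell)$ time; the function name `matching_statistics`, the choice of the leftmost occurrence in $x$, and the use of $-1$ as the position when nothing matches are our assumptions.

```python
def matching_statistics(x, y):
    """MS_y[i] = (l_i, p_i): longest prefix of y[i:] occurring in x, and where."""
    ms = []
    for i in range(len(y)):
        best_len, best_pos = 0, -1  # -1 marks "no letter of y[i:] occurs in x"
        for p in range(len(x)):
            l = 0
            while p + l < len(x) and i + l < len(y) and x[p + l] == y[i + l]:
                l += 1
            if l > best_len:  # strict comparison keeps the leftmost occurrence
                best_len, best_pos = l, p
        ms.append((best_len, best_pos))
    return ms


print(matching_statistics("aababaababb", "babababbaaab"))
```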
Given a rooted tree $T$ with $n$ leaves coloured from $0$ to $k-1$, $1 < k \le n$, the colour set size problem is finding, for each internal node $u$ of $T$, the number of different leaf colours in the subtree rooted at $u$. In [10], the authors present an $O(n)$-time solution to this problem.
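The colour set size problem can be illustrated with a naive bottom-up union of colour sets. This sketch is not the $O(n)$-time method of [10]; it may take $O(nk)$ time. The tree representation (a `children` dictionary and a `leaf_colour` dictionary) and the function name are assumptions made for the example.

```python
def colour_set_sizes(children, leaf_colour, root):
    """Map each internal node to the number of distinct leaf colours below it."""
    sizes = {}

    def visit(u):
        kids = children.get(u, [])
        if not kids:  # u is a leaf
            return {leaf_colour[u]}
        colours = set()
        for v in kids:
            colours |= visit(v)
        sizes[u] = len(colours)
        return colours

    visit(root)
    return sizes


# Hypothetical tree: root -> u, v; u -> l0, l1; v -> l2, l3.
children = {"root": ["u", "v"], "u": ["l0", "l1"], "v": ["l2", "l3"]}
leaf_colour = {"l0": 0, "l1": 0, "l2": 1, "l3": 0}
print(colour_set_sizes(children, leaf_colour, "root"))
```

In the generalised suffix tree the colour of a leaf is the index of the string its suffix belongs to, so a node with colour set size at least $k'$ spells a factor common to at least $k'$ strings.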
In the weighted ancestor problem, introduced in [15], we consider a rooted tree $T$ with an integer weight function $\mu$ defined on the nodes. We require that the weight of the root is zero and the weight of any other node is strictly larger than the weight of its parent. A weighted ancestor query, given a node $v$ and an integer value $\ell \le \mu(v)$, asks for the highest ancestor $u$ of $v$ such that $\mu(u) \ge \ell$, i.e., such an ancestor $u$ that $\mu(u) \ge \ell$ and $\mu(u)$ is the smallest possible. When $T$ is the suffix tree of a string $x$ of length $n$, we can locate the locus of any factor $x[i..j]$ of $x$ using a weighted ancestor query. We define the weight of a node of the suffix tree as the length of the string it represents. Thus a weighted ancestor query can be used for the terminal node corresponding to $x[i..n-1]$ to create (if necessary) and mark the node that corresponds to $x[i..j]$. Given a collection $Q$ of weighted ancestor queries on a weighted tree $T$ on $n$ nodes with integer weights up to $n^{O(1)}$, all the queries in $Q$ can be answered off-line in $O(n + |Q|)$ time [5].
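The semantics of a single weighted ancestor query can be shown by simply walking up the tree. This sketch takes time proportional to the depth of $v$ per query and is not the off-line technique of [5]; the `parent` and `weight` dictionaries and the function name are assumptions of the example.

```python
def weighted_ancestor(parent, weight, v, ell):
    """Highest ancestor u of v (possibly v itself) with weight[u] >= ell."""
    if ell > weight[v]:
        raise ValueError("query requires ell <= weight[v]")
    u = v
    # Weights strictly decrease towards the root, so stop at the first
    # parent whose weight drops below ell.
    while parent[u] is not None and weight[parent[u]] >= ell:
        u = parent[u]
    return u


# Hypothetical path root(0) -> a(2) -> b(5) -> c(9).
parent = {"root": None, "a": "root", "b": "a", "c": "b"}
weight = {"root": 0, "a": 2, "b": 5, "c": 9}
print(weighted_ancestor(parent, weight, "c", 3))
```

In a suffix tree, if the returned node has weight strictly larger than $\ell$, the locus of the queried factor lies inside the edge entering that node, which is why the text speaks of creating the node if necessary.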
2 Square-Free-Preserved Matching Statistics
In this section, we introduce the square-free-preserved matching statistics problem and provide a linear-time solution. In the square-free-preserved matching statistics problem we are given a string $x$ of length $n$ and we are asked to construct a data structure over $x$ answering the following type of on-line queries: given string $y$, find the longest square-free prefix of $y[i..|y|-1]$ that is a factor of $x$, for all $0 \le i \le |y|-1$. (For related work see [12].) We represent the answer using an integer array $\mathsf{SQMS}_y[0..|y|-1]$ of lengths, but we can trivially modify our algorithm to report the actual factors. It should be clear that a maximum element in $\mathsf{SQMS}_y$ gives the length of some longest square-free factor common to $x$ and $y$.
Construction. Our data structure over string $x$ consists of the following:
• An integer array $L_x[0..n-1]$, where $L_x[i]$ stores the length of the longest square-free factor starting at position $i$ of string $x$.
• The suffix tree $\mathsf{ST}(x)$ of string $x$.
The idea for constructing array $L_x$ efficiently is based on the following crucial observation.
Observation 1. If $x[i..n-1]$ contains a square then $L_x[i] + 1$, for all $0 \le i < n$, is the length of the shortest prefix of $x[i..n-1]$ (factor $f$) containing a square. In fact, the square is a suffix of $f$, otherwise $f$ would not have been the shortest. If $x[i..n-1]$ does not contain a square then $L_x[i] = n - i$.
We thus shift our focus to computing the shortest such prefixes. We start by considering the runs of $x$. Specifically, we consider squares in $x$ observing that a run $[\ell, r]$ with period $p$ contains $r - \ell - 2p + 2$ squares of length $2p$ with the leftmost one starting at position $\ell$. Let $r' = \ell + 2p - 1$ denote the ending position of the leftmost such square of the run. In order to find, for all $i$'s, the shortest prefix of $x[i..n-1]$ containing a square $s$, and thus compute $L_x[i]$, we have two cases:
1. $s$ is part of a run $[\ell, r]$ in $x$ that starts after $i$. In particular, $s = x[\ell..r']$ such that $r' \le r$, $\ell > i$, and $r'$ is minimal. In this case the shortest factor has length $\ell + 2p - i$; we store this value in an integer array $C[0..n-1]$. If no run starts after position $i$ we set $C[i] = \infty$. To compute $C$, after computing in $O(n)$ time all the runs of $x$ with their $p$ and $r'$ [22, 4], we sort them by $r'$. A right-to-left scan after this sorting associates to $i$ the closest $r'$ with $\ell > i$.
2. $s$ is part of a run $[\ell, r]$ in $x$ and $i \in [\ell, r]$. This implies that if $i \le r - 2p + 1$ then a square starts at $i$ and we store the length of the shortest such square in an integer array $S[0..n-1]$. If no square starts at position $i$ we set $S[i] = \infty$. Array $S$ can be constructed in $O(n)$ time by applying the algorithm of [13].
Since we do not know which of the two cases holds, we compute both $C$ and $S$. By Observation 1, if $C[i] = S[i] = \infty$ ($x[i..n-1]$ does not contain a square) we set $L_x[i] = n - i$; otherwise ($x[i..n-1]$ contains a square) we set $L_x[i] = \min\{C[i], S[i]\} - 1$.
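The interplay of the arrays $C$, $S$, and $L_x$ can be reproduced with a brute-force Python sketch. It follows the two cases above but replaces the linear-time tools of [22, 4] and [13] with direct enumeration, so it is only suitable for small inputs. The names `build_arrays` and `smallest_period` are ours; the leftmost square of each run is identified, by assumption, as a square $x[\ell..\ell+2p-1]$ whose smallest period is $p$ and that cannot be shifted one position to the left.

```python
import math


def smallest_period(s):
    """Smallest p >= 1 with s[i] == s[i + p] for all valid i (s non-empty)."""
    n = len(s)
    return next(p for p in range(1, n + 1)
                if all(s[i] == s[i + p] for i in range(n - p)))


def build_arrays(x):
    """Return (C, S, L_x) for string x, with math.inf standing for infinity."""
    n = len(x)
    # Case 2: S[i] is the length of the shortest square starting exactly at i.
    S = [math.inf] * n
    for i in range(n):
        for q in range(1, (n - i) // 2 + 1):
            if x[i:i + q] == x[i + q:i + 2 * q]:
                S[i] = 2 * q
                break
    # Leftmost square (ell, r') of every run: period p, not shiftable to the left.
    leftmost = []
    for ell in range(n):
        for p in range(1, (n - ell) // 2 + 1):
            sq = x[ell:ell + 2 * p]
            if (sq[:p] == sq[p:] and smallest_period(sq) == p
                    and (ell == 0 or x[ell - 1] != x[ell + p - 1])):
                leftmost.append((ell, ell + 2 * p - 1))
    # Case 1: C[i] = r' - i + 1 for the smallest r' over runs with ell > i.
    C = [math.inf] * n
    for i in range(n):
        ends = [r1 for ell, r1 in leftmost if ell > i]
        if ends:
            C[i] = min(ends) - i + 1
    # Combine both cases as in Observation 1.
    L = [n - i if min(C[i], S[i]) == math.inf else min(C[i], S[i]) - 1
         for i in range(n)]
    return C, S, L


C, S, L = build_arrays("aababaababb")
print(C, S, L, sep="\n")
```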
Finally, we build the suffix tree $\mathsf{ST}(x)$ of string $x$ in $O(n)$ time [14]. This completes our construction.
Querying. We rely on the following fact for answering the queries efficiently.
Fact 1. Every factor of a square-free string is square-free.
Let string $y$ be an on-line query. Using $\mathsf{ST}(x)$, we compute the matching statistics $\mathsf{MS}_y$ of $y$ with respect to $x$. For each $j \in [0, |y|-1]$, $\mathsf{MS}_y[j] = (\ell_i, i)$ indicates that $x[i..i+\ell_i-1] = y[j..j+\ell_i-1]$. This computation can be done in $O(|y|)$ time [18, 6]. By applying Fact 1, we can answer any query $y$ in $O(|y|)$ time for $\sigma = O(1)$ by setting $\mathsf{SQMS}_y[j] = \min\{\ell_i, L_x[i]\}$, for all $0 \le j \le |y|-1$.
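The query step is a single pass that combines the matching statistics with $L_x$. In the sketch below, $L_x$ is computed straight from its definition (repeatedly extending a factor while it stays square-free, which is valid by Fact 1), and the matching statistics are passed in as a list of $(\ell, p)$ pairs, here copied from the values of Example 1. The function names are our own, and the quadratic-or-worse running time of the helper is not representative of the data structure described above.

```python
def has_square(s):
    """True iff s contains a factor of the form uu with u non-empty."""
    n = len(s)
    return any(s[i:i + q] == s[i + q:i + 2 * q]
               for i in range(n) for q in range(1, (n - i) // 2 + 1))


def longest_square_free_from(x):
    """L_x[i]: length of the longest square-free factor of x starting at i."""
    n = len(x)
    L = []
    for i in range(n):
        l = 1  # a single letter is always square-free
        while i + l < n and not has_square(x[i:i + l + 1]):
            l += 1
        L.append(l)
    return L


def sqms(L, ms):
    """SQMS_y[j] = min(l, L_x[p]) for every matching-statistics pair (l, p)."""
    return [min(l, L[p]) if l > 0 else 0 for l, p in ms]


x = "aababaababb"
ms_y = [(4, 2), (5, 1), (4, 2), (5, 6), (4, 7), (3, 8),
        (2, 9), (3, 4), (2, 0), (3, 0), (2, 1), (1, 2)]
answer = sqms(longest_square_free_from(x), ms_y)
print(answer, max(answer))
```

The maximum of the returned array is the length of a longest square-free factor common to $x$ and $y$.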
We arrive at the following result.
Theorem 1. Given a string $x$ of length $n$ over an alphabet of size $\sigma = O(1)$, we can construct a data structure of size $O(n)$ in time $O(n)$, answering $\mathsf{SQMS}_y$ on-line queries in $O(|y|)$ time.
Proof. The time complexity of our algorithm follows from the above discussion.
We next show the correctness of our algorithm. Let us first show the correctness of computing array $L_x$. The square contained in the shortest prefix of $x[i..n-1]$ (containing a square) starts by definition either at $i$ or after $i$. If it starts at $i$ this is correctly computed by the algorithm of [13] which assigns the length of the shortest such square in $S[i]$. If it starts after $i$ it must be the leftmost square of another run by the runs definition. $C[i]$ stores the length of the shortest prefix containing such a square. Then by Observation 1, $L_x[i]$ is computed correctly.
It suffices to show that, if $w$ is the longest square-free substring common to $x$ and $y$ occurring at position $i_x$ in $x$ and at position $i_y$ in $y$, then (i) $\mathsf{MS}_y[i_y] = (\ell, i_x)$ with $\ell \ge |w|$ and $x[i_x..i_x+\ell-1] = y[i_y..i_y+\ell-1]$; (ii) $w$ is a prefix of $x[i_x..i_x+L_x[i_x]-1]$; and (iii) $\mathsf{SQMS}_y[i_y] = |w|$. Case (i) directly follows from the correctness of the matching statistics algorithm. For Case (ii), since $w$ occurs at $i_x$ and $w$ is square-free, $L_x[i_x] \ge |w|$. For Case (iii), since $w$ is square-free we have to show that $|w| = \min\{\ell, L_x[i_x]\}$. We know from (i) that $\ell \ge |w|$ and from (ii) that $L_x[i_x] \ge |w|$. If $\min\{\ell, L_x[i_x]\} = \ell$, then $w$ cannot be extended because the possibly longer than $|w|$ square-free string occurring at $i_x$ does not occur in $y$, and in this case $|w| = \ell$. Otherwise, if $\min\{\ell, L_x[i_x]\} = L_x[i_x]$ then $w$ cannot be extended because it is no longer square-free, and in this case $|w| = L_x[i_x]$. Hence we conclude that $\mathsf{SQMS}_y[i_y] = |w|$. The statement follows.
The following example provides a complete overview of the workings of our algorithm.
Example 1. Let $x = \texttt{aababaababb}$ and $y = \texttt{babababbaaab}$. The length of a longest common square-free factor is 3, and the factors are $\texttt{bab}$ and $\texttt{aba}$.
i         0  1  2  3  4  5  6  7  8  9  10
x[i]      a  a  b  a  b  a  a  b  a  b  b
C[i]      5  6  5  4  3  5  5  4  3  ∞  ∞
S[i]      2  4  4  6  ∞  2  4  ∞  ∞  2  ∞
L_x[i]    1  3  3  3  2  1  3  3  2  1  1

j          0      1      2      3      4      5      6      7      8      9      10     11
y[j]       b      a      b      a      b      a      b      b      a      a      a      b
MS_y[j]    (4,2)  (5,1)  (4,2)  (5,6)  (4,7)  (3,8)  (2,9)  (3,4)  (2,0)  (3,0)  (2,1)  (1,2)
SQMS_y[j]  3      3      3      3      3      2      1      2      1      1      2      1
3 Longest Periodic-Preserved Common Factor
In this section, we introduce the longest periodic-preserved common factor problem and provide a linear-time solution. In the longest periodic-preserved common factor problem, we are given $k \ge 2$ strings $x_0, x_1, \ldots, x_{k-1}$ of total length $N$ and an integer $1 < k' \le k$, and we are asked to find a longest periodic factor common to at least $k'$ strings. In what follows we present two different algorithms to solve this problem. We represent the answer $\mathsf{LPCF}_{k'}$ by the length of a longest factor, but we can trivially modify our algorithms to report an actual factor. Our first algorithm, denoted by lPcf, works as follows.
1. Compute the runs of string $x_j$, for all $0 \le j < k$.
2. Construct the generalised suffix tree $\mathsf{GST}(x_0, x_1, \ldots, x_{k-1})$ of $x_0, x_1, \ldots, x_{k-1}$.
3. For each string $x_j$ and for each run $[\ell, r]$ with period $p_\ell$ of $x_j$, augment GST with the explicit node spelling $x_j[\ell..r]$, decorate it with $p_\ell$, and mark it as a candidate node. This can be done as follows: for each run $[\ell, r]$ of $x_j$, for all $0 \le j < k$, find the leaf corresponding to $x_j[\ell..|x_j|-1]$ and answer the weighted ancestor query in GST with weight $r - \ell + 1$. Moreover, mark as candidates all explicit nodes spelling a prefix of length $d$ of any run $[\ell, r]$ with $2p_\ell \le d$.
4. Mark as good the nodes of the tree having at least $k'$ different colours on the leaves of the subtree rooted there. Let aGST be this augmented tree.
5. Return as $\mathsf{LPCF}_{k'}$ the string depth of a candidate node in aGST which is also a good node, and that has maximal string depth (if any, otherwise return 0).
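When experimenting with an implementation of the steps above, a brute-force reference is useful for cross-checking its output on small inputs. The Python sketch below is such a checker and not algorithm lPcf itself: it enumerates every factor, keeps the periodic ones, and counts in how many strings each occurs. The function names are ours and the running time is polynomial of high degree.

```python
def smallest_period(s):
    """Smallest p >= 1 with s[i] == s[i + p] for all valid i (s non-empty)."""
    n = len(s)
    return next(p for p in range(1, n + 1)
                if all(s[i] == s[i + p] for i in range(n - p)))


def is_periodic(u):
    """u is periodic iff per(u) <= |u| / 2."""
    return len(u) >= 2 and 2 * smallest_period(u) <= len(u)


def lpcf_bruteforce(strings, k_prime):
    """Length of a longest periodic factor common to at least k_prime strings."""
    count = {}
    for s in strings:
        # Distinct factors of length >= 2, so each string is counted once.
        factors = {s[i:j] for i in range(len(s)) for j in range(i + 2, len(s) + 1)}
        for f in factors:
            if is_periodic(f):
                count[f] = count.get(f, 0) + 1
    return max((len(f) for f, c in count.items() if c >= k_prime), default=0)


print(lpcf_bruteforce(["ababbabba", "ababaab"], 2))
```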
Theorem 2. Given $k$ strings of total length $N$ on alphabet $\Sigma = \{1, \ldots, N^{O(1)}\}$, and an integer $1 < k' \le k$, algorithm lPcf returns $\mathsf{LPCF}_{k'}$ in time $O(N)$.
Proof. Let us assume wlog that $k' = k$, and let $w$ with period $p$ be the longest periodic factor common to all strings. By the construction of aGST (Steps 1-4), the path spelling $w$ leads to a good node $n_w$ as $w$ occurs in all the strings. We make the following observation.
Observation 2. Each periodic factor with period $p$ of string $x$ is a factor of $x[i..j]$, where $[i, j]$ is a run with period $p$.
By Observation 2, in all strings, $w$ is included in a run having the same period. Observe that for at least one of the strings, there is a run ending with $w$, otherwise we could extend $w$ obtaining a longer periodic common factor (similarly, for at least one of the strings, there is a run starting with $w$). Therefore $n_w$ is both a good and a candidate node. By definition, $n_w$ is at string depth at least $2p$ and, by construction, $\mathsf{LPCF}_{k'}$ is the string depth of a deepest such node; thus $|w|$ will be returned by Step 5.
Figure 1: aGST for $x = \texttt{ababbabba}$, $y = \texttt{ababaab}$, and $k = k' = 2$.
As for the time complexity, Step 1 [22, 4] and Step 2 [14] can be done in $O(N)$ time. Since the total number of runs is less than $N$ [4], Step 3 can be done in $O(N)$ time using off-line weighted ancestor queries [5] to mark the runs as candidate nodes; and then a post-order traversal to mark their ancestor explicit nodes as candidates, if their string-depth is at least $2p_\ell$ for any run $[\ell, r]$ with period $p_\ell$. The size of the aGST is still in $O(N)$. Step 4 can be done in $O(N)$ time [10]. Step 5 can be done in $O(N)$ by a post-order traversal of aGST.
The following example provides a complete overview of the workings of our algorithm.
Example 2. Consider $x = \texttt{ababbabba}$, $y = \texttt{ababaab}$, and $k = k' = 2$. The runs of $x$ are: $r_0 = [0, 3]$, $\mathrm{per}(\texttt{abab}) = 2$, $r_1 = [1, 8]$, $\mathrm{per}(\texttt{babbabba}) = 3$, $r_2 = [3, 4]$, $\mathrm{per}(\texttt{bb}) = 1$, and $r_3 = [6, 7]$, $\mathrm{per}(\texttt{bb}) = 1$; those of $y$ are $r_4 = [0, 4]$, $\mathrm{per}(\texttt{ababa}) = 2$ and $r_5 = [4, 5]$, $\mathrm{per}(\texttt{aa}) = 1$. Fig. 1 shows aGST for $x$, $y$, and $k = k' = 2$. Algorithm lPcf outputs $4 = |\texttt{abab}|$, with $\mathrm{per}(\texttt{abab}) = 2$, as the node spelling $\texttt{abab}$ is the deepest good one that is also a candidate.
We next present a second algorithm to solve this problem with the same time complexity but
without the use of off-line weighted ancestor queries. The algorithm works as follows.
1. Compute the runs of string $x_j$, for all $0 \le j < k$.
2. Construct the generalised suffix tree $\mathsf{GST}(x_0, x_1, \ldots, x_{k-1})$ of $x_0, x_1, \ldots, x_{k-1}$.
3. Mark as good the nodes of GST having at least $k'$ different colours on the leaves of the subtree rooted there.
4. Compute and store, for every leaf node, the nearest ancestor that is good.
5. For each string $x_j$ and for each run $[\ell, r]$ with period $p_\ell$ of $x_j$, check the nearest good ancestor for the leaf corresponding to $x_j[\ell..|x_j|-1]$. Let $d$ be the string-depth of the nearest good ancestor. Then:
(a) If $r - \ell + 1 \le d$, the entire run is also good.
(b) If $r - \ell + 1 > d$, check if $2p_\ell \le d$, and if so the string for the good ancestor is periodic.
6. Return as $\mathsf{LPCF}_{k'}$ the maximal string depth found in Step 5 (if any, otherwise return 0).
Figure 2: GST for $x = \texttt{ababaa}$, $y = \texttt{bababb}$, and $k = k' = 2$. Good nodes are marked red.
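The logic of Steps 5 and 6 can be sketched in Python without building a suffix tree. The assumption made here is that the string depth $d$ of the nearest good ancestor of the leaf for $x_j[\ell..|x_j|-1]$ equals the length of the longest prefix of that suffix occurring in at least $k'$ of the strings, which the sketch finds by plain substring search. The helper names (`runs`, `occurs_in`, `lpcf_via_runs`) are ours, and the running time is far from linear.

```python
def smallest_period(s):
    """Smallest p >= 1 with s[i] == s[i + p] for all valid i (s non-empty)."""
    n = len(s)
    return next(p for p in range(1, n + 1)
                if all(s[i] == s[i + p] for i in range(n - p)))


def runs(x):
    """All runs of x as triples (ell, r, p), by brute force."""
    n, found = len(x), set()
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if x[i] != x[i + p]:
                i += 1
                continue
            j = i
            while j + p < n and x[j] == x[j + p]:
                j += 1
            if j - i >= p and smallest_period(x[i:j + p]) == p:
                found.add((i, j + p - 1, p))
            i = j + 1
    return sorted(found)


def occurs_in(w, strings):
    """Number of strings that contain w as a factor."""
    return sum(w in s for s in strings)


def lpcf_via_runs(strings, k_prime):
    best = 0
    for s in strings:
        for ell, r, p in runs(s):
            suffix = s[ell:]
            # d plays the role of the string depth of the nearest good ancestor.
            d = 0
            while d < len(suffix) and occurs_in(suffix[:d + 1], strings) >= k_prime:
                d += 1
            run_len = r - ell + 1
            if run_len <= d:      # Step 5(a): the entire run is good
                best = max(best, run_len)
            elif 2 * p <= d:      # Step 5(b): the good prefix is still periodic
                best = max(best, d)
    return best


print(lpcf_via_runs(["ababaa", "bababb"], 2))
```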
Let us analyse this algorithm. Let us assume wlog that $k' = k$, and let $w$ with period $p$ be the longest periodic factor common to all strings. By the construction of GST (Steps 1-3), the path spelling $w$ leads to a good node $n_w$ as $w$ occurs in all the strings.
By Observation 2, in all strings, $w$ is included in a run having the same period. Observe that for at least one of the strings, there is a run starting with $w$, otherwise we could extend $w$ obtaining a longer periodic common factor. So the algorithm should check, for each run, if there is a periodic-preserved common prefix of the run and take the longest such prefix. $\mathsf{LPCF}_{k'}$ is the string depth of a deepest good node spelling a periodic factor; thus $|w|$ will be returned by Step 6.
As for the time complexity, Step 1 [22, 4] and Step 2 [14] can be done in $O(N)$ time. Step 3 can be done in $O(N)$ time [10] and Step 4 can be done in $O(N)$ time by using a tree traversal. Since the total number of runs is less than $N$ [4], Step 5 can be done in $O(N)$ time. We thus arrive at Theorem 2 with a different algorithm.
The following example provides a complete overview of the workings of our algorithm.
Example 3. Consider $x = \texttt{ababaa}$, $y = \texttt{bababb}$, and $k = k' = 2$. The runs of $x$ are: $r_0 = [0, 4]$, $\mathrm{per}(\texttt{ababa}) = 2$, $r_1 = [4, 5]$, $\mathrm{per}(\texttt{aa}) = 1$; those of $y$ are $r_2 = [0, 4]$, $\mathrm{per}(\texttt{babab}) = 2$ and $r_3 = [4, 5]$, $\mathrm{per}(\texttt{bb}) = 1$. Fig. 2 shows GST for $x$, $y$, and $k = k' = 2$. Consider the run $r_0 = [0, 4]$. The nearest good node of the leaf spelling $x[0..|x|-1]$ is the node spelling $\texttt{abab}$. We have that $r - \ell + 1 = 5 > d = 4$, and $2p = 4 \le d = 4$. The algorithm outputs $4 = |\texttt{abab}|$ as $\texttt{abab}$ is a longest periodic-preserved common factor. Another longest periodic-preserved common factor is $\texttt{baba}$.
4 Longest Palindromic-Preserved Common Factor
In this section, we introduce the longest palindromic-preserved common factor problem and provide a linear-time solution. In the longest palindromic-preserved common factor problem, we are given two strings $x$ and $y$, and we are asked to find a longest palindromic factor common to the two strings. (For related work in a dynamic setting see [17, 1].) We represent the answer $\mathsf{LPALCF}$ by the length of a longest factor, but we can trivially modify our algorithm to report an actual factor. Our algorithm is denoted by lPalcf. In the description below, for clarity, we consider odd-length palindromes only. (Even-length palindromes can be handled in an analogous manner.)
1. Compute the maximal odd-length palindromes of $x$ and the maximal odd-length palindromes of $y$.
2. Collect the factors $x[i..i']$ of $x$ (resp. the factors $y[j..j']$ of $y$) such that $i$ ($j$) is the center of an odd-length maximal palindrome of $x$ ($y$) and $i'$ ($j'$) is the ending position of the odd-length maximal palindrome centered at $i$ ($j$).
3. Create a lexicographically sorted list $L$ of these strings from $x$ and $y$.
4. Compute the longest common prefix of consecutive entries (strings) in $L$.
5. Let $\ell$ be the maximal length of longest common prefixes between any string from $x$ and any string from $y$. For odd lengths, return $\mathsf{LPALCF} = 2\ell - 1$.
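The five steps for odd-length palindromes translate almost line by line into Python. In the sketch below, the maximal palindromes are found by naive center expansion and the list is sorted with the built-in comparison sort, so the running time is not linear; the linear-time bound relies on the tools cited in the analysis that follows. The function names and the convention of returning 0 when $\ell = 0$ (no common letter) are our assumptions.

```python
def half_palindromes(s, tag):
    """Step 2: factors s[c..c'] for each center c and end c' of its maximal
    odd-length palindrome, tagged with the string they come from."""
    out = []
    for c in range(len(s)):
        r = 0
        while c - r - 1 >= 0 and c + r + 1 < len(s) and s[c - r - 1] == s[c + r + 1]:
            r += 1
        out.append((s[c:c + r + 1], tag))
    return out


def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def lpalcf_odd(x, y):
    """Length of a longest odd-length palindromic factor common to x and y."""
    # Step 3: lexicographically sorted list of the collected factors.
    entries = sorted(half_palindromes(x, "x") + half_palindromes(y, "y"))
    # Steps 4-5: the best x/y pair is always witnessed by two adjacent entries
    # of different origin, so only consecutive entries need to be compared.
    ell = max((lcp(a, b) for (a, ta), (b, tb) in zip(entries, entries[1:])
               if ta != tb), default=0)
    return 2 * ell - 1 if ell > 0 else 0


print(lpalcf_odd("ababaa", "bababb"))
```

Comparing only consecutive entries suffices because, in a sorted list, every entry lying between two strings shares their common prefix.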
Theorem 3. Given two strings $x$ and $y$ on alphabet $\Sigma = \{1, \ldots, (|x| + |y|)^{O(1)}\}$, algorithm lPalcf returns $\mathsf{LPALCF}$ in time $O(|x| + |y|)$.
Proof. The correctness of our algorithm follows directly from the following observation.
Observation 3. Any longest palindromic-preserved common factor is a factor of a maximal palindrome of $x$ with the same center and a factor of a maximal palindrome of $y$ with the same center.
Step 1 can be done in $O(|x| + |y|)$ time [18]. Step 2 can be done in $O(|x| + |y|)$ time by going through the set of maximal palindromes computed in Step 1. Step 3 and Step 4 can be done in $O(|x| + |y|)$ time by constructing the data structure of [9]. Step 5 can be done in $O(|x| + |y|)$ time by going through the list of computed longest common prefixes.
The following example provides a complete overview of the workings of our algorithm.
Example 4. Consider $x = \texttt{ababaa}$ and $y = \texttt{bababb}$. In Step 1 we compute all maximal palindromes of $x$ and $y$. Considering odd-length palindromes gives the following factors (Step 2) from $x$: $x[0..0] = \texttt{a}$, $x[1..2] = \texttt{ba}$, $x[2..4] = \texttt{aba}$, $x[3..4] = \texttt{ba}$, $x[4..4] = \texttt{a}$, and $x[5..5] = \texttt{a}$. The analogous factors from $y$ are: $y[0..0] = \texttt{b}$, $y[1..2] = \texttt{ab}$, $y[2..4] = \texttt{bab}$, $y[3..4] = \texttt{ab}$, $y[4..4] = \texttt{b}$, and $y[5..5] = \texttt{b}$. We sort these strings lexicographically and compute the longest common prefix information (Steps 3-4). We find that $\ell = 2$: the maximal longest common prefixes are $\texttt{ba}$ and $\texttt{ab}$, denoting that $\texttt{aba}$ and $\texttt{bab}$ are the longest palindromic-preserved common factors of odd length. In fact, algorithm lPalcf outputs $2\ell - 1 = 3$ as $\texttt{aba}$ and $\texttt{bab}$ are the longest palindromic-preserved common factors of any length.
5 Final Remarks
In this paper, we introduced a new family of string processing problems. The goal is to compute
factors common to a set of strings preserving a specific property and having maximal length. We
showed linear-time algorithms for square-free, periodic, and palindromic factors under three different
settings. We anticipate that our paradigm can be extended to other string properties or settings.
Acknowledgements
We would like to acknowledge an anonymous reviewer of a previous version of this paper who suggested the second linear-time algorithm for computing the longest periodic-preserved common factor.
Solon P. Pissis and Giovanna Rosone are partially supported by the Royal Society project IE 161274
“Processing uncertain sequences: combinatorics and applications”. Giovanna Rosone and Nadia
Pisanti are partially supported by the project Italian MIUR-SIR CMACBioSeq (“Combinatorial
methods for analysis and compression of biological sequences”) grant n. RBSI146R5L.
References
[1] Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Longest common factor made fully dynamic. CoRR, abs/1804.08731, 2018.
[2] Lorraine A. K. Ayad, Carl Barton, Panagiotis Charalampopoulos, Costas S. Iliopoulos, and Solon P. Pissis. Longest common prefixes with k-errors and applications. In SPIRE, volume 11147 of LNCS, pages 27–41. Springer, 2018.
[3] Sang Won Bae and Inbok Lee. On finding a longest common palindromic subsequence. Theoretical Computer Science, 710:29–34, 2018. Advances in Algorithms & Combinatorics on Strings (Honoring 60th birthday for Prof. Costas S. Iliopoulos).
[4] Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "runs" theorem. SIAM Journal on Computing, 46(5):1501–1514, 2017.
[5] Carl Barton, Tomasz Kociumaka, Chang Liu, Solon P. Pissis, and Jakub Radoszewski. Indexing weighted sequences: Neat and efficient. CoRR, abs/1704.07625, 2017.
[6] Djamal Belazzougui and Fabio Cunial. Indexed matching statistics and shortest unique substrings. In Edleno Silva de Moura and Maxime Crochemore, editors, 21st International Symposium on String Processing and Information Retrieval (SPIRE), volume 8799 of LNCS, pages 179–190, 2014.
[7] W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4):327–344, 1994.
[8] Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Linear-time algorithm for long LCF with k mismatches. In CPM, volume 105 of LIPIcs, pages 23:1–23:16. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.
[9] Panagiotis Charalampopoulos, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Property suffix array with applications. In Michael A. Bender, Martin Farach-Colton, and Miguel A. Mosteiro, editors, LATIN 2018: Theoretical Informatics - 13th Latin American Symposium, Buenos Aires, Argentina, April 16-19, 2018, Proceedings, volume 10807 of Lecture Notes in Computer Science, pages 290–302. Springer, 2018.
[10] Lucas Chi Kwong Hui. Color set size problem with applications to string matching. In Combinatorial Pattern Matching, pages 230–243. Springer Berlin Heidelberg, 1992.
[11] Shihabur Rahman Chowdhury, Md. Mahbubul Hasan, Sumaiya Iqbal, and M. Sohel Rahman. Computing a longest common palindromic subsequence. Fundam. Inf., 129(4):329–340, 2014.
[12] Marius Dumitran, Florin Manea, and Dirk Nowotka. On prefix/suffix-square free words. In Costas S. Iliopoulos, Simon J. Puglisi, and Emine Yilmaz, editors, 22nd International Symposium on String Processing and Information Retrieval (SPIRE), volume 9309 of LNCS, pages 54–66, 2015.
[13] Jean-Pierre Duval, Roman Kolpakov, Gregory Kucherov, Thierry Lecroq, and Arnaud Lefebvre. Linear-time computation of local periods. Theoretical Computer Science, 326(1):229–240, 2004.
[14] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science (FOCS), pages 137–143, 1997.
[15] Martin Farach and S. Muthukrishnan. Perfect hashing for strings: Formalization and algorithms. In 7th Symposium on Combinatorial Pattern Matching (CPM), pages 130–140, 1996.
[16] Maria Federico and Nadia Pisanti. Suffix tree characterization of maximal motifs in biological sequences. Theor. Comput. Sci., 410(43):4391–4401, 2009.
[17] Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Longest substring palindrome after edit. In Gonzalo Navarro, David Sankoff, and Binhai Zhu, editors, Annual Symposium on Combinatorial Pattern Matching (CPM 2018), volume 105 of Leibniz International Proceedings in Informatics (LIPIcs), pages 12:1–12:14, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
[18] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
[19] Shunsuke Inenaga and Heikki Hyyrö. A hardness result and new algorithm for the longest common palindromic subsequence problem. Information Processing Letters, 129:11–15, 2018.
[20] Takafumi Inoue, Shunsuke Inenaga, Heikki Hyyrö, Hideo Bannai, and Masayuki Takeda. Computing longest common square subsequences. In 29th Symposium on Combinatorial Pattern Matching (CPM), volume 105 of LIPIcs, pages 15:1–15:13, 2018.
[21] Tomasz Kociumaka, Tatiana A. Starikovskaya, and Hjalte Wedel Vildhøj. Sublinear space algorithms for the longest common substring problem. In Algorithms - ESA 2014 - 22th Annual European Symposium, Wroclaw, Poland, September 8-10, 2014. Proceedings, pages 605–617, 2014.
[22] Roman Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Symposium on Foundations of Comp Science, pages 596–604, 1999.
[23] M. Lothaire. Applied Combinatorics on Words. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2005.
[24]
Pierre Peterlongo, Nadia Pisanti, Fed´eric Boyer, Alair Pereira do Lago, and Marie-France Sagot.
Lossless filter for multiple repetitions with hamming distance. J. Discr. Alg., 6(3):497–509,
2008.
[25]
Pierre Peterlongo, Nadia Pisanti, Fed´eric Boyer, and Marie-France Sagot. Lossless filter for
finding long multiple approximate repetitions using a new data structure, the bi-factor array. In
12th International Symposium String Processing and Information Retrieval, 12th International
Conference (SPIRE), pages 179–190, 2005.
[26]
Tatiana A. Starikovskaya and Hjalte Wedel Vildhøj. Time-space trade-offs for the longest
common substring problem. In 24th Symposium on Combinatorial Pattern Matching (CPM),
pages 223–234, 2013.
[27]
Sharma V. Thankachan, Chaitanya Aluru, Sriram P. Chockalingam, and Srinivas Aluru.
Algorithmic framework for approximate matching under bounded edits with applications to
sequence analysis. In RECOMB, volume 10812 of LNCS, pages 211–224, 2018.
[28]
Sharma V. Thankachan, Alberto Apostolico, and Srinivas Aluru. A provably efficient algorithm
for the k-mismatch average common substring problem. Journal of Computational Biology,
23(6):472–482, 2016.