-
ABSTRACT: We present an algorithm for computing the Lyndon factorization of a string
that is given in grammar compressed form, namely, a Straight Line Program
(SLP). The algorithm runs in $O(n^4 + mn^3h)$ time and $O(n^2)$ space, where
$m$ is the size of the Lyndon factorization, $n$ is the size of the SLP, and
$h$ is the height of the derivation tree of the SLP. Since the length of the
decompressed string can be exponentially large w.r.t. $n, m$ and $h$, our
result is the first polynomial time solution when the string is given as an SLP.
04/2013;
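For orientation, the Lyndon factorization mentioned in the abstract above can be illustrated on an uncompressed string. The sketch below uses Duval's classical linear-time algorithm, not the SLP-based algorithm of the paper; the function name `lyndon_factorization` is chosen here for illustration only.

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: split s into a non-increasing sequence of Lyndon words.

    Runs on the plain (uncompressed) string, so it only serves as a
    reference point for what the compressed-domain algorithm computes.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run of repetitions of a Lyndon word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i        # the candidate Lyndon word grows
            else:
                k += 1       # still inside a repetition of the candidate
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```

Here the number of factors returned corresponds to the quantity $m$ in the abstract.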
-
ABSTRACT: We solve the problems of detecting and counting various forms of regularities
in a string represented as a Straight Line Program (SLP). Given an SLP of size
$n$ that represents a string $s$ of length $N$, our algorithm computes all runs
and squares in $s$ in $O(n^3h)$ time and $O(n^2)$ space, where $h$ is the
height of the derivation tree of the SLP. We also show an algorithm to compute
all gapped-palindromes in $O(n^3h + gnh\log N)$ time and $O(n^2)$ space, where
$g$ is the length of the gap. The key technique of the above solution also
allows us to compute the periods and covers of the string in $O(n^2 h)$ time
and $O(nh(n+\log^2 N))$ time, respectively.
04/2013;
-
ABSTRACT: We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context free grammar in the Chomsky normal form that generates a single
string. Given an SLP of size $n$ representing a text $S$ of length $N$, our
algorithm computes the LZ78 factorization of $S$ in $O(n\sqrt{N}+m\log N)$ time
and $O(n\sqrt{N}+m)$ space, where $m$ is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the $n\sqrt{N}$ term in the
time and space complexities becomes either $nL$, where $L$ is the length of the
longest LZ78 factor, or $(N - \alpha)$ where $\alpha \geq 0$ is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of $S$ of a certain length. Since $m = O(N/\log_\sigma N)$ where
$\sigma$ is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when $\sigma$ is
constant, and can be more efficient when the text is compressible, i.e. when
$m$ and $n$ are small.
07/2012;
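To make the objects in the abstract above concrete, the following sketch represents a small SLP as a dictionary, decompresses it, and computes the LZ78 factorization of the resulting text. This is the naive decompress-then-factorize baseline, not the compressed-domain algorithm the abstract describes; the rule encoding and the names `slp_expand` and `lz78_factorize` are assumptions made for illustration.

```python
def slp_expand(rules: dict, start: str) -> str:
    """Decompress an SLP. A rule maps a variable either to a single
    character (terminal) or to a pair of variables (Chomsky normal form)."""
    memo = {}

    def go(var):
        if var not in memo:
            rhs = rules[var]
            memo[var] = rhs if isinstance(rhs, str) else go(rhs[0]) + go(rhs[1])
        return memo[var]

    return go(start)


def lz78_factorize(text: str) -> list[str]:
    """LZ78: each factor is the longest previous factor plus one new character."""
    trie = {}          # (parent node id, character) -> child node id; root is 0
    factors = []
    node, begin, next_id = 0, 0, 1
    for i, ch in enumerate(text):
        child = trie.get((node, ch))
        if child is None:
            # New factor ends here: add it to the trie and restart at the root.
            trie[(node, ch)] = next_id
            next_id += 1
            factors.append(text[begin:i + 1])
            node, begin = 0, i + 1
        else:
            node = child
    if begin < len(text):
        factors.append(text[begin:])  # trailing factor that repeats an earlier one
    return factors


# An SLP of size 5 generating a string of length 8.
rules = {"X1": "a", "X2": "b", "X3": ("X1", "X2"),
         "X4": ("X3", "X3"), "X5": ("X4", "X4")}
text = slp_expand(rules, "X5")
print(text)                  # abababab
print(lz78_factorize(text))  # ['a', 'b', 'ab', 'aba', 'b']
```

The baseline needs time and space proportional to the decompressed length $N$, which is exactly the cost the compressed-domain approach aims to avoid when $n$ is much smaller than $N$.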
-
ABSTRACT: We propose a new approach for calculating the Lempel-Ziv factorization of a
string, based on run length encoding (RLE). We present a conceptually simple
off-line algorithm based on a variant of suffix arrays, as well as an on-line
algorithm based on a variant of directed acyclic word graphs (DAWGs). Both
algorithms run in $O(N+n\log n)$ time and $O(n)$ extra space, where $N$ is the size
of the string and $n\leq N$ is the number of RLE factors. The time dependency on $N$
is only in the conversion of the string to RLE, which can be computed very
efficiently in $O(N)$ time and $O(1)$ extra space (excluding the output). When the
string is compressible via RLE, i.e., $n = o(N)$, our algorithms are, to the
best of our knowledge, the first algorithms which require only $o(N)$ extra space
while running in $o(N\log N)$ time.
04/2012;
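The conversion to run length encoding that the abstract above relies on can be sketched as a single left-to-right pass. This shows only the RLE preprocessing step, not the Lempel-Ziv factorization itself; the function names are illustrative.

```python
from itertools import groupby


def rle_encode(s: str) -> list[tuple[str, int]]:
    """Return the RLE factors of s as (character, run length) pairs.

    One pass over the N characters; apart from the output list, only a
    counter per run is kept.
    """
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(s)]


def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Inverse of rle_encode."""
    return "".join(ch * length for ch, length in runs)


runs = rle_encode("aaabccdddd")
print(runs)       # [('a', 3), ('b', 1), ('c', 2), ('d', 4)]
print(len(runs))  # 4 RLE factors (n) for a string of length 10 (N)
```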
-
ABSTRACT: We present an efficient algorithm for calculating $q$-gram frequencies on
strings represented in compressed form, namely, as a straight line program
(SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents string $T$, the
algorithm computes the occurrence frequencies of all $q$-grams in $T$, by
reducing the problem to the weighted $q$-gram frequencies problem on a
trie-like structure of size $m = |T|-\mathit{dup}(q,\mathcal{T})$, where
$\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of
redundancy that the SLP captures with respect to $q$-grams. The reduced problem
can be solved in linear time. Since $m = O(qn)$, the running time of our
algorithm is $O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})$, improving our
previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.
02/2012;
-
ABSTRACT: Collage systems are a general framework for representing outputs of various
text compression algorithms. We consider the all $q$-gram frequency problem on
a compressed string represented as a collage system, and present an $O((q+h\log
n)n)$-time $O(qn)$-space algorithm for calculating the frequencies for all
$q$-grams that occur in the string. Here, $n$ and $h$ are respectively the size
and height of the collage system.
07/2011;
-
ABSTRACT: Length-$q$ substrings, or $q$-grams, can represent important characteristics
of text data, and determining the frequencies of all $q$-grams contained in the
data is an important problem with many applications in the field of data mining
and machine learning. In this paper, we consider the problem of calculating the
{\em non-overlapping frequencies} of all $q$-grams in a text given in
compressed form, namely, as a straight line program (SLP). We show that the
problem can be solved in $O(q^2n)$ time and $O(qn)$ space where $n$ is the size
of the SLP. This generalizes and greatly improves previous work (Inenaga &
Bannai, 2009) which solved the problem only for $q=2$ in $O(n^4\log n)$ time
and $O(n^3)$ space.
07/2011;
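The difference between ordinary and non-overlapping $q$-gram frequencies can be illustrated on an uncompressed text. The sketch below is a plain-text baseline under the assumption that the non-overlapping frequency of a $q$-gram is the maximum number of pairwise non-overlapping occurrences, which a greedy leftmost scan attains; it is not the SLP-based algorithm of the paper, and the function names are illustrative.

```python
from collections import Counter


def qgram_frequencies(text: str, q: int) -> Counter:
    """Count every occurrence of every length-q substring."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def nonoverlapping_qgram_frequencies(text: str, q: int) -> Counter:
    """Count, for each q-gram, a maximum set of non-overlapping occurrences.

    Greedy leftmost selection: an occurrence is counted only if it starts
    at or after the end of the last counted occurrence of the same q-gram.
    """
    counts = Counter()
    next_free = {}  # q-gram -> first position not covered by a counted occurrence
    for i in range(len(text) - q + 1):
        gram = text[i:i + q]
        if i >= next_free.get(gram, 0):
            counts[gram] += 1
            next_free[gram] = i + q
    return counts


print(dict(qgram_frequencies("abababa", 3)))                 # {'aba': 3, 'bab': 2}
print(dict(nonoverlapping_qgram_frequencies("abababa", 3)))  # {'aba': 2, 'bab': 1}
```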
-
ABSTRACT: We consider the problem of {\em restructuring} compressed texts without
explicit decompression. We present algorithms which allow conversions from
compressed representations of a string $T$ produced by any grammar-based
compression algorithm, to representations produced by several specific
compression algorithms including LZ77, LZ78, run length encoding, and some
grammar based compression algorithms. These are the first algorithms that
achieve running times polynomial in the size of the compressed input and output
representations of $T$. Since most of the representations we consider can
achieve exponential compression, our algorithms are theoretically faster in the
worst case than any algorithm which first decompresses the string for the
conversion.
07/2011;
-
ABSTRACT: Subsequence pattern matching problems on compressed text were first considered by Cégielski et al. (Window Subsequence Problems for Compressed Texts, Proc. CSR 2006, LNCS 3967, pp. 127–136), where the principal problem is:
given a string $T$ represented as a straight line program (SLP) $\mathcal{T}$ of size $n$ and a string $P$ of size $m$, compute the number of minimal subsequence occurrences of $P$ in $T$. We present an $O(nm)$ time algorithm for solving all variations of the problem introduced by Cégielski et al. This improves the previous best known algorithm of Tiskin (Towards approximate matching in compressed strings: Local subsequence recognition, Proc. CSR 2011), which runs in $O(nm\log m)$ time.
We further show that our algorithms can be modified to solve a wider range of problems in the same $O(nm)$ time complexity, and present the first matching algorithms for patterns containing VLDC (variable length don’t care) symbols,
as well as for patterns containing FLDC (fixed length don’t care) symbols, on SLP compressed texts.
06/2011: pages 309-322;
-
ABSTRACT: We present simple and efficient algorithms for calculating $q$-gram
frequencies on strings represented in compressed form, namely, as a straight
line program (SLP). Given an SLP of size $n$ that represents string $T$, we
present an $O(qn)$ time and space algorithm that computes the occurrence
frequencies of $q$-grams in $T$. Computational experiments show that our
algorithm and its variation are practical for small $q$, actually running
faster on various real string data, compared to algorithms that work on the
uncompressed text. We also discuss applications in data mining and
classification of string data, for which our algorithms can be useful.
03/2011;
-
Combinatorial Pattern Matching - 22nd Annual Symposium, CPM 2011, Palermo, Italy, June 27-29, 2011. Proceedings; 01/2011
-
Combinatorial Pattern Matching - 22nd Annual Symposium, CPM 2011, Palermo, Italy, June 27-29, 2011. Proceedings; 01/2011
-
Theor. Comput. Sci. 01/2011; 412:6959-6981.
-
ABSTRACT: The purpose of this paper is to realize an authentication system which satisfies four requirements for security, privacy protection,
and usability, that is, impersonation resistance against insiders, personalization, unlinkability in multi-service environment, and memory efficiency. The proposed system is the first system which satisfies all the properties. In the proposed system, transactions of a user
within a single service can be linked (personalization), while transactions of a user among distinct services cannot be linked
(unlinkability in multi-service environment). The proposed system can be used with smart cards since the amount of memory
required by the system does not depend on the number of services. First, this paper formalizes the property of unlinkability
in multi-service environment, which has not been formalized in the literature. Next, this paper extends an identification
scheme with a pseudorandom function in order to realize an authentication system which satisfies all the requirements. This
extension can be done with any identification scheme and any pseudorandom function. Finally, this paper shows an implementation
with the Schnorr identification scheme and a collision-free hash function as an example of the proposed systems.
04/2010: pages 236-251;
-
Computational Science and Its Applications - ICCSA 2010, International Conference, Fukuoka, Japan, March 23-26, 2010, Proceedings, Part IV; 01/2010
-
Combinatorial Pattern Matching, 21st Annual Symposium, CPM 2010, New York, NY, USA, June 21-23, 2010. Proceedings; 01/2010
-
Chicago J. Theor. Comput. Sci. 01/2010; 2010.
-
String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings; 01/2010
-
Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2009, Las Vegas, Nevada, USA, July 13-17, 2009, 2 Volumes; 01/2009
-
World Congress on Nature & Biologically Inspired Computing, NaBIC 2009, 9-11 December 2009, Coimbatore, India; 01/2009