Keisuke Goto

Keisuke Goto
LegalOn Technologies

PhD

About

30
Publications
1,427
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
232
Citations
Introduction
Additional affiliations
April 2011 - March 2014
Kyushu University
Position
  • PhD Student

Publications

Publications (30)
Preprint
Full-text available
The suffix trees are fundamental data structures for various kinds of string processing. The suffix tree of a text string $T$ of length $n$ has $O(n)$ nodes and edges, and the string label of each edge is encoded by a pair of positions in $T$. Thus, even after the tree is built, the input string $T$ needs to be kept stored and random access to $T$...
Preprint
Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straight-forward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a strin...
Article
In practical machine learning, models are frequently updated, or corrected, to adapt to new datasets. In this study, we pose two challenges to model correction. First, the effects of corrections to the end-users need to be described explicitly, similar to standard software where the corrections are described as release notes. Second, the amount of...
Article
Full-text available
Re-Pairis a grammar compression scheme with favorably good compression rates. The computation of Re-Pair comes with the cost of maintaining large frequency tables, which makes it hard to compute Re-Pair on large-scale data sets. As a solution for this problem, we present, given a text of length n whose characters are drawn from an integer alphabet...
Preprint
Learning of interpretable classification models has been attracting much attention for the last few years. Discovery of succinct and contrasting patterns that can highlight the differences between the two classes is very important. Such patterns are useful for human experts, and can be used to construct powerful classifiers. In this paper, we consi...
Preprint
Re-Pair is a grammar compression scheme with favorably good compression rates. The computation of Re-Pair comes with the cost of maintaining large frequency tables, which makes it hard to compute Re-Pair on large scale data sets. As a solution for this problem we present, given a text of length $n$ whose characters are drawn from an integer alphabe...
Preprint
Full-text available
Given a string $T$ of length $N$, the goal of grammar compression is to construct a small context-free grammar generating only $T$. Among existing grammar compression methods, RePair (recursive paring) [Larsson and Moffat, 1999] is notable for achieving good compression ratios in practice. Although the original paper already achieved a time-optimal...
Chapter
We propose a new LZ78-style grammar compression algorithm, named LZ-ABT, which is a simple online algorithm to create, given a string of length N over an alphabet of size \(\sigma \), an \(\alpha \)-balanced grammar in \(O(N \log N \log \sigma )\) time and O(n) space in addition to the input string, where n is the grammar size to output. LZ-ABT can...
Preprint
We propose a new generalization of palindromes and gapped palindromes called block palindromes. A block palindrome is a string becomes a palindrome when identical substrings are replaced with a distinct character. We investigate several properties of block palindromes and in particular, study substrings of a string which are block palindromes. In s...
Article
How can we classify multi-way data such as network traffic logs with multi-way relations between source IPs, destination IPs, and ports? Multi-way data can be represented as a tensor, and there have been several studies on classification of tensors to date. One critical issue in the classification of multi-way relations is how to extract important...
Article
When and why can evolutionary multi-objective optimization (EMO) algorithms cover the entire Pareto set? That is a major concern for EMO researchers and practitioners. A recent theoretical study revealed that (roughly speaking) if the Pareto set forms a topological simplex (a curved line, a curved triangle, a curved tetrahedron, etc.), then decompo...
Chapter
We study a new generalization of palindromes and gapped palindromes called block palindromes. A block palindrome is a string that becomes a palindrome when identical substrings are replaced with a distinct character. We investigate several properties of block palindromes and in particular, study substrings of a string which are block palindromes. I...
Article
Initializing all elements of a given array to a specified value is basic operation that frequently appears in numerous algorithms and programs. Initializable arrays are abstract arrays that support initialization as well as reading and writing of any element of the array in constant worst case time. On the word RAM model, we propose an in-place alg...
Conference Paper
In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with \(O(\tilde{e}_T \log n)\) bits of space allowing for \(O(\log n)\)-time random and O(1)-time sequential acces...
Article
Full-text available
In this paper, we propose a novel approach to combine \emph{compact directed acyclic word graphs} (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with $O(\tilde e_T \log n)$ bits of space allowing for $O(\log n)$-time random and $O(1)$-time sequential a...
Article
Suffix arrays and LCP arrays are one of the most fundamental data structures widely used for various kinds of string processing. Many problems can be solved efficiently by using suffix arrays, or a pair of suffix arrays and LCP arrays. In this paper, we consider two problems for a string of length $N$, the characters of which are represented as int...
Article
A closed string is a string with a proper substring that occurs in the string as a prefix and a suffix, but not elsewhere. Closed strings were introduced by Fici (2011) as objects of combinatorial interest in the study of Trapezoidal and Sturmian words. In this paper we present algorithms for computing closed factors (substrings) in strings. First,...
Conference Paper
We propose a new variant of the LZ78 factorization which we call the LZ Double-factor factorization (LZD factorization). Each factor of the LZD factorization of a string is the concatenation of the two longest previous factors, while each factor of the LZ78 factorization is that of the longest previous factor and the following character. Interestin...
Conference Paper
We present a new text indexing structure based on the run length encoding (RLE) of a text string T which, given the RLE of a query pattern P, reports all the occ occurrences of P in T in O(m+occ+log n) time, where n and m are the sizes of the RLEs of T and P, respectively. The data structure requires n(2 logN+log n+log σ)+O(n) bits of space, where...
Conference Paper
We present a new linear time algorithm for computing the Lempel-Ziv Factorization (LZ77) of a given string of length N on an alphabet of size σ, that utilizes only N log N + O(σ log N) bits of working space. When the alphabet size is small, this greatly improves the previous best space requirement for linear time LZ77 factorization (Karkkainen et a...
Article
We present a new algorithm for computing the Lempel-Ziv Factorization (LZ77) of a given string of length $N$ in linear time, that utilizes only $N\log N + O(1)$ bits of working space, i.e., a single integer array, for constant size integer alphabets. This greatly improves the previous best space requirement for linear time LZ77 factorization (K\"ar...
Article
We present a new, simple, and efficient approach for computing the Lempel-Ziv (LZ77) factorization of a string in linear time, based on suffix arrays. Computational experiments on various data sets show that our approach constantly outperforms the currently fastest algorithm LZ OG (Ohlebusch and Gog 2011), and can be up to 2 to 3 times faster in th...
Conference Paper
We present an efficient algorithm for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents string $T$, the algorithm computes the occurrence frequencies of all $q$-grams in $T$, by reducing the problem to the weighted $q$-gram fre...
Article
Collage systems are a general framework for representing outputs of various text compression algorithms. We consider the all $q$-gram frequency problem on compressed string represented as a collage system, and present an $O((q+h\log n)n)$-time $O(qn)$-space algorithm for calculating the frequencies for all $q$-grams that occur in the string. Here,...
Conference Paper
Length-$q$ substrings, or $q$-grams, can represent important characteristics of text data, and determining the frequencies of all $q$-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the {\em non-overlapping frequencies}...
Article
We consider the problem of {\em restructuring} compressed texts without explicit decompression. We present algorithms which allow conversions from compressed representations of a string $T$ produced by any grammar-based compression algorithm, to representations produced by several specific compression algorithms including LZ77, LZ78, run length enc...
Article
We present simple and efficient algorithms for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size $n$ that represents string $T$, we present an $O(qn)$ time and space algorithm that computes the occurrence frequencies of $q$-grams in $T$. Computational experimen...

Network

Cited By