Inge Li Gørtz's research while affiliated with Technical University of Denmark and other places

Publications (73)

Preprint
Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string $S$ is compressed relative to a second string $R$ (called the reference) by parsing $S$ into a sequence of substrings that occur in $R$. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, su...
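The parsing described in this abstract can be illustrated with a naive greedy sketch. It is not the method of the preprint: it runs in quadratic time or worse, the names `rlz_parse` and `rlz_decode` and the `(start_in_r, length)` phrase format are assumptions made for illustration, and it assumes every character of $S$ occurs somewhere in $R$.

```python
def rlz_parse(s: str, r: str) -> list[tuple[int, int]]:
    """Greedily parse s into phrases (start_in_r, length) that occur in r."""
    phrases = []
    i = 0
    while i < len(s):
        best_pos, best_len = -1, 0
        length = 1
        # Extend the phrase one character at a time while it still occurs in r.
        while i + length <= len(s):
            pos = r.find(s[i:i + length])
            if pos == -1:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            # Assumption of this sketch: no literal phrases are supported.
            raise ValueError(f"character {s[i]!r} does not occur in the reference")
        phrases.append((best_pos, best_len))
        i += best_len
    return phrases


def rlz_decode(phrases: list[tuple[int, int]], r: str) -> str:
    """Rebuild the parsed string by copying each phrase out of r."""
    return "".join(r[pos:pos + length] for pos, length in phrases)
```

For example, parsing `"cadabraabra"` relative to `"abracadabra"` yields two phrases, and decoding them against the same reference restores the input.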
Preprint
Full-text available
Let $S$ be a string of length $n$ over an alphabet $\Sigma$ and let $Q$ be a subset of $\Sigma$ of size $q \geq 2$. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer $w$ return the number of length-$w$ substrings of $S$ that contain each character of $Q$ at least once. This is a...
Article
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-k close consec...
Article
Full-text available
Given a regular expression $R$ and a string $Q$ the regular expression matching problem is to determine if $Q$ is a member of the language generated by $R$. The classic textbook algorithm by Thompson [C. ACM 1968] constructs and simulates a non-deterministic finite automaton in $O(nm)$ time and $O(m)$ space, where $n$ and $m$ are the lengths of the...
Preprint
Full-text available
We consider the predecessor problem on the ultra-wide word RAM model of computation, which extends the word RAM model with 'ultrawords' consisting of $w^2$ bits [TAMC, 2015]. The model supports arithmetic and boolean operations on ultrawords, in addition to 'scattered' memory operations that access or modify $w$ (potentially non-contiguous) memory...
Article
We consider the classic partial sums problem on the ultra-wide word RAM model of computation. This model extends the classic w-bit word RAM model with special ultrawords of length w^2 bits that support standard arithmetic and boolean operation and scattered memory access operations that can access w (non-contiguous) locations in memory. The ultra-wi...
Article
Full-text available
We present a compressed representation of tries based on top tree compression [ICALP 2013] that works on a standard, comparison-based, pointer machine model of computation and supports efficient prefix search queries. Namely, we show how to preprocess a set of strings of total length n over an alphabet of size \(\sigma\) into a compressed data stru...
Preprint
Given two strings $S$ and $P$, the Episode Matching problem is to compute the length of the shortest substring of $S$ that contains $P$ as a subsequence. The best known upper bound for this problem is $\tilde O(nm)$ by Das et al. (1997), where $n,m$ are the lengths of $S$ and $P$, respectively. Although the problem is well studied and has many appl...
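The problem definition in this abstract can be made concrete with the textbook dynamic program, which runs in $O(nm)$ time. This is only a sketch of the problem statement, not the contribution of the preprint; the function name `shortest_episode` and the convention of returning `None` when $P$ is not a subsequence of $S$ are assumptions.

```python
from typing import Optional


def shortest_episode(s: str, p: str) -> Optional[int]:
    """Length of the shortest substring of s containing p as a subsequence."""
    m = len(p)
    if m == 0:
        return 0
    # start[j] = latest start of a substring ending at the current position
    # that contains p[0..j] as a subsequence, or -1 if there is none yet.
    start = [-1] * m
    best = None
    for i, c in enumerate(s):
        # Go right to left so start[j - 1] still refers to earlier positions.
        for j in range(m - 1, -1, -1):
            if c == p[j]:
                start[j] = i if j == 0 else start[j - 1]
        if start[m - 1] != -1:
            length = i - start[m - 1] + 1
            if best is None or length < best:
                best = length
    return best
```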
Preprint
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of...
Chapter
We consider the classic partial sums problem on the ultra-wide word RAM model of computation. This model extends the classic w-bit word RAM model with special ultrawords of length w^2 bits that support standard arithmetic and boolean operation and scattered memory access operations that can access w (non-contiguous) locations in memory. The ultra-wide...
Preprint
The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ cl...
Article
We consider the a priori traveling repairman problem, which is a stochastic version of the classic traveling repairman problem. Given a metric (V,d) with a root r∈V, the traveling repairman problem (TRP) involves finding a tour originating from r that minimizes the sum of arrival-times at all vertices. In its a priori version, we are also given ind...
Preprint
We consider compact representations of collections of similar strings that support random access queries. The collection of strings is given by a rooted tree where edges are labeled by an edit operation (inserting, deleting, or replacing a character) and a node represents the string obtained by applying the sequence of edit operations on the path f...
Preprint
We present the first linear time algorithm to construct the $2n$-bit version of the Lyndon array using only $o(n)$ bits of working space. A simpler variant of this algorithm computes the plain ($n\lg n$-bit) version of the Lyndon array using only $\mathcal{O}(1)$ words of additional working space. All previous algorithms are either not linear, or u...
Preprint
Given a string $S$ of length $n$, the classic string indexing problem is to preprocess $S$ into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of t...
Preprint
We consider the classic partial sums problem on the ultra-wide word RAM model of computation. This model extends the classic $w$-bit word RAM model with special ultrawords of length $w^2$ bits that support standard arithmetic and boolean operation and scattered memory access operations that can access $w$ (non-contiguous) locations in memory. The u...
Preprint
We present the first algorithm for regular expression matching that can take advantage of sparsity in the input instance. Our main result is a new algorithm that solves regular expression matching in $O\left(\Delta \log \log \frac{nm}{\Delta} + n + m\right)$ time, where $m$ is the number of positions in the regular expression, $n$ is the length of...
Preprint
We present a compressed representation of tries based on top tree compression [ICALP 2013] that works on a standard, comparison-based, pointer machine model of computation and supports efficient prefix search queries. Namely, we show how to preprocess a set of strings of total length $n$ over an alphabet of size $\sigma$ into a compressed data stru...
Preprint
We consider the a priori traveling repairman problem, which is a stochastic version of the classic traveling repairman problem (also called the traveling deliveryman or minimum latency problem). Given a metric $(V,d)$ with a root $r\in V$, the traveling repairman problem (TRP) involves finding a tour originating from $r$ that minimizes the sum of a...
Preprint
We revisit the mergeable dictionaries with shift problem, where the goal is to maintain a family of sets subject to search, split, merge, make-set, and shift operations. The search, split, and make-set operations are the usual well-known textbook operations. The merge operation merges two sets and the shift operation adds or subtracts an integer fr...
Article
Full-text available
Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. Given a grammar, the random access problem is to compactly represent the grammar while supporting random access, that is, given a position in the ori...
Article
In their ground-breaking paper on grammar-based compression, Charikar et al. (2005) gave a separation between straight-line programs (SLPs) and Lempel–Ziv '77 (LZ77): they described an infinite family of strings such that the size of the smallest SLP generating a string of length n in that family, is an Ω(log n/log log n)-factor larger than the siz...
Preprint
We consider the communication complexity of fundamental longest common prefix (Lcp) problems. In the simplest version, two parties, Alice and Bob, each hold a string, $A$ and $B$, and we want to determine the length of their longest common prefix $l=\text{Lcp}(A,B)$ using as few rounds and bits of communication as possible. We show that if the long...
Article
Full-text available
We consider the problem of decompressing the Lempel-Ziv 77 representation of a string $S\in[\sigma]^n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in optimal $O(n)$ time but requires random access to the whole decompressed text. A better solution is to convert LZ77 into a gramm...
Article
Full-text available
In this paper we give an infinite family of strings for which the length of the Lempel-Ziv'77 parse is a factor $\Omega(\log n/\log\log n)$ smaller than the smallest run-length grammar.
Article
We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of $n$ elements supporting access in time $O(1)$ and insertion and deletion in time $O(n^\epsilon)$ for $\epsilon > 0$ while using $o(n)$ extra space. We consider several different implementation optimizations in C++ and compare their perform...
Article
Given a static reference string R and a source string S, a relative compression of S with respect to R is an encoding of S as a sequence of references to substrings of R. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data sets such as genomes and we...
Conference Paper
We consider compressing labeled, ordered and rooted trees using DAG compression and top tree compression. We show that there exists a family of trees such that the size of the DAG compression is always a logarithmic factor smaller than the size of the top tree compression (even for an alphabet of size 1). The result settles an open problem from Bil...
Article
Given a string $S$, the compressed indexing problem is to preprocess $S$ into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of $S$ while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let $n$, and $z$ denote t...
Conference Paper
Visualizing algorithms, such as drawings, slideshow presentations, animations, videos, and software tools, is a key concept to enhance and support student learning. A typical visualization of an algorithm shows the data and then performs computation on the data. For instance, a standard visualization of a standard binary search on an array shows an a...
Article
Full-text available
Re-Pair is an efficient grammar compressor that operates by recursively replacing high-frequency character pairs with new grammar symbols. The most space-efficient linear-time algorithm computing Re-Pair uses $(1+\epsilon)n+\sqrt n$ words on top of the re-writable text (of length $n$ and stored in $n$ words), for any constant $\epsilon>0$; in pract...
Article
We present a new algorithm for subsequence matching in grammar compressed strings. Given a grammar of size n compressing a string of size N and a pattern string of size m over an alphabet of size σ, our algorithm uses O(n + nσ/w) space and O(n + nσ/w + m log N log w · occ) or O(n + (nσ/w) log w + m log N · occ) time. Here w is the word size and occ is the number of minimal occurrences of the pattern. Our algorithm uses less space...
Article
In this paper we show how to construct a data structure for a string S of size N compressed into a context-free grammar of size n that supports efficient Karp–Rabin fingerprint queries to any substring of S. That is, given indices i and j, the answer to a query is the fingerprint of the substring S[i,j]. We present the first O(n) space data structures that an...
Article
Given a string $S$ of length $n$, the classic string indexing problem is to preprocess $S$ into a compact data structure that supports efficient subsequent pattern queries. In the \emph{deterministic} variant the goal is to solve the string indexing problem without any randomization (at preprocessing time or query time). In the \emph{packed} varian...
Article
Full-text available
Re-Pair (Larsson and Moffat, 2000) is an effective grammar-based compression scheme achieving strong compression rates in practice. Let $n$, $\sigma$, and $d$ be the text length, alphabet size, and dictionary size of the final grammar, respectively. In their original paper, the authors show how to compute the Re-Pair grammar in expected linear time and...
Article
We study a location-routing problem in the context of capacitated vehicle routing. The input to the k-location capacitated vehicle routing problem (k-LocVRP) consists of a set of demand locations in a metric space and a fleet of k identical vehicles, each of capacity Q. The objective is to locate k depots, one for each vehicle, and compute routes f...
Article
In this work, we present efficient algorithms for constructing sparse suffix trees, sparse suffix arrays, and sparse position heaps for b arbitrary positions of a text T of length n while using only O(b) words of space during the construction. Attempts at breaking the naïve bound of Ω(nb) time for constructing sparse suffix trees in O(b) space can...
Conference Paper
Let S be a string of length n with characters from an alphabet of size \(\sigma \). The subsequence automaton of S (often called the directed acyclic subsequence graph) is the minimal deterministic finite automaton accepting all subsequences of S. A straightforward construction shows that the size (number of states and transitions) of the subsequen...
Article
Let $S$ be a string of length $n$ with characters from an alphabet of size $\sigma$. The \emph{subsequence automaton} of $S$ (often called the \emph{directed acyclic subsequence graph}) is the minimal deterministic finite automaton accepting all subsequences of $S$. A straightforward construction shows that the size (number of states and transition...
Article
Full-text available
We consider distance labeling schemes for trees: given a tree with $n$ nodes, label the nodes with binary strings such that, given the labels of any two nodes, one can determine, by looking only at the labels, the distance in the tree between the two nodes. A lower bound by Gavoille et al. (J. Alg. 2004) and an upper bound by Peleg (J. Graph Theor...
Article
Full-text available
Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present new representations of grammars that support efficient finger search style access, random access, and longest common exten...
Article
Full-text available
Given a static reference string $R$ and a source string $S$, a relative compression of $S$ with respect to $R$ is an encoding of $S$ as a sequence of references to substrings of $R$. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data sets such as gen...
Conference Paper
Full-text available
The longest common extension problem (LCE problem) is to construct a data structure for an input string $T$ of length $n$ that supports LCE$(i,j)$ queries. Such a query returns the length of the longest common prefix of the suffixes starting at positions $i$ and $j$ in $T$. This classic problem has a well-known solution that uses $O(n)$ space and $...
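The query defined in this abstract can be sketched with direct character comparison, which needs no preprocessing but takes time proportional to the answer. This is only an illustration of the query semantics, not one of the data structures discussed in the paper; the function name `lce` and the use of 0-indexed positions are assumptions.

```python
def lce(t: str, i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes t[i:] and t[j:]."""
    n = len(t)
    k = 0
    # Compare the two suffixes character by character until they differ
    # or one of them runs out.
    while i + k < n and j + k < n and t[i + k] == t[j + k]:
        k += 1
    return k
```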
Conference Paper
We study the orthogonal range searching problem on points that have a significant number of geometric repetitions, that is, subsets of points that are identical under translation. Such repetitions occur in scenarios such as image compression, GIS applications and in compactly representing sparse matrices and web graphs. Our contribution is twofold....
Article
We show how to compactly index video data to support fast motion detection queries. A query specifies a time interval T, an area A in the video and two thresholds v and p. The answer to a query is a list of timestamps in T where ≥ p% of A has changed by ≥ v values. Our results show that by building a small index, we can support queries with a speedu...
Conference Paper
Full-text available
We present a new algorithm for subsequence matching in grammar compressed strings. Given a grammar of size $n$ compressing a string of size $N$ and a pattern string of size $m$ over an alphabet of size $\sigma$, our algorithm uses $O(n+\frac{n\sigma}{w})$ space and $O(n+\frac{n\sigma}{w}+m\log N\log w\cdot occ)$ or $O(n+\frac{n\sigma}{w}\log w+m\lo...
Conference Paper
Full-text available
The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. In this paper we show how to construct a data structure for a string $S$ of size $N$ compressed by a context-free grammar of size $n$ that answers fingerprint queries. That is, given indices $i$ and $j$, the answ...
Conference Paper
We consider the problem of computing the q-gram profile of a string T of size N compressed by a context-free grammar with n production rules. We present an algorithm that runs in O(N − α) expected time and uses O(n + k_{T,q}) space, where N − α ≤ qn is the exact number of characters decompressed by the algorithm and k_{T,q} ≤ N − α is the number o...
Conference Paper
Full-text available
We consider the problem of constructing a sparse suffix tree (or suffix array) for b suffixes of a given text T of length n, using only O(b) words of space during construction. Attempts at breaking the naive bound of Ω(nb) time for this problem can be traced back to the origins of string indexing in 1968. First results were only obtained in 1996, b...
Conference Paper
Full-text available
The longest common extension (LCE) problem is to preprocess a string in order to allow for a large number of LCE queries, such that the queries are efficient. The LCE value, LCE_s(i,j), is the length of the longest common prefix of the pair of suffixes starting at index i and j in the string s. The LCE problem can be solved in linear space with co...
Conference Paper
We study a location-routing problem in the context of capacitated vehicle routing. The input to k-LocVRP is a set of demand locations in a metric space and a fleet of k vehicles each of capacity Q. The objective is to locate k depots, one for each vehicle, and compute routes for the vehicles so that all demands are satisfied and the total cost is m...
Conference Paper
Full-text available
We revisit various string indexing problems with range reporting features, namely, position-restricted substring searching, indexing substrings with gaps, and indexing substrings with intervals. We obtain the following main results. We give efficient reductions for each of the above problems to a new problem, which we call substring range reportin...
Conference Paper
The capacitated vehicle routing problem (CVRP) [21] involves distributing (identical) items from a depot to a set of demand locations in the shortest possible time, using a single capacitated vehicle. We study a generalization of this problem to the setting of multiple vehicles having non-uniform speeds (that we call Heterogenous CVRP), and present...
Conference Paper
Full-text available
We consider string matching with variable length gaps. Given a string T and a pattern P consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in T that match P. This problem is a basic primitive in computational biology applications. Let...
Conference Paper
Full-text available
An arc-annotated string is a string of characters, called bases, augmented with a set of pairs, called arcs, each connecting two bases. Given arc-annotated strings P and Q the arc-preserving subsequence problem is to determine if P can be obtained from Q by deleting bases from Q. Whenever a base is deleted any arc with an endpoint in that base is a...
Article
Full-text available
We study the approximate string matching and regular expression matching problem for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that leads to algorithms improving the previously known complexities for both problems. In particular, we significantly im...
Conference Paper
Dial-a-Ride problems consist of a set V of n vertices in a metric space (denoting travel time between vertices) and a set of m objects represented as source-destination pairs \(\{(s_i,t_i)\}^m_{i=1}\), where each object requires to be moved from its source to destination vertex. In the multi-vehicle Dial-a-Ride problem, there are q vehicles each ha...
Article
In this paper we give approximation algorithms and inapproximability results for various asymmetric k-center with minimum coverage problems. In the k-center with minimum coverage problem, each center is required to serve a minimum number of clients. These problems have been studied by Lim et al. [Theor. Comput. Sci. 2005] in the symmetric s...
Article
The well-known number partition problem is NP-hard even in the following version: Given a set S of n non-negative integers; partition S into two sets X and Y such that |X|=|Y| and the sum of the elements in X is as close as possible to the sum of the elements in Y (or equivalently, minimize the maximum of the two sums). In this paper we study the f...
Conference Paper
Full-text available
We study the approximate string matching and regular expression matching problem for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that leads to algorithms improving the previously known complexities for both problems. In particular, we significa...
Conference Paper
Given two rooted, labeled trees P and T the tree path subsequence problem is to determine which paths in P are subsequences of which paths in T. Here a path begins at the root and ends at a leaf. In this paper we propose this problem as a useful query primitive for XML data, and provide new algorithms improving the previously best known time and sp...
Conference Paper
In the Finite Capacity Dial-a-Ride problem the input is a metric space, a set of objects {d_i}, each specifying a source s_i and a destination t_i, and an integer k, the capacity of the vehicle used for making the deliveries. The goal is to compute a shortest tour for the vehicle in which all objects can be delivered from their sources to their de...
Conference Paper
Given two rooted, ordered, and labeled trees P and T the tree inclusion problem is to determine if P can be obtained from T by deleting nodes in T. This problem has recently been recognized as an important query primitive in XML databases. Kilpelainen and Mannila (SIAM J. of Comp. 1995) presented the first polynomial time algorithm using quadratic...
Conference Paper
Full-text available
A union-find data structure maintains a collection of disjoint sets under the operations makeset, union, and find. Kaplan, Shafrir, and Tarjan [SODA 2002] designed data structures for an extension of the union-find problem in which items of the sets maintained may be deleted. The cost of a delete operation in their implementations is essentially th...
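The three operations named at the start of this abstract can be sketched with the standard textbook structure using union by rank and path compression. This sketch does not cover the deletion extension that the paper studies; the class name `UnionFind` and the dictionary-based layout are assumptions.

```python
class UnionFind:
    """Textbook union-find supporting makeset, union, and find (no deletions)."""

    def __init__(self) -> None:
        self.parent = {}
        self.rank = {}

    def makeset(self, x) -> None:
        # Create a singleton set {x}; do nothing if x already exists.
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0

    def find(self, x):
        # Locate the root, then point every node on the path directly at it.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y) -> None:
        # Link the root of lower rank below the root of higher rank.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```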
Article
This paper offers a systematic account of techniques to infer strong normalization from weak normalization that make use of syntactic translations from λ-terms to λI-terms. We present variants of such techniques due to Klop, Sørensen, Xi, Gandy, and Loader. We show that all the translations, in some cases via adjustments, are special cases of a gen...
Conference Paper
This paper explores three concepts: the k-center problem, some of its variants, and asymmetry. The k-center problem is a fundamental clustering problem, similar to the k-median problem. Variants of k-center may more accurately model real-life problems than the original formulation. Asymmetry is a significant impediment to approximation in many grap...
Conference Paper
Full-text available
The dispatching problem for object oriented languages is the problem of determining the most specialized method to invoke for calls at run-time. This can be a critical component of execution performance. A number of recent results, including [Muthukrishnan and Müller SODA’96, Ferragina and Muthukrishnan ESA’96, Alstrup et al. FOCS’98], have studied...

Citations

... Another variant is string indexing for consecutive occurrences [9,40]. Here, the goal is to compactly represent the string such that given a pattern P and a gap range [α, β] we can quickly find consecutive occurrences of P with distance in [α, β], i.e., pairs of occurrences immediately following each other and with distance within the range. ...
... , σ} with σ ∈ o(log n / (log log n)^2) then u ∈ o(n log n / (log log n)^2) and Arroyuelo and Raman's space bound is nH_0(S) + o(n). Although there are many other searchable partial-sums data structures (see, e.g., [2,4] and references therein), as far as we know Arroyuelo and Raman's is the first to fit in this space, even for a sequence of sublogarithmic positive integers. In this paper we slightly improve their bound for this special case, to nH_k(S) + o(n) bits for k ∈ o(log n / (log log n)^2), where H_k(S) ≤ H_0(S) is the kth-order empirical entropy of S. ...
... By allowing each node in to be identified by ′ 's hash, we can ensure only one node for each ′ ∈ is inserted into . This method of compressing through repeated subtrees is commonly known as Directed Acyclic Graph (DAG) compression [5], and aims at creating the most minimal representation of tree in the form of a DAG. ...
... Except for componentwise multiplication, all of the above componentwise operations can be implemented in constant time on the restricted UWRAM using standard word-level parallelism techniques [12,23] (see Appendix A for details on blend). For our purposes, we will need componentwise multiplication as an instruction (for evaluating hash functions in parallel) and thus we include this in the instruction set of the UWRAM. ...
... The apriori algorithm is a market basket analysis algorithm used to generate association rules [3], and the apriori algorithm is an advantageous solution for solving a problem [4]. Association rules can be used to find relationships or cause and effect. ...
... Grammar-based compression is a loss-less data compression scheme that represents a string w by an SLP for w. We are aware of more powerful compression schemes such as run-length SLPs [23,35,5], composition systems [18], collage systems [25], NU-systems [34], the Lempel-Ziv 77 family [40,37,11,12], and bidirectional schemes [37]. Nevertheless, since SLPs exhibit simpler structures than those, a number of efficient algorithms that can work directly on SLPs have been proposed, including pattern matching [24,23], convolutions [38], random access [7], detection of repeats and palindromes [21], Lyndon factorizations [22], longest common extension queries [20], longest common substrings [33], finger searches [4], and balancing the grammar [16]. ...
... (i) computational theoretic studies on the space and time needed for matching and parsing; (ii) parsing algorithms with different coverage of syntax trees (total vs. partial); (iii) RE software libraries. Since our focus is on practical and provably correct algorithms, for brevity we only discuss category (ii), with one exception in category (i), i.e., [5], but we recall that we have experimentally found that the BSP parsing speed compares favorably with the popular RE2 library. A representative list of parsing algorithms is in Table 3, where each one is accompanied by a short description, to which we add a few comments. ...
... We are aware of more powerful compression schemes such as run-length SLPs [23,35,5], composition systems [18], collage systems [25], NU-systems [34], the Lempel-Ziv 77 family [40,37,11,12], and bidirectional schemes [37]. Nevertheless, since SLPs exhibit simpler structures than those, a number of efficient algorithms that can work directly on SLPs have been proposed, including pattern matching [24,23], convolutions [38], random access [7], detection of repeats and palindromes [21], Lyndon factorizations [22], longest common extension queries [20], longest common substrings [33], finger searches [4], and balancing the grammar [16]. More examples of algorithms directly working on SLPs can be found in references therein and the survey [30]. ...
... Similar compression ratios are reported in Wikipedia. 4 Despite the obvious practical relevance of these compression methods, there is not a clear entropy measure useful for highly repetitive texts. The number z of phrases generated by the Lempel-Ziv parse [32] is often used as a gold standard, possibly because it can be implemented in linear time [40] and is never larger than g, the size of the smallest context-free grammar that generates the text [41,7]. ...
... Beyond genomics applications, RLZ has also found wider use as a compressor for large text corpora in contexts where random-access support for individual documents is needed [14,32,33,24,19,2] and as a general data compressor [17,16]. In those contexts, S 1 is usually first constructed using substrings sampled from other strings in the collection (Hoobin et al. [14] show that random sampling works well) in a preprocessing phase. ...