Benjamin Sach

The University of Warwick, Coventry, England, United Kingdom

Publications (20) · 0.48 Total impact

  • Source
    ABSTRACT: We give cell-probe bounds for the computation of edit distance, Hamming distance, convolution and longest common subsequence in a stream. In this model, a fixed string of $n$ symbols is given and one $\delta$-bit symbol arrives at a time in a stream. After each symbol arrives, the distance between the fixed string and a suffix of most recent symbols of the stream is reported. The cell-probe model is perhaps the strongest model of computation for showing data structure lower bounds, subsuming in particular the popular word-RAM model.
      * We first give an $\Omega((\delta \log n)/(w+\log\log n))$ lower bound for the time to give each output for both online Hamming distance and convolution, where $w$ is the word size. This bound relies on a new encoding scheme and for the first time holds even when $w$ is as small as a single bit.
      * We then consider the online edit distance and longest common subsequence problems in the bit-probe model ($w=1$) with a constant sized input alphabet. We give a lower bound of $\Omega(\sqrt{\log n}/(\log\log n)^{3/2})$ which applies for both problems. This second set of results relies both on our new encoding scheme as well as a carefully constructed hard distribution.
      * Finally, for the online edit distance problem we show that there is an $O((\log n)^2/w)$ upper bound in the cell-probe model. This bound gives a contrast to our new lower bound and also establishes an exponential gap between the known cell-probe and RAM model complexities.
    07/2014;
  • Source
    ABSTRACT: The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. In this paper we show how to construct a data structure for a string $S$ of size $N$ compressed by a context-free grammar of size $n$ that answers fingerprint queries. That is, given indices $i$ and $j$, the answer to a query is the fingerprint of the substring $S[i,j]$. We present the first $O(n)$ space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get $O(\log N)$ query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get $O(\log \log N)$ query time. Hence, our data structures have the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time $O(\log N \log \ell)$ and $O(\log \ell \log\log \ell + \log\log N)$ for SLPs and Linear SLPs, respectively. Here, $\ell$ denotes the length of the LCE.
    05/2013;
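To make the notion of a fingerprint query concrete, the following is a minimal sketch on an uncompressed string, not the compressed-grammar data structure of the paper. It assumes a polynomial Karp-Rabin hash; the names `MOD`, `BASE`, `prefix_fingerprints` and `fingerprint` are hypothetical, and `BASE` is fixed here only for reproducibility (in practice it would be drawn at random).

```python
MOD = (1 << 61) - 1   # a large prime modulus (assumption for this sketch)
BASE = 256            # fixed for reproducibility; normally chosen at random


def prefix_fingerprints(s: str):
    """Return (prefix, powers): prefix[k] is the fingerprint of s[:k]."""
    prefix, powers = [0], [1]
    for ch in s:
        prefix.append((prefix[-1] * BASE + ord(ch)) % MOD)
        powers.append((powers[-1] * BASE) % MOD)
    return prefix, powers


def fingerprint(prefix, powers, i: int, j: int) -> int:
    """Fingerprint of s[i..j] (0-indexed, inclusive) from prefix fingerprints."""
    return (prefix[j + 1] - prefix[i] * powers[j + 1 - i]) % MOD


prefix, powers = prefix_fingerprints("abracadabra")
# Both occurrences of "abra" receive the same fingerprint.
same = fingerprint(prefix, powers, 0, 3) == fingerprint(prefix, powers, 7, 10)
```

This uncompressed version uses space linear in the length of the string, which is exactly what the compressed structures described in the abstract avoid.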
  • Source
    ABSTRACT: We revisit the longest common extension (LCE) problem, that is, preprocess a string $T$ into a compact data structure that supports fast LCE queries. An LCE query takes a pair $(i,j)$ of indices in $T$ and returns the length of the longest common prefix of the suffixes of $T$ starting at positions $i$ and $j$. We study the time-space trade-offs for the problem, that is, the space used for the data structure vs. the worst-case time for answering an LCE query. Let $n$ be the length of $T$. Given a parameter $\tau$, $1 \leq \tau \leq n$, we show how to achieve either $O(n/\sqrt{\tau})$ space and $O(\tau)$ query time, or $O(n/\tau)$ space and $O(\tau \log(|\mathrm{LCE}(i,j)|/\tau))$ query time, where $|\mathrm{LCE}(i,j)|$ denotes the length of the LCE returned by the query. These bounds provide the first smooth trade-offs for the LCE problem and almost match the previously known bounds at the extremes when $\tau=1$ or $\tau=n$. We apply the result to obtain improved bounds for several applications where the LCE problem is the computational bottleneck, including approximate string matching and computing palindromes. We also present an efficient technique to reduce LCE queries on two strings to one string. Finally, we give a lower bound on the time-space product for LCE data structures in the non-uniform cell probe model showing that our second trade-off is nearly optimal.
    11/2012;
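As an illustration of what an LCE query returns, here is a naive baseline that compares the two suffixes character by character. It is only a sketch of the query definition, using no preprocessing and up to linear time per query; it is not one of the trade-off data structures from the paper. The function name `lce` and the 0-indexed positions are assumptions of this example.

```python
def lce(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:] (0-indexed)."""
    n = len(text)
    length = 0
    # Extend the match while both positions are in range and the characters agree.
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


# Suffixes "abcabcab" and "abcab" share the prefix "abcab".
result = lce("abcabcab", 0, 3)
```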
  • Source
    ABSTRACT: We show tight bounds for online Hamming distance computation in the cell-probe model with word size w. The task is to output the Hamming distance between a fixed string of length n and the last n symbols of a stream. We give a lower bound of Omega((d/w)*log n) time on average per output, where d is the number of bits needed to represent an input symbol. We argue that this bound is tight within the model. The lower bound holds under randomisation and amortisation.
    07/2012;
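The online task described above can be sketched as a straightforward sliding-window computation. This is only an illustration of the problem statement, spending time proportional to the pattern length per output; it says nothing about the cell-probe bounds themselves. The name `online_hamming` and the convention of yielding `None` until a full window has arrived are assumptions of this example.

```python
from collections import deque
from typing import Iterable, Iterator, Optional


def online_hamming(pattern: str, stream: Iterable[str]) -> Iterator[Optional[int]]:
    """After each arriving symbol, yield the Hamming distance between the
    pattern and the last len(pattern) symbols (None until the window is full)."""
    n = len(pattern)
    window = deque(maxlen=n)  # keeps only the n most recent symbols
    for symbol in stream:
        window.append(symbol)
        if len(window) < n:
            yield None
        else:
            yield sum(p != s for p, s in zip(pattern, window))


outputs = list(online_hamming("ab", "aabb"))
```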
  • Source
    ABSTRACT: We consider the problem of constructing a sparse suffix tree (or suffix array) for $b$ suffixes of a given text $T$ of size $n$, using only $O(b)$ words of space during construction time. Breaking the naive bound of $\Omega(nb)$ time for this problem has occupied many algorithmic researchers since a different structure, the (evenly spaced) sparse suffix tree, was introduced by Kärkkäinen and Ukkonen in 1996. While in the evenly spaced sparse suffix tree the suffixes considered must be evenly spaced in $T$, here there is no constraint on the locations of the suffixes. We show that the sparse suffix tree can be constructed in $O(n\log^2 b)$ time. To achieve this we develop a technique, which may be of independent interest, that allows one to efficiently answer $b$ longest common prefix queries on suffixes of $T$, using only $O(b)$ space. We expect that this technique will prove useful in many other applications in which space usage is a concern. Furthermore, additional tradeoffs between the space usage and the construction time are given.
    07/2012;
  • Source
    ABSTRACT: We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching.
    02/2012;
  • Source
    ABSTRACT: We study the complexity of the popular one player combinatorial game known as Flood-It. In this game the player is given an n×n board of tiles where each tile is allocated one of c colours. The goal is to make the colours of all tiles equal via the shortest possible sequence of flooding operations. In the standard version, a flooding operation consists of the player choosing a colour k, which then changes the colour of all the tiles in the monochromatic region connected to the top left tile to k. After this operation has been performed, neighbouring regions which are already of the chosen colour k will then also become connected, thereby extending the monochromatic region of the board. We show that finding the minimum number of flooding operations is NP-hard for c≥3 and that this even holds when the player can perform flooding operations from any position on the board. However, we show that this ‘free’ variant is in P for c=2. We also prove that for an unbounded number of colours, Flood-It remains NP-hard for boards of height at least 3, but is in P for boards of height 2. Next we show how a (c−1) approximation and a randomised 2c/3 approximation algorithm can be derived, and that no polynomial time constant factor, independent of c, approximation algorithm exists unless P=NP. We then investigate how many moves are required for the ‘most demanding’ n×n boards (those requiring the most moves) and show that the number grows as fast as $\Theta(\sqrt{c}\, n)$. Finally, we consider boards where the colours of the tiles are chosen at random and show that for c≥2, the number of moves required to flood the whole board is Ω(n) with high probability.
    Keywords: NP-completeness, Flood-filling, Combinatorial games, Percolation
    Theory of Computing Systems 01/2012; 50(1):72-92. · 0.48 Impact Factor
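The standard flooding operation described in the abstract above can be sketched as follows. This is a minimal illustration of a single move, assuming the board is a list of rows of integer colours and that tiles are connected horizontally and vertically; the function name `flood` is hypothetical and the code does not attempt to find a shortest sequence of moves.

```python
from typing import List


def flood(board: List[List[int]], colour: int) -> List[List[int]]:
    """Return a new board after recolouring the monochromatic region
    connected to the top left tile with the chosen colour."""
    rows, cols = len(board), len(board[0])
    new = [row[:] for row in board]   # leave the input board untouched
    old = new[0][0]
    if old == colour:
        return new                    # choosing the current colour changes nothing
    new[0][0] = colour
    stack = [(0, 0)]
    while stack:
        r, c = stack.pop()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and new[rr][cc] == old:
                new[rr][cc] = colour
                stack.append((rr, cc))
    return new


after = flood([[0, 0, 1],
               [1, 0, 2],
               [2, 2, 2]], 2)
```

Neighbouring tiles that already had the chosen colour need no explicit handling: once the region is recoloured they are simply part of the same monochromatic region for the next move.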
  • Source
    Markus Jalsenius, Benny Porat, Benjamin Sach
    ABSTRACT: We study the problem of parameterized matching in a stream where we want to output matches between a pattern of length m and the last m symbols of the stream before the next symbol arrives. Parameterized matching is a natural generalisation of exact matching where an arbitrary one-to-one relabelling of pattern symbols is allowed. We show how this problem can be solved in constant time per arriving stream symbol and sublinear, near optimal space with high probability. Our results are surprising and important: it has been shown that almost no streaming pattern matching problems can be solved (not even randomised) in less than Theta(m) space, with exact matching as the only known problem to have a sublinear, near optimal space solution. Here we demonstrate that a similar sublinear, near optimal space solution is achievable for an even more challenging problem. The proof is considerably more complex than that for exact matching.
    09/2011;
  • Source
    ABSTRACT: We consider a class of pattern matching problems where a normalising transformation is applied at every alignment. Normalised pattern matching plays a key role in fields as diverse as image processing and musical information processing where application specific transformations are often applied to the input. By considering the class of polynomial transformations of the input, we provide fast algorithms and the first lower bounds for both new and old problems. Given a pattern of length m and a longer text of length n where both are assumed to contain integer values only, we first show O(n log m) time algorithms for pattern matching under linear transformations even when wildcard symbols can occur in the input. We then show how to extend the technique to polynomial transformations of arbitrary degree. Next we consider the problem of finding the minimum Hamming distance under polynomial transformation. We show that, for any epsilon>0, there cannot exist an O(n m^(1-epsilon)) time algorithm for additive and linear transformations conditional on the hardness of the classic 3SUM problem. Finally, we consider a version of the Hamming distance problem under additive transformations with a bound k on the maximum distance that need be reported. We give a deterministic O(nk log k) time solution which we then improve by careful use of randomisation to O(n sqrt(k log k) log n) time for sufficiently small k. Our randomised solution outputs the correct answer at every position with high probability.
    09/2011;
  • Source
    ABSTRACT: We present space lower bounds for online pattern matching under a number of different distance measures. Given a pattern of length m and a text that arrives one character at a time, the online pattern matching problem is to report the distance between the pattern and a sliding window of the text as soon as the new character arrives. We require that the correct answer is given at each position with constant probability. We give Omega(m) bit space lower bounds for L_1, L_2, L_\infty, Hamming, edit and swap distances as well as for any algorithm that computes the cross-correlation/convolution. We then show a dichotomy between distance functions that have wildcard-like properties and those that do not. In the former case which includes, as an example, pattern matching with character classes, we give Omega(m) bit space lower bounds. For other distance functions, we show that there exist space bounds of Omega(log m) and O(log^2 m) bits. Finally we discuss space lower bounds for non-binary inputs and show how in some cases they can be improved.
    06/2011;
  • Source
    Raphaël Clifford, Benjamin Sach
    ABSTRACT: It has recently been shown how to construct online, non-amortised approximate pattern matching algorithms for a class of problems whose distance functions can be classified as being local. Informally, a distance function is said to be local if for a pattern P of length m and any substring T[i,i+m−1] of a text T, the distance between P and T[i,i+m−1] can be expressed as ∑_j Δ(P[j],T[i+j]), where Δ is any distance function between individual characters. We show in this work how to tackle online approximate matching when the distance function is non-local. We give new solutions which are applicable to a wide variety of matching problems including function and parameterised matching, swap matching, swap-mismatch, k-difference, k-difference with transpositions, overlap matching, edit distance/LCS and L1 and L2 rearrangement distances. The resulting online algorithms bound the worst case running time per input character to within a log factor of their comparable offline counterpart.
    Journal of Discrete Algorithms. 03/2011;
  • Source
    ABSTRACT: We study the problem of pattern matching in a stream where an arbitrary relabelling has been applied to the symbols of the stream. Pattern P of length m is said to match a substring of the stream T at position i if there is an injective (one-to-one) function f such that T[i+j]=f(P[j]) for all j=0...m-1. Such a mapping corresponds to a relabelling of the symbols and may be distinct for each alignment of the pattern and streaming text. We first present a real-time deterministic algorithm which requires O(|Sigma|+rho) space, where |Sigma| is the number of distinct characters in the pattern and rho is the parameterised period of the pattern. We then show how to improve the working space to O(|Sigma|log m) words while still finding all matches under relabelling with high probability. Our improved algorithm can be implemented to run in either O(log |Sigma|) worst case time or expected constant time. Finally we show that the working space is optimal up to logarithmic factors.
    01/2011;
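The match condition above, an injective f with T[i+j]=f(P[j]) for all j, can be checked directly for a single alignment. The sketch below is an offline check taking time linear in the pattern length per alignment, not the streaming algorithm of the paper; the function name `parameterised_match` is hypothetical.

```python
def parameterised_match(pattern: str, text: str, i: int) -> bool:
    """True if some injective relabelling f gives text[i + j] == f(pattern[j])
    for every j in range(len(pattern))."""
    if i < 0 or i + len(pattern) > len(text):
        return False
    forward = {}    # pattern symbol -> text symbol (f must be a function)
    backward = {}   # text symbol -> pattern symbol (f must be injective)
    for j, p in enumerate(pattern):
        t = text[i + j]
        if forward.setdefault(p, t) != t or backward.setdefault(t, p) != p:
            return False
    return True


# "aba" matches "xyx" under a -> x, b -> y, but not "yxz" or "xzz".
matches = [parameterised_match("aba", "xyxzz", i) for i in range(3)]
```

The two dictionaries are rebuilt for each alignment, reflecting that the relabelling may be distinct for each alignment of the pattern and the text.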
  • Source
    ABSTRACT: We study the complexity of the popular one player combinatorial game known as Flood-It. In this game the player is given an n×n board of tiles, each of which is allocated one of c colours. The goal is to fill the whole board with the same colour via the shortest possible sequence of flood filling operations from the top left. We show that Flood-It is NP-hard for c ≥ 3, as is a variant where the player can flood fill from any position on the board. We present deterministic (c − 1) and randomised 2c/3 approximation algorithms and show that no polynomial time constant factor approximation algorithm exists unless P=NP. We then demonstrate that the number of moves required for the ‘most difficult’ boards grows like $\Theta(\sqrt{c}\, n)$. Finally, we prove that for random boards with c ≥ 3, the number of moves required to flood the whole board is Ω(n) with high probability.
    05/2010: pages 307-318;
  • Source
    Fun with Algorithms, 5th International Conference, FUN 2010, Ischia, Italy, June 2-4, 2010. Proceedings; 01/2010
  • Raphaël Clifford, Benjamin Sach
    ABSTRACT: We consider the k-difference and k-mismatch problems in the pseudo-realtime model where the text arrives online and the time complexity measure is per arriving character and unamortised. The well-known k-difference/k-mismatch problems are those of finding all alignments of a pattern of length m with a text of length n where the edit/Hamming distance is at most k. Offline, the literature gives efficient solutions in O(nk) and O(n sqrt{k log k}) time, respectively. More recently, a pseudo-realtime solution was given for the former in O(k log m) time and the latter in O(sqrt{k log k} log m) time per arriving text character. Our work improves these complexities to O(k) time for the k-difference problem and O(sqrt{k} log k + log m) for the k-mismatch problem. In the process of developing the main results, we also give a simple solution with optimal time complexity for performing longest common extension queries in the same pseudo-realtime setting which may be of independent interest.
    Combinatorial Pattern Matching, 21st Annual Symposium, CPM 2010, New York, NY, USA, June 21-23, 2010. Proceedings; 01/2010
  • Source
    Raphaël Clifford, Benjamin Sach
    ABSTRACT: We consider the combination of function and permuted matching, each of which has fast solutions in their own right. Given a pattern p of length m and a text t of length n, a function match at position i of the text is a mapping f from Σ_p to Σ_t with the property that f(p_j)=t_{i+j−1} for all j. We show that the problem of determining for each substring of the text, if any permutation of the pattern has a function match is in general NP-Complete. However where the mapping is also injective, so-called parameterised matching, the problem can be solved efficiently in O(n log|Σ_p|) time. We then give a 1/2-approximation for a Hamming distance based optimisation variant by reduction to multiple knapsack with colour constraints.
    Inf. Process. Lett. 01/2010; 110:1012-1015.
  • Source
    Raphaël Clifford, Benjamin Sach
    ABSTRACT: A black box method was recently given that solves the problem of online approximate matching for a class of problems whose distance functions can be classified as being local. A distance function is said to be local if for a pattern P of length m and any substring T(i,i+m−1) of a text T, the distance between P and T(i,i+m−1) is equal to ∑_j Δ(P(j),T(i+j−1)), where Δ is any distance function between individual characters. We extend this line of work by showing how to tackle online approximate matching when the distance function is non-local. We give solutions which are applicable to a wide variety of matching problems including function and parameterised matching, swap matching, swap-mismatch, k-difference, k-difference with transpositions, overlap matching, edit distance/LCS, flipped bit, faulty bit and L1 and L2 rearrangement distances. The resulting unamortised online algorithms bound the worst case running time per input character to within a log factor of their comparable offline counterpart.
    Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009, Lille, France, June 22-24, 2009, Proceedings; 01/2009
  • Source
    Conference Paper: Generalised Matching.
    ABSTRACT: Given a pattern p over an alphabet Σ_p and a text t over an alphabet Σ_t, we consider the problem of determining a mapping f from Σ_p to Σ_t^+ such that t = f(p_1)f(p_2)...f(p_m). This class of problems, which was first introduced by Amir and Nor in 2004, is defined by different constraints on the mapping f. We give NP-Completeness results for a wide range of conditions. These include when f is either many-to-one or one-to-one, when Σ_t is binary and when the range of f is limited to strings of constant length. We then introduce a related problem we term pattern matching with string classes which we show to be solvable efficiently. Finally, we discuss an optimisation variant of generalised matching and give a polynomial-time min(1, √(k/OPT))-approximation algorithm for fixed k.
    String Processing and Information Retrieval, 16th International Symposium, SPIRE 2009, Saariselkä, Finland, August 25-27, 2009, Proceedings; 01/2009
  • Source
    Benjamin Sach, Raphaël Clifford
    ABSTRACT: In recent years the Cache-Oblivious model of external memory computation has provided an attractive theoretical basis for the analysis of algorithms on massive datasets. Much progress has been made in discovering algorithms that are asymptotically optimal or near optimal. However, to date there are still relatively few successful experimental studies. In this paper we compare two different Cache-Oblivious priority queues based on the Funnel and Bucket Heap and apply them to the single source shortest path problem on graphs with positive edge weights. Our results show that when RAM is limited and data is swapping to external storage, the Cache-Oblivious priority queues achieve orders of magnitude speedups over standard internal memory techniques. However, for the single source shortest path problem both on simulated and real world graph data, these speedups are markedly lower due to the time required to access the graph adjacency list itself.
    03/2008;
  • Source
    Raphaël Clifford, Benjamin Sach
    ABSTRACT: We investigate randomised algorithms for subset matching with spatial point sets: given two sets of d-dimensional points, a data set T consisting of n points and a pattern P consisting of m points, find the largest match for a subset of the pattern in the data set. This problem is known to be 3-SUM hard and so unlikely to be solvable exactly in subquadratic time. We present an efficient bit-parallel O(nm) time algorithm and an O(n log m) time solution based on correlation calculations using fast Fourier transforms. Both methods are shown experimentally to give answers within a few percent of the exact solution and provide a considerable practical speedup over existing deterministic algorithms.
    SOFSEM 2007: Theory and Practice of Computer Science, 33rd Conference on Current Trends in Theory and Practice of Computer Science, Harrachov, Czech Republic, January 20-26, 2007, Proceedings; 01/2007