The Sketching Complexity of Pattern Matching

Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar

IBM Almaden Research Center

650 Harry Road, San Jose, CA 95120, USA.

{ziv,jayram,robi,ravi}@almaden.ibm.com

Abstract. We address the problems of pattern matching and approxi-

mate pattern matching in the sketching model. We show that it is im-

possible to compress the text into a small sketch and use only the sketch

to decide whether a given pattern occurs in the text. We also prove a

sketch size lower bound for approximate pattern matching, and show it

is tight up to a logarithmic factor.

1 Introduction

Pattern matching is the problem of locating a given (smaller) pattern in a (larger)

text. It is one of the most fundamental problems studied in computer science,

having a wide range of uses in text processing, information retrieval, computa-

tional biology, compilers, and web search. These application areas typically deal

with large amounts of data and therefore necessitate highly efficient algorithms

in terms of time and space.

In order to save space, I/O, and bandwidth, large text files are frequently

stored in compressed form. The naive method for locating patterns in compressed

files is to first decompress the files, and then run one of the standard pattern

matching algorithms on them. Amir and Benson [2] initiated the study of pat-

tern matching in compressed files; their approach is to process the compressed

text directly, without first decompressing it. Their algorithm, as well as all the

subsequent work in this area [3,21,12,24,11,22,15], deals with lossless

compression schemes, such as Huffman coding and the Lempel-Ziv algorithm. The main

focus of these results is the speedup gained by processing the compressed text

directly.

In this paper we investigate a closely related question: how succinctly can

one compress a text file into a small “sketch”, and yet allow locating patterns

in the text using the sketch alone? In this context we consider not only lossless

compression schemes but also lossy ones. In turn, we permit pattern matching

algorithms that are randomized and can make errors with some small constant

probability. Our main focus is not on the speed of the pattern matching al-

gorithms but rather on the succinctness of the compression. Highly succinct

compression schemes of this sort could be very appealing in domains where the

text is a massive data set or when the text needs to be sent over a network.

A fundamental and well known model that addresses problems of this kind

is the sketching model [8,14], which is a powerful paradigm in the context of

computations over massive data sets. Given a function, the idea is to produce

a fingerprint (sketch) of the data that is succinct yet rich enough to let one

compute or approximate the function on the data. The parameters that play a

key role in the applications are the size of the sketch, the time needed to produce

the sketch and the time required to compute the function given the sketch.

Results. Our first main result is an impossibility theorem showing that in the

worst-case, no sketching algorithm can compress the text by more than a con-

stant factor and yet allow exact pattern matching. Specifically, any sketching

algorithm that compresses any text of length n into a sketch of size s and enables

determining from the sketch alone whether an input pattern of length

m = Ω(log n) matches the text or not with a constant probability of error requires

s ≥ Ω(n − m). We further show that the bound is tight, up to constant

factors.

The proof of this lower bound turns out to be more intricate than one might

expect. One of the peculiarities of the problem is that it exhibits completely

different behaviors for m ≤ (1 − o(1)) log n and m ≥ log n. In the former case,

a simple compression of the text into a sketch of size 2^m is possible. We prove

a matching lower bound for this range of m as well. These results are described

in Section 3.

Our second main result is a lower bound on the size of sketches for approxi-

mate pattern matching, which is a relaxed version of pattern matching: (i) if the

pattern occurs in the text, the output should be “a match”; (ii) if every substring

of the text is at Hamming distance at least k from the pattern, the output should

be “no match”. An arbitrary answer is allowed if neither of the two holds. We

prove that any sketching algorithm for approximate pattern matching requires

sketch size Ω(n/m), where n is the length of the text, m is the length of the pattern,

and the Hamming distance in question is k = εm, for a fixed 0 < ε < 1. We

further show that this bound is tight, up to a logarithmic factor. These results

are described in Section 4.

Interestingly, Batu et al. [6] showed a sampling procedure that solves (a

restricted version of) approximate pattern matching using Õ(n/m) non-adaptive

samples from the text. In particular, their algorithm yields a sketching algorithm

with sketch size Õ(n/m). This procedure was the main building block in their

sub-linear time algorithm for weakly approximating the edit distance. The fact

that our sketching lower bound nearly matches their sampling upper bound

suggests that it might be hard to improve their edit distance algorithm, even in

the sketching model.

Techniques. A sketching algorithm naturally corresponds to the communication

complexity of a one-way protocol. Alice holds the text and Bob holds the pattern.

Alice needs to send a single message to Bob (the “sketch”), and Bob needs to

use this message as well as his input to determine whether there is a match or

not.¹

¹ Usually, a sketching algorithm corresponds to the communication complexity of a

simultaneous messages protocol, which is equivalent to summarizing each of the text

and the pattern into a small sketch. However, in the context of pattern matching, it

is reasonable to have a weaker requirement, namely, that only the text needs to be

summarized.

The most classical problem that is hard for one-way communication complexity

is the indexing function: Alice is given a string x ∈ {0,1}^n and Bob is

given an index i ∈ {1,...,n}, and based on a single message from Alice, Bob has

to output x_i. It is well known that in any protocol solving this problem, even

a randomized one, Alice’s message has to be of length Ω(n). Our lower bound

for approximate pattern matching is proved by a reduction from the indexing

function.

Our lower bound for exact pattern matching uses a reduction from a variant

of the indexing function. In this variant, Alice gets a string x ∈ {0,1}^n; Bob gets

an index i ∈ [n] and also the m − 1 bits preceding x_i in x; the goal is to output

x_i. Using tools from information theory we prove an Ω(n − m) lower bound for

this problem in the one-way communication complexity model.

Related work. Pattern matching and approximate pattern matching have a rich

history and extensive literature; see, for instance, the excellent resource page

[20]. To the best of our knowledge, pattern matching has not been considered in

the sketching model. For approximate pattern matching, the only relevant result

appears to be the above-mentioned work of Batu et al. [6].

Sketching algorithms for various problems, such as estimation of similarity

between documents [8,7,9], approximation of Hamming distance [19,13] and

edit distance [4] between strings, and computation of L_p distances between vectors

[1,14], have been proposed in the last few years. Sketching is also a useful

tool for approximate nearest-neighbor schemes [19,16], and it is related to low-dimensional

embeddings and to locality-sensitive hash functions [16].

2 Preliminaries

2.1 Communication complexity

A sketching algorithm is best viewed as a public-coin one-way communication

complexity protocol. Two players, Alice and Bob, would like to jointly compute

a two-argument function f : X × Y → Z. Alice is given x ∈ X and Bob is

given y ∈ Y. Based on her input and based on randomness that is shared with

Bob, Alice prepares a “sketch” s_A(x) and sends it to Bob. Bob uses the sketch,

his own input y, and the shared randomness to determine the value f(x,y).

For every input (x,y) ∈ X × Y, the protocol is required to be correct with

probability at least 1 − δ, where 0 < δ < 1 is some small constant. Typically,

the error probability δ can be reduced by repeating the procedure several times

independently (in parallel).

The main measure of cost of a sketching algorithm is the length of the sketch

s_A(x) on the worst-case choice of shared randomness and of the input x. Another

important resource is the amount of randomness between Alice and Bob. Newman [23]

shows that the amount of shared randomness can always be reduced

to O(log(n/δ′)) at the cost of increasing the protocol’s error probability by δ′. In

one-way protocols, Alice can privately generate these O(log(n/δ′)) bits and send them

to Bob along with the sketch s_A(x).

Some of our lower bounds use a reduction from the standard indexing problem,

which we denote by IND_t: Alice is given a string x ∈ {0,1}^t, Bob is

given j ∈ [t], and the goal is to output x_j. This problem has a lower bound

of t(1 − H_2(δ)) in the one-way communication complexity model [18,5].
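To make the indexing problem concrete, the following minimal Python sketch shows the trivial one-way protocol (Alice simply sends all t bits) and evaluates the quoted bound t(1 − H_2(δ)) numerically. The function names are illustrative assumptions and do not come from the paper; strings over {0,1} are represented as Python strings.

```python
import math


def binary_entropy(p: float) -> float:
    """H_2(p) = -p log2(p) - (1 - p) log2(1 - p), with H_2(0) = H_2(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def ind_lower_bound(t: int, delta: float) -> float:
    """Evaluate the bound t * (1 - H_2(delta)) stated above for IND_t."""
    return t * (1 - binary_entropy(delta))


def alice_trivial_sketch(x: str) -> str:
    """Trivial protocol (hypothetical helper): the message is x itself, t bits."""
    return x


def bob_output(sketch: str, j: int) -> str:
    """Bob reads bit x_j from the message; j is 1-indexed as in the text."""
    return sketch[j - 1]
```

For a constant error probability such as δ = 0.1 the bound is a constant fraction of t, so the trivial t-bit message is optimal up to a constant factor.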

2.2 Pattern matching and approximate pattern matching

For a Boolean string x ∈ {0,1}^n and integer 1 ≤ j ≤ n, let x_j denote the jth

bit in x. For integers 1 ≤ i ≤ j ≤ n, [i,j] denotes the corresponding integer

interval, [n] the interval [1,n] = {1,...,n}, and x[i,j] denotes the substring of x

that starts at position i and ends at position j. We define the pattern matching

and approximate pattern matching problems in the communication model.

Let 0 ≤ m ≤ n. In the (n,m) pattern matching problem, denoted PM_{n,m},

Alice gets a string x ∈ {0,1}^n and Bob gets a string y ∈ {0,1}^m. The goal is to

determine whether there exists an index i ∈ [n−m+1] such that x[i,i+m−1] = y.

For the purpose of lower bounds, we consider the simple Boolean function

defined above. However, some of the algorithms we present can additionally find

the position of the match i, if it exists.
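As a reference point, the function PM_{n,m} can be evaluated directly when the whole text is available. The brute-force Python sketch below is only a specification of the problem, not a sketching algorithm; the function names are assumptions made for illustration, and positions are reported 1-indexed to match the text.

```python
from typing import List


def pattern_match(x: str, y: str) -> bool:
    """Boolean PM_{n,m}: does y occur as a substring x[i, i+m-1] of x?"""
    n, m = len(x), len(y)
    return any(x[i:i + m] == y for i in range(n - m + 1))


def match_positions(x: str, y: str) -> List[int]:
    """All 1-indexed positions i in [n-m+1] with x[i, i+m-1] == y."""
    n, m = len(x), len(y)
    return [i + 1 for i in range(n - m + 1) if x[i:i + m] == y]
```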

We denote the Hamming distance of two strings x,y ∈ {0,1}^n by HD(x,y) :=

|{i ∈ [n] : x_i ≠ y_i}|. A relaxed version of pattern matching is the (n,m,ε)

approximate pattern matching problem, denoted APM_{n,m,ε}, in which Bob would

like to determine whether there exists an index i ∈ [n−m+1] such that x[i,i+m−1] = y,

or whether for all i ∈ [n−m+1], HD(x[i,i+m−1],y) ≥ εm, assuming that

one of the two holds.
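The promise structure of approximate pattern matching can be spelled out in a few lines. The Python sketch below is a brute-force specification under the definition above, not a sketching algorithm; returning None when neither case of the promise holds is a choice made here for illustration (the problem allows any answer on such inputs), and the function names are assumptions.

```python
from typing import Optional


def hamming_distance(u: str, v: str) -> int:
    """HD(u, v): number of positions where two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(u, v))


def approx_pattern_match(x: str, y: str, eps: float) -> Optional[bool]:
    """True if y occurs in x; False if every length-m window of x is at
    Hamming distance >= eps * m from y; None if the promise is violated."""
    m = len(y)
    dists = [hamming_distance(x[i:i + m], y) for i in range(len(x) - m + 1)]
    if 0 in dists:
        return True
    if all(d >= eps * m for d in dists):
        return False
    return None
```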

Notation. Throughout the paper we denote random variables in upper case. For

a Boolean string x ∈ {0,1}n, |x| denotes the Hamming weight (i.e., the number of

1’s) of x. log denotes a logarithm to the base 2; ln denotes the natural logarithm.

H_2(p) = −p log p − (1 − p) log(1 − p) is the binary entropy function.

3 Exact pattern matching

In this section we obtain a simple sketching algorithm for exact pattern matching

and show almost matching lower bounds. Recall that we denote by PM_{n,m} the

problem in which Alice gets a string x ∈ {0,1}^n, Bob gets a string y ∈ {0,1}^m,

and their goal is to find whether there exists an index i ∈ [n − m + 1] such that

x[i,i + m − 1] = y.

3.1 Upper bounds

First, we show an efficient (randomized) sketching algorithm for the pattern

matching problem, based on the Karp–Rabin hash function [17]. Next, we show

a deterministic sketching algorithm for the Boolean version of the pattern match-

ing problem.

Proposition 1. For m ≤ n − log n, there is a one-sided error randomized sketching

algorithm for the pattern matching problem PM_{n,m} using a sketch of size

O(n − m).

Proof. The randomized algorithm is based on the Karp–Rabin method [17]. Let

t = n−m+1; we assume in the sequel that t ≤ n/3, as otherwise the proof follows

trivially by Alice sending x. Let x^1,...,x^t denote the sequence of t substrings

of x of length m. Alice and Bob use the shared randomness to pick a (weak) 2-universal

hash function h : {0,1}^m → [n^2]. Alice sends to Bob h(x^1),...,h(x^t).

Bob outputs “match found at i” if h(x^i) = h(y) for some i. If no such i exists, Bob outputs

“no match found”.

This is a one-sided error algorithm: if there is a match, it will surely be output.

There is a possibility for a false match, though: when x^i ≠ y, but h(x^i) = h(y).

The probability for a false match is thus at most the probability h has a collision

between y and any of {x^1,...,x^t}. A union bound shows that since the range of

h is large enough, the probability of a collision between y and any x^i is at most

O(1/n).
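The union bound in the last step can be written out explicitly. The derivation below assumes that "weak 2-universal" means any two distinct strings collide under h with probability at most c/n^2 for some constant c; the constant c is not named in the text and is introduced here only for illustration. It also uses t ≤ n.

```latex
\Pr\bigl[\exists\, i \in [t] : x^i \neq y \ \wedge\ h(x^i) = h(y)\bigr]
  \;\le\; \sum_{i \in [t],\; x^i \neq y} \Pr\bigl[h(x^i) = h(y)\bigr]
  \;\le\; t \cdot \frac{c}{n^2}
  \;\le\; \frac{c}{n}
  \;=\; O(1/n).
```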

The scheme described above uses a sketch of size O(t log n) = O((n−m) log n).

Further improvement is possible using the Karp–Rabin hash function h(b) =

(Σ_{i=1}^{m} b_i · 2^{m−i}) mod p, where p is a randomly chosen prime in the range [n^3]

(here b_i is the ith bit in the binary representation of b). The advantage of this

hash function is that the value of h(x^{i+1}) can be computed from the value of h(x^i)

and from the two bits x_i and x_{i+m}: h(x^{i+1}) = ((h(x^i) − x_i·2^{m−1})·2 + x_{i+m}) mod p.

Thus, what Alice needs to send is only h(x^1), the first t bits of x, and the last t

bits of x. The sketch size therefore goes down to 2t + O(log n) = O(n − m).
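The rolling-hash scheme above can be illustrated with a minimal Python sketch. It assumes the prime p is supplied by the caller (in the text it is a random prime in [n^3] drawn from the shared randomness; prime generation is omitted here), and the function names are hypothetical. Alice's message consists of the hash of the first window, the first t bits of x, and the last t bits of x; Bob rebuilds the hashes of all t windows with the recurrence and compares them to the hash of y.

```python
from typing import List, Optional, Tuple

Sketch = Tuple[int, str, str]  # (hash of first window, first t bits, last t bits)


def kr_hash(bits: str, p: int) -> int:
    """Karp-Rabin hash: the bit string read as a binary number, mod p."""
    h = 0
    for c in bits:
        h = (2 * h + int(c)) % p
    return h


def alice_sketch(x: str, m: int, p: int) -> Sketch:
    """Alice's message: 2t bits of x plus one hash value."""
    t = len(x) - m + 1
    return kr_hash(x[:m], p), x[:t], x[-t:]


def bob_hashes(sketch: Sketch, n: int, m: int, p: int) -> List[int]:
    """Rebuild the hashes of all t windows of length m from the sketch."""
    h, first, last = sketch
    t = n - m + 1
    top = pow(2, m - 1, p)  # weight of the bit that leaves the window
    hashes = [h]
    for i in range(1, t):
        out_bit = int(first[i - 1])  # x_i (1-indexed)
        in_bit = int(last[i])        # x_{i+m}; last[k] holds x_{m+k}
        h = ((h - out_bit * top) * 2 + in_bit) % p
        hashes.append(h)
    return hashes


def bob_decide(sketch: Sketch, y: str, n: int, p: int) -> Optional[int]:
    """Return the first 1-indexed i whose window hash equals the hash of y, else None."""
    target = kr_hash(y, p)
    for i, h in enumerate(bob_hashes(sketch, n, len(y), p), start=1):
        if h == target:
            return i
    return None
```

A true occurrence is always reported, which matches the one-sided error guarantee. A reported position may be a false match when p is small relative to 2^m, since distinct windows can then share a hash value.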

Proposition 2. There is a deterministic sketching algorithm for the pattern

matching problem PM_{n,m} using a sketch of size 2^m.

Proof. In the deterministic algorithm Alice sends to Bob a characteristic vector

of length 2^m specifying all the strings of length m that occur as substrings of x.

Bob outputs “match found” if and only if y is one of the substrings indicated by

the characteristic vector.
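The deterministic scheme is short enough to sketch directly. The Python code below is an illustration under the assumption m ≥ 1, with hypothetical function names; it indexes the characteristic vector by the integer value of each length-m window.

```python
from typing import List


def alice_char_vector(x: str, m: int) -> List[bool]:
    """Characteristic vector of length 2^m: entry v is True iff the m-bit
    string with integer value v occurs as a substring of x (requires m >= 1)."""
    vec = [False] * (1 << m)
    for i in range(len(x) - m + 1):
        vec[int(x[i:i + m], 2)] = True
    return vec


def bob_lookup(vec: List[bool], y: str) -> bool:
    """Bob reports a match iff y is marked in the characteristic vector."""
    return vec[int(y, 2)]
```

The sketch length depends only on m, not on n, which is why this scheme is attractive only when 2^m is small compared to n − m.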

3.2 Lower bounds

We show lower bounds on the sketch size for the pattern matching problem.

The first one, Theorem 1, deals with the case m ≥ Ω(log n). The second one,

Theorem 2, considers the case m ≤ O(log n).