The Sketching Complexity of Pattern Matching
Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120, USA.
Abstract. We address the problems of pattern matching and approxi-
mate pattern matching in the sketching model. We show that it is im-
possible to compress the text into a small sketch and use only the sketch
to decide whether a given pattern occurs in the text. We also prove a
sketch size lower bound for approximate pattern matching, and show it
is tight up to a logarithmic factor.
Pattern matching is the problem of locating a given (smaller) pattern in a (larger)
text. It is one of the most fundamental problems studied in computer science,
having a wide range of uses in text processing, information retrieval, computa-
tional biology, compilers, and web search. These application areas typically deal
with large amounts of data and therefore necessitate highly efficient algorithms
in terms of time and space.
In order to save space, I/O, and bandwidth, large text files are frequently
stored in compressed form. The naive method for locating patterns in compressed
files is to first decompress the files, and then run one of the standard pattern
matching algorithms on them. Amir and Benson [2] initiated the study of pattern
matching in compressed files; their approach is to process the compressed
text directly, without first decompressing it. Their algorithm, as well as all the
subsequent work in this area [3,21,12,24,11,22,15], deals with lossless compression
schemes, such as Huffman coding and the Lempel-Ziv algorithm. The main
focus of these results is the speedup gained by processing the compressed text.
In this paper we investigate a closely related question: how succinctly can
one compress a text file into a small “sketch”, and yet allow locating patterns
in the text using the sketch alone? In this context we consider not only lossless
compression schemes but also lossy ones. In turn, we permit pattern matching
algorithms that are randomized and can make errors with some small constant
probability. Our main focus is not on the speed of the pattern matching al-
gorithms but rather on the succinctness of the compression. Highly succinct
compression schemes of this sort could be very appealing in domains where the
text is a massive data set or when the text needs to be sent over a network.
A fundamental and well known model that addresses problems of this kind
is the sketching model [8,14], which is a powerful paradigm in the context of
computations over massive data sets. Given a function, the idea is to produce
a fingerprint (sketch) of the data that is succinct yet rich enough to let one
compute or approximate the function on the data. The parameters that play a
key role in the applications are the size of the sketch, the time needed to produce
the sketch and the time required to compute the function given the sketch.
Results. Our first main result is an impossibility theorem showing that in the
worst case, no sketching algorithm can compress the text by more than a constant
factor and yet allow exact pattern matching. Specifically, any sketching
algorithm that compresses any text of length n into a sketch of size s and enables
determining from the sketch alone whether an input pattern of length
m = Ω(log n) matches the text or not with a constant probability of error requires
s ≥ Ω(n − m). We further show that the bound is tight, up to constant factors.
The proof of this lower bound turns out to be more intricate than one might
expect. One of the peculiarities of the problem is that it exhibits completely
different behaviors for m ≤ (1 − o(1)) log n and m ≥ log n. In the former case,
a simple compression of the text into a sketch of size 2^m is possible. We prove
a matching lower bound for this range of m as well. These results are described
in Section 3.
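To make the small-pattern regime concrete, here is a minimal Python sketch of one natural way to obtain a sketch of 2^m bits: keep one bit per binary string of length m and set it when that string occurs in the text. This particular construction, the binary alphabet, the assumption that m is known when the sketch is built, and the names build_sketch and occurs are all illustrative assumptions; the construction actually used is the one in Section 3.

```python
def build_sketch(text: str, m: int) -> list[bool]:
    """Return a 2**m-entry bit vector; entry v is True iff the length-m
    binary string with integer value v occurs somewhere in text.
    Assumes text is a string over the alphabet {'0', '1'}."""
    sketch = [False] * (2 ** m)
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        sketch[int(window, 2)] = True  # interpret the window as a binary number
    return sketch


def occurs(sketch: list[bool], pattern: str) -> bool:
    """Decide from the sketch alone whether pattern (of length m) occurs."""
    return sketch[int(pattern, 2)]


sketch = build_sketch("0110100", 3)
print(occurs(sketch, "101"), occurs(sketch, "000"))
```

The sketch size depends only on m, not on n, which is why this approach pays off only when 2^m is smaller than the text length.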
Our second main result is a lower bound on the size of sketches for approxi-
mate pattern matching, which is a relaxed version of pattern matching: (i) if the
pattern occurs in the text, the output should be “a match”; (ii) if every substring
of the text is at Hamming distance at least k from the pattern, the output should
be “no match”. An arbitrary answer is allowed if neither of the two holds. We
prove that any sketching algorithm for approximate pattern matching requires
sketch size Ω(n/m), where n is the length of the text, m is the length of the pattern,
and the Hamming distance in question is k = εm, for a fixed 0 < ε < 1. We
further show that this bound is tight, up to a logarithmic factor. These results
are described in Section 4.
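The three-way specification of approximate pattern matching can be spelled out with a brute-force reference decider. The Python sketch below reads the full text, so it is not a sketching algorithm; it only illustrates which outputs are required and when any answer is acceptable. The names hamming and approx_match_reference, and the use of None for the "arbitrary answer" case, are assumptions made for illustration.

```python
from typing import Optional


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def approx_match_reference(text: str, pattern: str, k: int) -> Optional[bool]:
    """Return True for "a match", False for "no match", and None when
    neither condition holds (any answer is then allowed)."""
    m = len(pattern)
    dists = [hamming(text[i:i + m], pattern) for i in range(len(text) - m + 1)]
    if any(d == 0 for d in dists):
        return True   # case (i): the pattern occurs in the text
    if all(d >= k for d in dists):
        return False  # case (ii): every substring is at distance at least k
    return None       # neither case holds: the output is unconstrained


print(approx_match_reference("00001111", "0011", 2))
print(approx_match_reference("00001111", "1010", 2))
print(approx_match_reference("00001111", "1010", 3))
```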
Interestingly, Batu et al. [6] showed a sampling procedure that solves (a
restricted version of) approximate pattern matching using Õ(n/m) non-adaptive
samples from the text. In particular, their algorithm yields a sketching algorithm
with sketch size Õ(n/m). This procedure was the main building block in their
sub-linear time algorithm for weakly approximating the edit distance. The fact
that our sketching lower bound nearly matches their sampling upper bound
suggests that it might be hard to improve their edit distance algorithm, even in
the sketching model.
Techniques. A sketching algorithm naturally corresponds to the communication
complexity of a one-way protocol. Alice holds the text and Bob holds the pattern.
Alice needs to send a single message to Bob (the “sketch”), and Bob needs to
use this message as well as his input to determine whether there is a match or
1 Usually, a sketching algorithm corresponds to the communication complexity of a
simultaneous messages protocol, which is equivalent to summarizing each of the text
Remark. The above sketching algorithm is actually stronger than claimed in
the theorem, as it determines, with high probability, whether there exists an
index i ∈ [n] such that HD(x[i,i + m − 1],y) ≤ εm/2, or whether for all i ∈ [n],
HD(x[i,i + m − 1],y) ≥ εm, assuming that one of the two holds.
1. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the
frequency moments. Journal of Computer and System Sciences, 58(1):137–147,
2. A. Amir and G. Benson. Efficient two-dimensional compressed matching. In Pro-
ceedings of IEEE Data Compression Conference (DCC), pages 279–288, 1992.
3. A. Amir, G. Benson, and M. Farach. Let sleeping files lie: Pattern matching in
Z-compressed files. J. of Computer and System Sciences, 52(2):299–307, 1996.
4. Z. Bar-Yossef, T. S. Jayram, R. Krauthgamer, and R. Kumar. Approximating edit
distance efficiently. Manuscript, 2004.
5. Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. Information theory
methods in communication complexity. In Proceedings of the 17th Annual IEEE
Conference on Computational Complexity, pages 93–102, 2002.
6. T. Batu, F. Ergün, J. Kilian, A. Magen, S. Raskhodnikova, R. Rubinfeld, and
R. Sami. A sublinear algorithm for weakly approximating edit distance. In Pro-
ceedings of the 35th Annual ACM Symposium on Theory of Computing, pages
7. A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent
permutations. Journal of Computer and System Sciences, 60(3):630–659, 2000.
8. A. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of
the web. WWW6/Computer Networks, 29(8–13):1157–1166, 1997.
9. M. Charikar. Similarity estimation techniques from rounding algorithms. In Pro-
ceedings of the 34th Annual ACM Symposium on Theory of Computing, pages
10. T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley &
Sons, Inc., 1991.
11. E. de Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible
word searching on compressed text. ACM Transactions on Information Systems,
12. M. Farach and M. Thorup. String matching in Lempel-Ziv compressed strings.
Algorithmica, 20(4):388–404, 1998.
13. J. Feigenbaum, Y. Ishai, T. Malkin, K. Nissim, M. J. Strauss, and R. N. Wright.
Secure multiparty computation of approximations. In 28th International Collo-
quium on Automata, Languages and Programming, volume 2076 of Lecture Notes
in Computer Science, pages 927–938. Springer, 2001.
14. J. Feigenbaum, S. Kannan, M. J. Strauss, and M. Viswanathan. An approximate
L1-difference algorithm for massive data streams. SIAM J. Comput., 32(1):131–
15. P. Ferragina and G. Manzini. Opportunistic data structures with applications. In
Proceedings of the 41st Annual Symposium on Foundations of Computer Science,
pages 390–398. IEEE Computer Society, 2000.
16. P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the
curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on
Theory of Computing (STOC), pages 604–613, 1998.
17. R. M. Karp and M. O. Rabin. Efficient randomized pattern-matching algorithms.
IBM Journal of Research and Development, 31(2):249–260, 1987.
18. I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication com-
plexity. Computational Complexity, 8(1):21–49, 1999.
19. E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate near-
est neighbor in high dimensional spaces. SIAM Journal on Computing, 30(2):457–
20. S. Lonardi. Pattern matching pointers.
21. U. Manber. A text compression scheme that allows fast searching directly in the
compressed file. ACM Transactions on Information Systems, 15(2):124–136, 1997.
22. G. Navarro and J. Tarhio. Boyer-Moore string matching over Ziv-Lempel com-
pressed text. In Proceedings of 11th Annual Symposium on Combinatorial Pattern
Matching (CPM), volume 1848 of Lecture Notes in Computer Science, pages 166–
180. Springer, 2000.
23. I. Newman. Private vs. common random bits in communication complexity. Inf.
Process. Lett., 39(2):67–71, 1991.
24. Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa. A Boyer-
Moore type algorithm for compressed pattern matching. In Proceedings of 11th
Annual Symposium on Combinatorial Pattern Matching (CPM), volume 1848 of
Lecture Notes in Computer Science, pages 181–194. Springer, 2000.
25. A. C.-C. Yao. Lower bounds by probabilistic arguments. In Proceedings of the
24th Annual IEEE Symposium on Foundations of Computer Science, pages 420–