Sketching in Adversarial Environments
Ilya Mironov∗
Moni Naor†
Gil Segev‡
Abstract
We formalize a realistic model for computations over massive data sets. The model, re-
ferred to as the adversarial sketch model, unifies the well-studied sketch and data stream models
together with a cryptographic flavor that considers the execution of protocols in “hostile en-
vironments”, and provides a framework for studying the complexity of many tasks involving
massive data sets.
In the adversarial sketch model several parties are interested in computing a joint function in
the presence of an adversary that dynamically chooses their inputs. These inputs are provided to
the parties in an on-line manner, and each party incrementally updates a compressed sketch of
its input. The parties are not allowed to communicate, they do not share any secret information,
and any public information they share is known to the adversary in advance. Then, the parties
engage in a protocol in order to evaluate the function on their current inputs using only the
compressed sketches.
In this paper we settle the complexity of two fundamental problems in this model: testing
whether two massive data sets are equal, and approximating the size of their symmetric differ-
ence. For these problems we construct explicit and efficient protocols that are optimal up to
poly-logarithmic factors. Our main technical contribution is an explicit and deterministic encoding scheme that enjoys two seemingly conflicting properties, incrementality and high distance, and may be of independent interest.
A preliminary version of this work appeared in Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC), pages 651–660, 2008.
∗Microsoft Research, Silicon Valley Campus, 1065 La Avenida, Mountain View, CA 94043. Email: mironov@
microsoft.com.
†Incumbent of the Judith Kleeman Professorial Chair, Department of Computer Science and Applied Mathematics,
Weizmann Institute of Science, Rehovot 76100, Israel. Email: moni.naor@weizmann.ac.il. Research supported in
part by a grant from the Israel Science Foundation. Part of the work was done at Microsoft Research.
‡Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100,
Israel. Email: gil.segev@weizmann.ac.il. Part of the work was done at Microsoft Research.
1 Introduction
The past two decades have witnessed striking technological breakthroughs in information collection
and storage capabilities. These breakthroughs allowed the emergence of enormous collections of
data, referred to as massive data sets, such as the World Wide Web, Internet traffic logs, finan-
cial transactions, census data and many more. This state of affairs introduces new and exciting
challenges in analyzing massive data sets and extracting useful information.
From a computational point of view, most of the traditional computational models consider
settings in which the input data is easily and efficiently accessible. This is, however, usually not
the case when dealing with massive data sets. Such data sets may either be stored on highly
constrained devices or may only be accessed in an on-line manner without the ability to actually
store any significant fraction of the data. In recent years several computational models which
are suitable for computing over massive data sets have been developed, such as sketch and lossy
compression schemes [13, 24], data stream computations [3, 21, 27], and property testing [26, 39].
Motivated by the challenges posed by computational tasks involving massive data sets, and by
the existing approaches for modeling such tasks, we formalize a realistic model of computation which
we refer to as the adversarial sketch model. This model can be seen as unifying the standard sketch
model and the data stream model together with a cryptographic flavor that considers the execution
of protocols in “hostile environments”. The model under consideration provides a framework for
studying the complexity of many fundamental and realistic problems that arise in the context of
massive data sets. In what follows we briefly describe the standard sketch model and the data
stream model, as well as our approach for modeling computations in hostile environments in this
context.
The standard sketch model. In the standard sketch model the input is distributed among several parties. Each party runs a compression procedure to obtain a compact “sketch” of its input, and these sketches are then delivered to a referee. The referee has to compute (or to approximate) the value of a pre-determined function applied to the inputs of the parties by using only the sketches and not the actual inputs. The parties are not allowed to communicate with each other, but are allowed to share a random reference string which is chosen independently of their inputs. This string can be used, for example, to choose a random hash function that will be applied by each party to obtain a compressed sketch of its input. This model fits many scenarios in which a massive data set is partitioned and stored in a distributed manner in several locations. In each location a compressed sketch of the stored data is computed, and then sent to a central processing unit that uses only the sketches and not the actual data. The main performance criterion for protocols in this model is the size of the sketches. We note that this model is essentially the public-coin variant of the simultaneous communication model introduced by Yao [43].
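To make the shared-randomness assumption concrete, here is a minimal sketch protocol for equality testing in the standard sketch model, using polynomial fingerprints over a prime field as the shared hash. The names and parameters are ours, purely for illustration; two distinct sets collide only if the shared evaluation point is a root of their difference polynomial, which happens with probability at most N/(P−1).

```python
import random

# Toy public-coin sketch for equality testing in the standard sketch model.
# The shared random string selects the hash function *independently of the
# inputs*: here it is an evaluation point for a polynomial fingerprint.

P = (1 << 61) - 1  # a Mersenne prime, large enough for this toy example

def choose_shared_randomness(rng):
    """The shared random string: an evaluation point for the fingerprint."""
    return rng.randrange(1, P)

def sketch(elements, r):
    """Compress a set into one field element: sum of r^x mod P over x in S.
    Order-independent, so any insertion order yields the same sketch."""
    return sum(pow(r, x, P) for x in set(elements)) % P

def referee(sketch_a, sketch_b):
    return "Equal" if sketch_a == sketch_b else "Not Equal"

r = choose_shared_randomness(random.Random(2024))
s_alice = sketch({3, 17, 42}, r)
s_bob = sketch({42, 3, 17}, r)
print(referee(s_alice, s_bob))               # Equal: same underlying set
print(referee(s_alice, sketch({3, 17}, r)))  # Not Equal: sketches differ by r^42, nonzero mod P
```

Note that the guarantee holds only because r is chosen after (or independently of) the inputs, which is exactly the assumption challenged below.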
The data stream model. In the data stream model the input is received as a one-way stream. Once an element from the stream has been processed it is discarded and cannot be retrieved unless it is explicitly stored in memory, which is typically small relative to the size of the data stream. The data stream model captures scenarios in which computations involve either massive data sets that are stored on sequential magnetic devices (for which one-way access is the most efficient access method), or on-line data which is continuously generated and not necessarily stored. The main performance criteria for algorithms in this model are the amount of storage they consume and the amount of time required for processing each element in the stream. For a more complete description of this model, its variants and the main results we refer the reader to the surveys by Muthukrishnan [35] and by Babcock et al. [5].
The adversarial factor. In the standard sketch model described above, it is assumed that the parties share a random string, which is chosen independently of the inputs held by the parties1. In many real-life scenarios, however, it is not at all clear that such an assumption is valid. First, since the parties are assumed not to communicate with each other, this forces the introduction of trust in a third party to set up the random string. In many situations such trust may not be available, and if the shared string is set up in an adversarial manner there are usually no guarantees on the behavior of the protocol. That is, there may be “bad” choices of the shared string that cause the protocol to fail with very high probability. Second, even when a truly random string is available, this string may be known to an adversary as well (and in advance), and serve as a crucial tool in attacking the system. For example, an adversary may be able to set the inputs of the parties after having seen the random string. Thus, when considering computations in a setting where the inputs of the parties may be adversarially chosen, it is usually not justified to assume independence between the shared random string and the inputs of the parties2. For these reasons we are interested in exploring the feasibility and efficiency of computations over massive data sets in hostile environments. In such environments the honest parties do not share any secret information, and any public information they share is known in advance to the adversary, who may then set the inputs of the parties. Protocols designed in such a model have significant security and robustness benefits.
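To illustrate why a public random string that is known before the inputs are chosen can be fatal, the following toy experiment plays the adversary against a naive hash-based sketch with a deliberately small range; the collision search it performs is exactly the attack described above. All parameters and names are illustrative.

```python
import random
from itertools import count

# The "shared random string" (here, a hash seed) is public *before* the
# inputs are chosen, so the adversary can search for two distinct inputs
# whose sketches collide, forcing an equality test based on this sketch
# to wrongly report Equal. The 16-bit range makes the search fast.

M = 2 ** 16  # deliberately small sketch range

def sketch(x, seed):
    """A keyed hash truncated to the small sketch range."""
    return hash((seed, x)) % M

seed = random.Random(7).getrandbits(64)  # public randomness, known in advance

# Adversary's move: having seen `seed`, find x != y with colliding sketches.
# By the pigeonhole principle a collision appears within M + 1 probes.
seen = {}
for x in count():
    s = sketch(x, seed)
    if s in seen:
        x_alice, x_bob = seen[s], x
        break
    seen[s] = x

assert x_alice != x_bob
assert sketch(x_alice, seed) == sketch(x_bob, seed)  # forced false "Equal"
print(f"collision found after {x_bob + 1} probes")
```

With sublinear sketches such collisions necessarily exist; the point is that a known seed lets the adversary find one and plant the colliding inputs.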
Sketching in adversarial environments. We consider a model with three participating parties: two honest parties, Alice and Bob, and an adversarial party3. Computation in this model proceeds in two phases. In the first phase, referred to as the sketch phase, the adversarial party chooses the inputs of Alice and Bob. These inputs are sets of elements taken from a large universe U, and are provided to the honest parties in an on-line manner in the form of a sequence of insert and delete operations. Once an operation from the sequence has been processed it is discarded and cannot be retrieved unless explicitly stored. This phase defines the input sets S_A ⊆ U and S_B ⊆ U of Alice and Bob, respectively. During this phase the honest parties are completely isolated in the sense that (1) they are not allowed to communicate with each other, and (2) the sequence of operations communicated to each party is hidden from the other party. In addition, we assume that the honest parties do not share any secret information, and that any public information they share is known to the adversary in advance. In the second phase, referred to as the interaction phase, Alice and Bob engage in a protocol in order to compute (or approximate) a pre-determined function of their input sets.
When designing protocols in the adversarial sketch model we are mainly interested in the
following performance criteria: (1) the amount of storage (i.e., the size of the sketches), (2) the
update time during the sketch phase (i.e., the time required for processing each of the insert and
delete operations), and (3) the communication and computation complexity during the interaction
phase.
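As a baseline for these criteria, consider the trivial protocol that keeps the full characteristic vector of the set: it is deterministic (so it needs no shared randomness at all, and is correct against any adversary) and supports O(1) updates, but its “sketch” has linear size N, which is precisely what the model asks to beat. A minimal sketch of this baseline, with names of our own choosing:

```python
# Trivial, fully deterministic protocol in the adversarial sketch model:
# each party keeps the full characteristic vector (as a bitmask) of its
# current set. Perfectly correct, but the sketch has linear size N.

N = 64  # universe size |U| (toy value)

class Party:
    def __init__(self):
        self.bits = 0  # characteristic vector of the current set

    # Sketch phase: on-line insert/delete operations, O(1) update time.
    def insert(self, x): self.bits |= (1 << x)
    def delete(self, x): self.bits &= ~(1 << x)

    # Interaction phase: exchange sketches and compare.
    def equal_to(self, other_sketch): return self.bits == other_sketch

alice, bob = Party(), Party()
for x in [3, 9, 9, 41]: alice.insert(x)
alice.delete(9)
for x in [41, 3]: bob.insert(x)
print(alice.equal_to(bob.bits))  # True: both hold {3, 41}
```

The protocols of this paper achieve the same adversarial robustness with sketches of size roughly √K instead of N.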
The most natural question that arises in this setting is to characterize the class of functions that can be computed or approximated in this model with sublinear sketches and poly-logarithmic
1 In the data stream model, when dealing only with insertions, several deterministic algorithms are known, most notably those based on the notion of core-sets (see, for example, [2, 6]).
2 Typical examples include: (1) Plagiarism detection: two parties wish to compute some similarity measure between documents. In this case the inputs (i.e., the documents) are chosen by the assumed plagiarizer. (2) Traffic logs comparison: two internet routers wish to compare their recent traffic logs. The inputs of the routers can be influenced by any party that can send packets to the routers.
3 For concreteness we focus in this informal discussion on the simplest case where only two honest parties are participating in the computation. We note that the model naturally generalizes to any number of honest parties.
update time, communication and computation4. In the standard sketch model a large class of functions was shown to be computable or approximable with highly compressed sketches whose size is only poly-logarithmic in the size of the input. Therefore, one can ask the rather general question of whether the adversarial sketch model “preserves sublinearity and efficiency”. That is, informally:
Is any function, computable in the standard sketch model with highly compressed sketches
and poly-logarithmic update time, also computable in the adversarial sketch model with
sublinear sketches and poly-logarithmic update time, communication and computation?
1.1 Our Contributions
In this paper we study the two fundamental problems of testing whether two massive data sets are
equal, and approximating the size of their symmetric difference. For these problems we provide
an affirmative answer to the above question. We construct explicit and efficient protocols with
sketches of essentially optimal sublinear size, poly-logarithmic update time during the sketch phase,
and poly-logarithmic communication and computation during the interaction phase. We settle the
complexity, up to logarithmic factors, of these two problems in the adversarial sketch model.
Our main technical contribution, which serves as a building block of our protocols, is an explicit and deterministic encoding scheme that enjoys two seemingly conflicting properties: incrementality and high distance. That is, the encoding guarantees that (1) for any set S and element x the
encodings of the sets S∪{x} and S\{x} can be easily computed from the encoding of S by modifying
only a small number of entries, and (2) the encodings of any two distinct sets significantly differ with
respect to a carefully chosen weighted distance. In addition, the scheme enables efficient (linear
time) decoding. We believe that an encoding scheme with these properties can find additional
applications, and may be of independent interest. In what follows we formally state our results5.
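To give a feel for how incrementality can coexist with a useful notion of distance, the following toy encoding maps each element to d buckets of a fixed public bipartite graph and stores a counter and a sum per bucket, so inserts and deletes touch only d entries. We stress that this is only an illustration: the paper's construction uses an explicit disperser to guarantee high distance between encodings of distinct small sets, whereas the fixed pseudorandom graph below carries no such guarantee, and all parameters are ours.

```python
import random

# Toy stand-in for an incremental set encoding. Each element has a fixed,
# public set of D buckets; the encoding keeps, per bucket, a counter and a
# signed sum of the elements mapped there. Updates touch only D entries
# (incrementality). Distance is the number of differing buckets.

M, D = 32, 3  # number of buckets and degree (toy parameters)
rng = random.Random(0)
NEIGH = {x: rng.sample(range(M), D) for x in range(100)}  # fixed public graph

class Encoding:
    def __init__(self):
        self.count = [0] * M
        self.xsum = [0] * M

    def _update(self, x, sign):
        for b in NEIGH[x]:          # only D entries are modified
            self.count[b] += sign
            self.xsum[b] += sign * x

    def insert(self, x): self._update(x, +1)
    def delete(self, x): self._update(x, -1)

    def distance(self, other):
        return sum(1 for b in range(M)
                   if (self.count[b], self.xsum[b]) !=
                      (other.count[b], other.xsum[b]))

e1, e2 = Encoding(), Encoding()
for x in [5, 12, 70]: e1.insert(x)
for x in [5, 12, 71]: e2.insert(x)
print(e1.distance(e2))  # positive, and at most 2*D = 6
```

The (count, sum) pairs also allow a bucket holding a single element to be decoded directly, which hints at how efficient decoding can work; guaranteeing that two distinct small sets always disagree on many buckets is exactly where the disperser is needed.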
Equality testing. An equality testing protocol in the adversarial sketch model is parameterized by the size N of the universe of elements from which the sets are taken, and by an upper bound K on the size of the sets to be tested6. Our construction provides an explicit protocol, and in addition a non-constructive proof for the existence of a protocol that enjoys slightly better guarantees7. We prove the following theorem:
Theorem 1.1. In the adversarial sketch model, for every N, K and 0 < δ < 1 there exists a
protocol for testing the equality of two sets of size at most K taken from a universe of size N with
the following properties:
1. Perfect completeness: For any two sequences of insert and delete operations communicated to
the parties that lead to the same set of elements, the parties always output Equal.
4 Various relaxations may be interesting as well, for instance, allowing rather high communication complexity.
5 Our protocols have the property that, during the interaction phase, the amount of computation is linear in the amount of communication. Therefore, for simplicity, we omit the computation cost and only state the communication complexity.
6 We note that the upper bound K on the size of the sets only imposes a restriction on the size of the sets at the end of the sketch phase. During the sketch phase the parties should be able to deal with sets of arbitrary size, and nevertheless the size of the sketches refers to their maximal size during the sketch phase. A possible adversarial strategy, for example, is to insert all the N possible elements and then to delete N − K of them.
7 The poly-logarithmic gap between the non-explicit and the explicit parameters is due to a poly-logarithmic gap between the optimal and the known explicit constructions of dispersers (see, for example, [41]). Any improved explicit construction of dispersers will, in turn, improve our explicit protocols.
2. Soundness: For any two sequences of insert and delete operations communicated to the parties
that do not lead to the same set of elements, the parties output Not Equal with probability8
at least 1 − δ.
3. The size of the sketches, the update time during the sketch phase, and the communication
complexity during the interaction phase are described below in Table 1.
                    Non-explicit protocol                        Explicit protocol
Size of sketches    O(√(K · log N) · log(1/δ))                   √K · polylog(N) · log(1/δ)
Update time         O(log K · log N)                             polylog(N)
Communication       O((log²K + log K · log log N) · log(1/δ))    polylog(N) · log(1/δ)

Table 1: The non-explicit and explicit parameters of the equality testing protocol.
A rather straightforward reduction of computations in the private-coin simultaneous communication model to computations in the adversarial sketch model (see Section 2) implies that the size of the sketches in our protocols is essentially optimal (the following theorem is stated for protocols with constant error).
Theorem 1.2. Any equality testing protocol in the adversarial sketch model requires sketches of size Ω(√(K · log(N/K))).
Approximating the size of the symmetric difference. We construct a protocol that enables two parties to approximate the size of the symmetric difference between their two input sets determined during the sketch phase. We prove the following theorem:
Theorem 1.3. In the adversarial sketch model, for every N, K, 0 < δ < 1 and constant 0 < ρ ≤ 1,
there exists a protocol for approximating the size of the symmetric difference between two sets of
size at most K taken from a universe of size N with the following properties:
1. For any two sequences of insert and delete operations communicated to the parties that lead to sets with symmetric difference of size ∆_OPT, the parties output ∆_APX such that

Pr[∆_OPT ≤ ∆_APX ≤ (1 + ρ) · ∆_OPT] > 1 − δ .

2. Sketches of size O(√(K · log N) · (log log K + log(1/δ))).

3. Update time O(log K · log N).

4. Communication complexity O((log²K + log K · log log N) · (log log K + log(1/δ))).
As with the equality testing protocol, our construction provides an explicit protocol as well. The explicit protocol guarantees that Pr[∆_OPT ≤ ∆_APX ≤ polylog(N) · ∆_OPT] > 1 − δ, and the size of sketches, update time and communication complexity match those stated in Theorem 1.3 up to poly-logarithmic factors.
8 The probability is taken only over the internal coin tosses of the honest parties.
model. For example, the existence of incremental collision resistant hash functions [9, 10] implies
an equality testing protocol with highly compressed sketches which dramatically circumvents the
lower bound stated in Theorem 1.2 in the computational setting. A major drawback of existing
constructions of such hash functions is that they either rely on a random oracle, or are inefficient
(more specifically, the construction of Bellare, Goldreich and Goldwasser [9] can be proved secure
without a random oracle, but in this case the size of the description of each hash function is too
large to be used in practice – linear in the number of input blocks)16.
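For intuition, an incremental hash in the spirit of the Bellare-Micciancio paradigm [10] can be sketched as follows: the digest of a set is the sum of per-element hashes modulo a large M, so each insert or delete costs one hash evaluation and one modular addition. This is an illustrative stand-in (we use SHA-256 as the per-element hash; the security of the additive combination rests on the assumptions discussed in [10]), not a construction from the paper.

```python
import hashlib

# Incremental set hashing, AdHASH-style: digest(S) = sum of h(x) mod M.
# Insert adds h(x), delete subtracts it, so updates are O(1) hash evaluations
# and the digest is independent of the order of operations.

M = 1 << 256

def h(x):
    """Per-element hash (SHA-256 as an illustrative stand-in)."""
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")

class IncrementalHash:
    def __init__(self):
        self.digest = 0
    def insert(self, x): self.digest = (self.digest + h(x)) % M
    def delete(self, x): self.digest = (self.digest - h(x)) % M

a, b = IncrementalHash(), IncrementalHash()
for x in ["u", "v", "w"]: a.insert(x)
a.delete("v")
for x in ["w", "u"]: b.insert(x)
print(a.digest == b.digest)  # True: both hash the set {u, w}
```

In the computational setting such a digest serves as a highly compressed sketch for equality testing, which is how the lower bound of Theorem 1.2 is circumvented.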
In addition, for the problem of approximating the size of the symmetric difference we are not
aware of any protocol in the computational setting that similarly improves our protocol. It would
be very interesting to take advantage of computational assumptions and construct such a protocol
with highly compressed sketches.
Preserving sublinearity and efficiency without a shared random string. As discussed in Section 1, the most natural question that arises in the context of the adversarial sketch model is to characterize the class of functions that can be computed or approximated in this model with sublinear sketches and poly-logarithmic update time, communication and computation. In particular, we have asked whether the adversarial sketch model “preserves sublinearity and efficiency” of problems from the standard sketch model.
In this paper we provided an affirmative answer for the problems of testing whether two massive data sets are equal, and approximating the size of their symmetric difference. It would be interesting to consider other distances and similarity measures that can be efficiently approximated in the standard sketch model (see Section 1.2). For example, an intriguing measure, due to its application in eliminating near-duplicates of web pages, is the resemblance measure [14], defined as
r(S,T) = |S ∩ T| / |S ∪ T| .
There are highly compressed sketches for estimating the resemblance between two sets using a
collection of min-wise independent permutations [13]. It is not clear, however, that without shared
randomness this technique can result in sketches that can be updated in an efficient incremental
manner. It would be interesting to construct an efficient protocol for approximating resemblance
in the adversarial sketch model.
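For concreteness, the min-wise technique of [13] in the standard (shared-randomness) sketch model can be sketched as follows. For a random permutation π, Pr[min π(S) = min π(T)] = r(S,T), so the fraction of agreeing minima over many permutations estimates the resemblance. We substitute random linear hashes over a prime field for truly min-wise independent permutations, a common heuristic stand-in, so the example is illustrative rather than a faithful implementation.

```python
import random

# Min-hash estimate of the resemblance r(S,T) = |S ∩ T| / |S ∪ T|, using
# the shared randomness of the standard sketch model to pick the hashes.

P = 2_147_483_647  # prime larger than the universe size

def make_hashes(k, rng):
    """k random linear hashes x -> (a*x + b) mod P (stand-in permutations)."""
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def minhash(S, hashes):
    """The sketch: the minimum hash value of S under each hash."""
    return [min((a * x + b) % P for x in S) for (a, b) in hashes]

def estimate_resemblance(sig_s, sig_t):
    return sum(u == v for u, v in zip(sig_s, sig_t)) / len(sig_s)

rng = random.Random(1)
S = set(range(0, 800))     # |S ∩ T| = 600, |S ∪ T| = 1000,
T = set(range(200, 1000))  # so the true resemblance is 0.6
hashes = make_hashes(400, rng)
est = estimate_resemblance(minhash(S, hashes), minhash(T, hashes))
print(round(est, 2))  # typically close to 0.6
```

The difficulty raised above is that the hashes here are chosen with shared randomness independent of the inputs; it is not clear how to obtain an incrementally updatable analogue without that assumption.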
Compressed sensing: explicit reconstruction of sparse signals. Indyk [29] showed that any signal of length N with at most K non-zero entries can be compressed (and efficiently reconstructed) using a fixed set of K · 2^{O(log² log N)} non-adaptive linear measurements. Indyk’s construction of the set of measurements is explicit and is based on unbalanced bipartite graphs, similar to the ideas underlying our encoding scheme. However, his construction requires extractors and not dispersers, and this leads to the 2^{O(log² log N)} factor. It would be interesting to find explicit compressed sensing algorithms while relying on dispersers instead of extractors, and this may lead to a set of measurements of size K · polylog(N), which is optimal up to poly-logarithmic factors.
Acknowledgments
We thank Ziv Bar-Yossef, Robert Krauthgamer, Danny Segev and the anonymous referees for many
useful comments.
16 An additional construction that does not rely on random oracles can be based on the techniques of Gennaro, Halevi and Rabin [23] that require hash functions that output prime numbers, but this again does not result in an efficient construction.
References
[1] J. Abello, P. M. Pardalos, and M. G. C. Resende, editors. Handbook of Massive Data Sets.
Kluwer Academic Publishers, 2002.
[2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Geometric approximation via core sets.
Combinatorial and Computational Geometry - MSRI Publications, pages 1–30, 2005.
[3] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency
moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
[4] L. Babai and P. G. Kimmel. Randomized simultaneous messages: Solution of a problem of
Yao in communication complexity. In Proceedings of the 12th Annual IEEE Conference on
Computational Complexity, pages 239–246, 1997.
[5] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data
stream systems. In Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on
Principles of Database Systems, pages 1–16, 2002.
[6] M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. In Proceedings
on 34th Annual ACM Symposium on Theory of Computing, pages 250–257, 2002.
[7] Z. Bar-Yossef. The Complexity of Massive Data Set Computations. PhD thesis, University of
California at Berkeley, 2002.
[8] Z. Bar-Yossef, T. S. Jayram, R. Krauthgamer, and R. Kumar. Approximating edit distance
efficiently. In Proceedings of the 45th Symposium on Foundations of Computer Science, pages
550–559, 2004.
[9] M. Bellare, O. Goldreich, and S. Goldwasser. Incremental cryptography: The case of hashing
and signing. In Advances in Cryptology - CRYPTO ’94, pages 216–233, 1994.
[10] M. Bellare and D. Micciancio. A new paradigm for collision-free hashing: Incrementality at
reduced cost. In Advances in Cryptology - EUROCRYPT ’97, pages 163–192, 1997.
[11] M. Blum, W. S. Evans, P. Gemmell, S. Kannan, and M. Naor. Checking the correctness of
memories. Algorithmica, 12(2/3):225–244, 1994.
[12] D. Boneh and M. K. Franklin. An efficient public key traitor tracing scheme. In Advances in
Cryptology - CRYPTO ’99, pages 338–353, 1999.
[13] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent per-
mutations. Journal of Computer and System Sciences, 60(3):630–659, 2000.
[14] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web.
Computer Networks and ISDN Systems, 29(8-13):1157–1166, 1997.
[15] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal
encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
[16] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of
the 34th Annual ACM Symposium on Theory of Computing, pages 380–388, 2002.
[17] G. Cormode and S. Muthukrishnan. Combinatorial algorithms for compressed sensing. In Pro-
ceedings of the 13th International Colloquium on Structural Information and Communication
Complexity, pages 280–294, 2006.
[18] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–
1306, 2006.
[19] T. Feder, E. Kushilevitz, M. Naor, and N. Nisan. Amortized communication complexity. SIAM
Journal on Computing, 24(4):736–750, 1995.
[20] J. Feigenbaum, Y. Ishai, T. Malkin, K. Nissim, M. J. Strauss, and R. N. Wright. Secure
multiparty computation of approximations. ACM Transactions on Algorithms, 2(3):435–472,
2006.
[21] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate L1-difference
algorithm for massive data streams. SIAM Journal on Computing, 32(1):131–151, 2002.
[22] S. Ganguly and A. Majumder. Deterministic k-set structure. In Proceedings of the 25th ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 280–289,
2006.
[23] R. Gennaro, S. Halevi, and T. Rabin. Secure hash-and-sign signatures without the random
oracle. In Advances in Cryptology - EUROCRYPT ’99, pages 123–139, 1999.
[24] P. B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. In Proceedings
of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 909–910, 1999.
[25] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. One sketch for all: fast algorithms
for compressed sensing. In Proceedings of the 39th Annual ACM Symposium on Theory of
Computing, pages 237–246, 2007.
[26] O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and
approximation. Journal of the ACM, 45(4):653–750, 1998.
[27] M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. In External
memory algorithms, pages 107–118. American Mathematical Society, 1999.
[28] P. Indyk. Explicit constructions of selectors and related combinatorial structures, with appli-
cations. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms,
pages 697–704, 2002.
[29] P. Indyk. Explicit constructions for compressed sensing of sparse signals. In Proceedings of the
19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 30–33, 2008.
[30] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on the Theory of Computing, pages 604–613, 1998.
[31] P. Indyk and D. P. Woodruff. Polylogarithmic private approximations and efficient matching.
In Proceedings of the 3rd Theory of Cryptography Conference, pages 245–264, 2006.
[32] I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication complexity.
Computational Complexity, 8(1):21–49, 1999.
[33] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor
in high dimensional spaces. SIAM Journal on Computing, 30(2):457–474, 2000.
[34] T. Moran, M. Naor, and G. Segev. Deterministic history-independent strategies for storing
information on write-once memories. In Proceedings of the 34th International Colloquium on
Automata, Languages and Programming, pages 303–315, 2007.
[35] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in
Theoretical Computer Science, 1(2), 2005.
[36] J. Naor and M. Naor. Small-bias probability spaces: Efficient constructions and applications.
SIAM Journal on Computing, 22(4):838–856, 1993.
[37] I. Newman and M. Szegedy. Public vs. private coin flips in one round communication games. In
Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, pages 561–570,
1996.
[38] N. Nisan and A. Ta-Shma. Extracting randomness: A survey and new constructions. Journal
of Computer and System Sciences, 58(1):148–173, 1999.
[39] R. Rubinfeld and M. Sudan. Robust characterizations of polynomials with applications to
program testing. SIAM Journal on Computing, 25(2):252–271, 1996.
[40] M. Sipser. Expanders, randomness, or time versus space. Journal of Computer and System
Sciences, 36(3):379–383, 1988.
[41] A. Ta-Shma, C. Umans, and D. Zuckerman. Lossless condensers, unbalanced expanders, and
extractors. Combinatorica, 27(2):213–240, 2007.
[42] H. S. Witsenhausen and A. D. Wyner. Interframe coder for video signals. U.S. patent number
4,191,970, 1980.
[43] A. C. Yao. Some complexity questions related to distributive computing. In Proceedings of
the 11th Annual ACM Symposium on Theory of Computing, pages 209–213, 1979.