Sketching in Adversarial Environments
Ilya Mironov∗
Moni Naor†
Gil Segev‡
Abstract
We formalize a realistic model for computations over massive data sets. The model, referred
to as the adversarial sketch model, unifies the well-studied sketch and data stream models
together with a cryptographic flavor that considers the execution of protocols in “hostile
environments”, and provides a framework for studying the complexity of many tasks involving
massive data sets.
In the adversarial sketch model several parties are interested in computing a joint function in
the presence of an adversary that dynamically chooses their inputs. These inputs are provided to
the parties in an on-line manner, and each party incrementally updates a compressed sketch of
its input. The parties are not allowed to communicate, they do not share any secret information,
and any public information they share is known to the adversary in advance. Then, the parties
engage in a protocol in order to evaluate the function on their current inputs using only the
compressed sketches.
In this paper we settle the complexity of two fundamental problems in this model: testing
whether two massive data sets are equal, and approximating the size of their symmetric
difference. For these problems we construct explicit and efficient protocols that are optimal up to
poly-logarithmic factors. Our main technical contribution is an explicit and deterministic
encoding scheme that enjoys two seemingly conflicting properties: incrementality and high distance,
which may be of independent interest.
A preliminary version of this work appeared in Proceedings of the 40th Annual ACM Symposium on Theory of
Computing (STOC), pages 651-660, 2008.
∗Microsoft Research, Silicon Valley Campus, 1065 La Avenida, Mountain View, CA 94043. Email:
mironov@microsoft.com.
†Incumbent of the Judith Kleeman Professorial Chair, Department of Computer Science and Applied Mathematics,
Weizmann Institute of Science, Rehovot 76100, Israel. Email: moni.naor@weizmann.ac.il. Research supported in
part by a grant from the Israel Science Foundation. Part of the work was done at Microsoft Research.
‡Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100,
Israel. Email: gil.segev@weizmann.ac.il. Part of the work was done at Microsoft Research.
1 Introduction
The past two decades have witnessed striking technological breakthroughs in information collection
and storage capabilities. These breakthroughs allowed the emergence of enormous collections of
data, referred to as massive data sets, such as the World Wide Web, Internet traffic logs, financial
transactions, census data and many more. This state of affairs introduces new and exciting
challenges in analyzing massive data sets and extracting useful information.
From a computational point of view, most of the traditional computational models consider
settings in which the input data is easily and efficiently accessible. This is, however, usually not
the case when dealing with massive data sets. Such data sets may either be stored on highly
constrained devices or may only be accessed in an on-line manner without the ability to actually
store any significant fraction of the data. In recent years several computational models which
are suitable for computing over massive data sets have been developed, such as sketch and lossy
compression schemes [13, 24], data stream computations [3, 21, 27], and property testing [26, 39].
Motivated by the challenges posed by computational tasks involving massive data sets, and by
the existing approaches for modeling such tasks, we formalize a realistic model of computation which
we refer to as the adversarial sketch model. This model can be seen as unifying the standard sketch
model and the data stream model together with a cryptographic flavor that considers the execution
of protocols in “hostile environments”. The model under consideration provides a framework for
studying the complexity of many fundamental and realistic problems that arise in the context of
massive data sets. In what follows we briefly describe the standard sketch model and the data
stream model, as well as our approach for modeling computations in hostile environments in this
context.
The standard sketch model. In the standard sketch model the input is distributed among
several parties. Each party runs a compression procedure to obtain a compact “sketch” of its input,
and these sketches are then delivered to a referee. The referee has to compute (or to approximate)
the value of a pre-determined function applied to the inputs of the parties by using only the sketches
and not the actual inputs. The parties are not allowed to communicate with each other, but are
allowed to share a random reference string which is chosen independently of their inputs. This
string can be used, for example, to choose a random hash function that will be applied by each
party to obtain a compressed sketch of its input. This model fits many scenarios in which a massive
data set is partitioned and stored in a distributed manner in several locations. In each location a
compressed sketch of the stored data is computed, and then sent to a central processing unit that
uses only the sketches and not the actual data. The main performance criterion for protocols in
this model is the size of the sketches. We note that this model is essentially the public-coin variant
of the simultaneous communication model introduced by Yao [43].
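To make the role of the shared random string concrete, the following is a minimal sketch of a public-coin equality test, assuming a polynomial fingerprint over a prime field. The fingerprint, the modulus P, and the helper names shared_random_string, sketch and referee are illustrative assumptions and are not taken from the protocols discussed in this paper.

```python
import random

# Assumption: a Mersenne prime far larger than any element of the universe.
P = (1 << 61) - 1


def shared_random_string(seed):
    """Model the public reference string as one field element r in [2, P-1]."""
    return random.Random(seed).randrange(2, P)


def sketch(input_set, r):
    """Compress a set of non-negative integers into a single field element.

    The sketch is the polynomial sum of r**x over the elements x, modulo P.
    """
    return sum(pow(r, x, P) for x in input_set) % P


def referee(sketch_a, sketch_b):
    """The referee sees only the two sketches, never the inputs."""
    return "Equal" if sketch_a == sketch_b else "Not Equal"


r = shared_random_string(2008)
print(referee(sketch({1, 5, 9}, r), sketch({9, 5, 1}, r)))
print(referee(sketch({1, 5, 9}, r), sketch({1, 5, 10}, r)))
```

The error guarantee of such a test relies on r being chosen independently of the inputs, which is exactly the assumption of the standard sketch model.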
The data stream model. In the data stream model the input is received as a one-way stream.
Once an element from the stream has been processed it is discarded and cannot be retrieved unless
it is explicitly stored in memory, which is typically small relative to the size of the data stream.
The data stream model captures scenarios in which computations involve either massive data sets
that are stored on sequential magnetic devices (for which one-way access is the most efficient access
method), or on-line data which is continuously generated and not necessarily stored. The main
performance criteria for algorithms in this model are the amount of storage they consume and the
amount of time required for processing each element in the stream. For a more complete description
of this model, its variants and the main results we refer the reader to the surveys by Muthukrishnan
[35] and by Babcock et al. [5].
The adversarial factor. In the standard sketch model described above, it is assumed that the
parties share a random string, which is chosen independently of the inputs held by the parties¹.
In many real-life scenarios, however, it is not at all clear that such an assumption is valid. First,
since the parties are assumed not to communicate with each other, this enforces the introduction
of trust in a third party to set up the random string. In many situations such trust may not
be available, and if the shared string is set up in an adversarial manner there are usually no
guarantees on the behavior of the protocol. That is, there may be “bad” choices of the shared
string that cause the protocol to fail with very high probability. Second, even when a truly random
string is available, this string may be known to an adversary as well (and in advance), and serve as
a crucial tool in attacking the system. For example, an adversary may be able to set the inputs of
the parties after having seen the random string. Thus, when considering computations in a setting
where the inputs of the parties may be adversarially chosen, it is usually not justified to assume
independence between the shared random string and the inputs of the parties². For these reasons
we are interested in exploring the feasibility and efficiency of computations over massive data sets in
hostile environments. In such environments the honest parties do not share any secret information,
and any public information they share is known to the adversary in advance who may then set the
inputs of the parties. Protocols designed in such a model have significant security and robustness
benefits.
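The second concern can be illustrated with a small experiment. The code below is a hedged sketch, assuming a polynomial fingerprint whose public parameter r is known to the adversary before the inputs are chosen. The modulus P, the universe size N, and the function names are hypothetical, and the modulus is deliberately tiny so that the search finishes instantly.

```python
from itertools import combinations

P = 10007  # small prime modulus (illustrative only)
N = 1000   # universe size; elements are 0..N-1


def sketch(input_set, r):
    """Public-coin fingerprint: sum of r**x over the set, modulo P."""
    return sum(pow(r, x, P) for x in input_set) % P


def find_collision(r):
    """Adversary: given the public r, return two distinct sets with equal sketches.

    There are more 2-element subsets of the universe than sketch values,
    so by the pigeonhole principle the search always succeeds.
    """
    seen = {}
    for pair in combinations(range(N), 2):
        candidate = frozenset(pair)
        value = sketch(candidate, r)
        if value in seen:
            return seen[value], candidate
        seen[value] = candidate
    raise AssertionError("unreachable: more 2-subsets than sketch values")


public_r = 1234
set_a, set_b = find_collision(public_r)
print(sorted(set_a), sorted(set_b), sketch(set_a, public_r) == sketch(set_b, public_r))
```

Once the inputs may depend on the public string, the referee in this toy scheme declares two different sets equal with certainty, so the usual probabilistic guarantee no longer applies.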
Sketching in adversarial environments. We consider a model with three participating parties:
two honest parties, Alice and Bob, and an adversarial party³. Computation in this model proceeds
in two phases. In the first phase, referred to as the sketch phase, the adversarial party chooses
the inputs of Alice and Bob. These inputs are sets of elements taken from a large universe U, and
provided to the honest parties in an on-line manner in the form of a sequence of insert and delete
operations. Once an operation from the sequence has been processed it is discarded and cannot be
retrieved unless explicitly stored. This phase defines the input sets $S_A \subseteq U$ and $S_B \subseteq U$ of Alice
and Bob, respectively. During this phase the honest parties are completely isolated in the sense
that (1) they are not allowed to communicate with each other, and (2) the sequence of operations
communicated to each party is hidden from the other party. In addition, we assume that the honest
parties do not share any secret information, and that any public information they share is known
to the adversary in advance. In the second phase, referred to as the interaction phase, Alice and
Bob engage in a protocol in order to compute (or approximate) a pre-determined function of their
input sets.
When designing protocols in the adversarial sketch model we are mainly interested in the
following performance criteria: (1) the amount of storage (i.e., the size of the sketches), (2) the
update time during the sketch phase (i.e., the time required for processing each of the insert and
delete operations), and (3) the communication and computation complexity during the interaction
phase.
The most natural question that arises in this setting is to characterize the class of functions
that can be computed or approximated in this model with sublinear sketches and poly-logarithmic
¹ In the data stream model, when dealing only with insertions, several deterministic algorithms are known, most
notably those based on the notion of core-sets (see, for example, [2, 6]).
² Typical examples include: (1) Plagiarism detection: two parties wish to compute some similarity measure
between documents. In this case the inputs (i.e., the documents) are chosen by the assumed plagiarizer. (2) Traffic
logs comparison: two internet routers wish to compare their recent traffic logs. The inputs of the routers can be
influenced by any party that can send packets to the routers.
³ For concreteness we focus in this informal discussion on the simplest case where only two honest parties are
participating in the computation. We note that the model naturally generalizes to any number of honest parties.
update time, communication and computation⁴. In the standard sketch model a large class of
functions was shown to be computed or approximated with highly compressed sketches whose size
is only poly-logarithmic in the size of the input. Therefore, one can ask the rather general question
of whether the adversarial sketch model “preserves sublinearity and efficiency”. That is, informally:
Is any function, computable in the standard sketch model with highly compressed sketches
and poly-logarithmic update time, also computable in the adversarial sketch model with
sublinear sketches and poly-logarithmic update time, communication and computation?
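If the sublinearity requirement is dropped, the model admits a simple baseline, which helps to isolate what the question above is asking. The code below is an assumption-laden illustration and not a protocol from this paper: each party stores its entire set during the sketch phase, and randomness is drawn privately only in the interaction phase, after the adversary has fixed both inputs. The class Party, the function interaction_phase, and the modulus P are hypothetical names.

```python
import secrets

# Assumption: a Mersenne prime far larger than any element of the universe.
P = (1 << 61) - 1


class Party:
    """Honest party whose 'sketch' is its whole current set (linear storage)."""

    def __init__(self):
        self.current = set()

    def update(self, op, x):
        # Sketch phase: process one operation; the operation itself is not kept.
        if op == "insert":
            self.current.add(x)
        elif op == "delete":
            self.current.discard(x)
        else:
            raise ValueError(f"unknown operation: {op}")

    def fingerprint(self, r):
        # Polynomial fingerprint of the current set at the point r.
        return sum(pow(r, x, P) for x in self.current) % P


def interaction_phase(alice, bob):
    # Alice draws r in [2, P-1] only now, so the inputs cannot depend on it.
    r = secrets.randbelow(P - 2) + 2
    message = (r, alice.fingerprint(r))  # two field elements sent to Bob
    return "Equal" if bob.fingerprint(message[0]) == message[1] else "Not Equal"


alice, bob = Party(), Party()
for op, x in [("insert", 3), ("insert", 7), ("delete", 3), ("insert", 3)]:
    alice.update(op, x)
for op, x in [("insert", 7), ("insert", 3)]:
    bob.update(op, x)
print(interaction_phase(alice, bob))
```

In this baseline equal sets are always accepted, and communication is two field elements, but storage is linear in the set size. The difficulty addressed in the adversarial sketch model is obtaining comparable guarantees when the parties may keep only compressed sketches.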
1.1 Our Contributions
In this paper we study the two fundamental problems of testing whether two massive data sets are
equal, and approximating the size of their symmetric difference. For these problems we provide
an affirmative answer to the above question. We construct explicit and efficient protocols with
sketches of essentially optimal sublinear size, poly-logarithmic update time during the sketch phase,
and poly-logarithmic communication and computation during the interaction phase. We settle the
complexity, up to logarithmic factors, of these two problems in the adversarial sketch model.
Our main technical contribution, that serves as a building block of our protocols, is an explicit
and deterministic encoding scheme that enjoys two seemingly conflicting properties: incrementality
and high distance. That is, the encoding guarantees that (1) for any set S and element x the
encodings of the sets S∪{x} and S\{x} can be easily computed from the encoding of S by modifying
only a small number of entries, and (2) the encodings of any two distinct sets significantly differ with
respect to a carefully chosen weighted distance. In addition, the scheme enables efficient (linear
time) decoding. We believe that an encoding scheme with these properties can find additional
applications, and may be of independent interest. In what follows we formally state our results⁵.
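The incrementality property can be illustrated with a toy encoding. The code below is only a sketch under stated assumptions and is not the encoding scheme constructed in this paper: it does not provide the high-distance property, so distinct sets may receive the same encoding. The length M, the constants in MULTIPLIERS, and the function names are hypothetical, and the operation sequence is assumed to be well formed (an element is inserted only when absent and deleted only when present).

```python
M = 64                   # encoding length (hypothetical)
MULTIPLIERS = (3, 5, 7)  # fixed public constants, so the map is deterministic


def positions(x):
    """Entries of the encoding that depend on element x (at most 3 of them)."""
    return {(a * x + a * a) % M for a in MULTIPLIERS}


def empty_encoding():
    return [0] * M


def toggle(encoding, x):
    """Insert or delete x in place by flipping only the entries in positions(x).

    Returns the number of modified entries, which never exceeds len(MULTIPLIERS).
    """
    touched = positions(x)
    for j in touched:
        encoding[j] ^= 1
    return len(touched)


# Two different operation sequences that lead to the same set {20, 30}.
first = empty_encoding()
for x in (10, 20, 10, 30):   # insert 10, insert 20, delete 10, insert 30
    toggle(first, x)
second = empty_encoding()
for x in (30, 20):           # insert 30, insert 20
    toggle(second, x)
print(first == second)
```

Because each update flips a fixed small set of entries, the encoding depends only on the final set and not on the order of operations. Combining this kind of cheap update with a guarantee that distinct sets are far apart is the part that requires the construction described in this paper.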
Equality testing. An equality testing protocol in the adversarial sketch model is parameterized
by the size N of the universe of elements from which the sets are taken, and by an upper bound K
on the size of the sets to be tested⁶. Our construction provides an explicit protocol, and in addition
a non-constructive proof for the existence of a protocol that enjoys slightly better guarantees⁷. We
prove the following theorem:
Theorem 1.1. In the adversarial sketch model, for every N, K and 0 < δ < 1 there exists a
protocol for testing the equality of two sets of size at most K taken from a universe of size N with
the following properties:
1. Perfect completeness: For any two sequences of insert and delete operations communicated to
the parties that lead to the same set of elements, the parties always output Equal.
⁴ Various relaxations may be interesting as well, for instance, allowing rather high communication complexity.
⁵ Our protocols have the property that, during the interaction phase, the amount of computation is linear in the
amount of communication. Therefore, for simplicity, we omit the computation cost and only state the communication
complexity.
⁶ We note that the upper bound K on the size of the sets only imposes a restriction on the size of the sets at
the end of the sketch phase. During the sketch phase the parties should be able to deal with sets of arbitrary size,
and nevertheless the size of the sketches refers to their maximal size during the sketch phase. A possible adversarial
strategy, for example, is to insert all the N possible elements and then to delete N − K of them.
⁷ The poly-logarithmic gap between the non-explicit and the explicit parameters is due to a poly-logarithmic gap
between the optimal and the known explicit constructions of dispersers (see, for example, [41]). Any improved explicit
construction of dispersers will, in turn, improve our explicit protocols.
2. Soundness: For any two sequences of insert and delete operations communicated to the parties
that do not lead to the same set of elements, the parties output Not Equal with probability⁸
at least 1 − δ.
3. The size of the sketches, the update time during the sketch phase, and the communication
complexity during the interaction phase are described below in Table 1.
                    Non-explicit protocol                                                      Explicit protocol
Size of sketches    $O\left(\sqrt{K \cdot \log N \cdot \log(1/\delta)}\right)$                  $\sqrt{K \cdot \mathrm{polylog}(N) \cdot \log(1/\delta)}$
Update time         $O(\log K \cdot \log N)$                                                   $\mathrm{polylog}(N)$
Communication       $O\left((\log^2 K + \log K \cdot \log\log N) \cdot \log(1/\delta)\right)$   $\mathrm{polylog}(N) \cdot \log(1/\delta)$
Table 1: The non-explicit and explicit parameters of the equality testing protocol.
A rather straightforward reduction of computations in the private-coin simultaneous communication
model to computations in the adversarial sketch model (see Section 2) implies that the size
of the sketches in our protocols is essentially optimal (the following theorem is stated for protocols
with constant error).
Theorem 1.2. Any equality testing protocol in the adversarial sketch model requires sketches of
size $\Omega\left(\sqrt{K \cdot \log(N/K)}\right)$.
Approximating the size of the symmetric difference. We construct a protocol that enables
two parties to approximate the size of the symmetric difference between their two input sets
determined during the sketch phase. We prove the following theorem:
Theorem 1.3. In the adversarial sketch model, for every N, K, 0 < δ < 1 and constant 0 < ρ ≤ 1,
there exists a protocol for approximating the size of the symmetric difference between two sets of
size at most K taken from a universe of size N with the following properties:
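1. For any two sequences of insert and delete operations communicated to the parties that lead
to sets with symmetric difference of size $\Delta_{\mathrm{OPT}}$, the parties output $\Delta_{\mathrm{APX}}$ such that
$\Pr[\Delta_{\mathrm{OPT}} \le \Delta_{\mathrm{APX}} \le (1 + \rho)\Delta_{\mathrm{OPT}}] > 1 - \delta$.
2. Sketches of size $O\left(\sqrt{K \cdot \log N \cdot (\log\log K + \log(1/\delta))}\right)$.
3. Update time $O(\log K \cdot \log N)$.
4. Communication complexity $O\left((\log^2 K + \log K \cdot \log\log N) \cdot (\log\log K + \log(1/\delta))\right)$.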
As with the equality testing protocol, our construction provides an explicit protocol as well.
The explicit protocol guarantees that $\Pr[\Delta_{\mathrm{OPT}} \le \Delta_{\mathrm{APX}} \le \mathrm{polylog}(N) \cdot \Delta_{\mathrm{OPT}}] > 1 - \delta$, and the size of
sketches, update time and communication complexity match those stated in Theorem 1.3 up to
poly-logarithmic factors.
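The approximation guarantee of Theorem 1.3 is one-sided: the output may overestimate the true size of the symmetric difference by a factor of at most 1 + ρ, but may not underestimate it. The following is a small checker for that condition on explicit sets, given only as an illustration. The function names are hypothetical, and the checker computes the exact value directly from the sets, which the parties in the model cannot do.

```python
def symmetric_difference_size(set_a, set_b):
    """Exact size of the symmetric difference of the two input sets."""
    return len(set_a ^ set_b)


def satisfies_guarantee(delta_apx, set_a, set_b, rho):
    """Check that delta_opt <= delta_apx <= (1 + rho) * delta_opt."""
    delta_opt = symmetric_difference_size(set_a, set_b)
    return delta_opt <= delta_apx <= (1 + rho) * delta_opt


set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6, 7, 8}
# The symmetric difference is {1, 2, 5, 6, 7, 8}, so delta_opt = 6.
print(satisfies_guarantee(8, set_a, set_b, rho=0.5))   # within [6, 9]
print(satisfies_guarantee(5, set_a, set_b, rho=0.5))   # underestimate
```

With ρ = 0.5 the acceptable outputs for this pair of sets are the values between 6 and 9, so an output of 8 satisfies the condition while 5 and 10 do not.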
⁸ The probability is taken only over the internal coin tosses of the honest parties.