
Sketching in Adversarial Environments

Ilya Mironov∗

Moni Naor†

Gil Segev‡

Abstract

We formalize a realistic model for computations over massive data sets. The model, referred

to as the adversarial sketch model, unifies the well-studied sketch and data stream models

together with a cryptographic flavor that considers the execution of protocols in “hostile

environments”, and provides a framework for studying the complexity of many tasks involving

massive data sets.

In the adversarial sketch model several parties are interested in computing a joint function in

the presence of an adversary that dynamically chooses their inputs. These inputs are provided to

the parties in an on-line manner, and each party incrementally updates a compressed sketch of

its input. The parties are not allowed to communicate, they do not share any secret information,

and any public information they share is known to the adversary in advance. Then, the parties

engage in a protocol in order to evaluate the function on their current inputs using only the

compressed sketches.

In this paper we settle the complexity of two fundamental problems in this model: testing

whether two massive data sets are equal, and approximating the size of their symmetric

difference. For these problems we construct explicit and efficient protocols that are optimal up to

poly-logarithmic factors. Our main technical contribution is an explicit and deterministic

encoding scheme that enjoys two seemingly conflicting properties: incrementality and high distance,

which may be of independent interest.

A preliminary version of this work appeared in Proceedings of the 40th Annual ACM Symposium on Theory of

Computing (STOC), pages 651-660, 2008.

∗Microsoft Research, Silicon Valley Campus, 1065 La Avenida, Mountain View, CA 94043. Email:

mironov@microsoft.com.

†Incumbent of the Judith Kleeman Professorial Chair, Department of Computer Science and Applied Mathematics,

Weizmann Institute of Science, Rehovot 76100, Israel. Email: moni.naor@weizmann.ac.il. Research supported in

part by a grant from the Israel Science Foundation. Part of the work was done at Microsoft Research.

‡Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100,

Israel. Email: gil.segev@weizmann.ac.il. Part of the work was done at Microsoft Research.


1 Introduction

The past two decades have witnessed striking technological breakthroughs in information collection

and storage capabilities. These breakthroughs allowed the emergence of enormous collections of

data, referred to as massive data sets, such as the World Wide Web, Internet traffic logs,

financial transactions, census data and many more. This state of affairs introduces new and exciting

challenges in analyzing massive data sets and extracting useful information.

From a computational point of view, most of the traditional computational models consider

settings in which the input data is easily and efficiently accessible. This is, however, usually not

the case when dealing with massive data sets. Such data sets may either be stored on highly

constrained devices or may only be accessed in an on-line manner without the ability to actually

store any significant fraction of the data. In recent years several computational models which

are suitable for computing over massive data sets have been developed, such as sketch and lossy

compression schemes [13, 24], data stream computations [3, 21, 27], and property testing [26, 39].

Motivated by the challenges posed by computational tasks involving massive data sets, and by

the existing approaches for modeling such tasks, we formalize a realistic model of computation which

we refer to as the adversarial sketch model. This model can be seen as unifying the standard sketch

model and the data stream model together with a cryptographic flavor that considers the execution

of protocols in “hostile environments”. The model under consideration provides a framework for

studying the complexity of many fundamental and realistic problems that arise in the context of

massive data sets. In what follows we briefly describe the standard sketch model and the data

stream model, as well as our approach for modeling computations in hostile environments in this

context.

The standard sketch model. In the standard sketch model the input is distributed among
several parties. Each party runs a compression procedure to obtain a compact “sketch” of its input,
and these sketches are then delivered to a referee. The referee has to compute (or to approximate)
the value of a pre-determined function applied to the inputs of the parties by using only the sketches
and not the actual inputs. The parties are not allowed to communicate with each other, but are
allowed to share a random reference string which is chosen independently of their inputs. This
string can be used, for example, to choose a random hash function that will be applied by each
party to obtain a compressed sketch of its input. This model fits many scenarios in which a massive
data set is partitioned and stored in a distributed manner in several locations. In each location a
compressed sketch of the stored data is computed, and then sent to a central processing unit that
uses only the sketches and not the actual data. The main performance criterion for protocols in
this model is the size of the sketches. We note that this model is essentially the public-coin variant
of the simultaneous communication model introduced by Yao [43].
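As an illustration of how a shared random string can yield a compressed sketch, the following is a minimal sketch of equality testing in the standard sketch model. It is not taken from the paper: the polynomial fingerprint, the prime `P`, and the names `sketch`, `referee` and `shared_r` are assumptions made for this example. Each party compresses its set to a single field element by evaluating the polynomial whose roots are the set's elements at the shared random point, and the referee compares the two values. For two distinct sets of size at most K the values collide for at most K choices of the point, so the error is at most K/P, provided the point is chosen independently of the inputs.

```python
import random

# A Mersenne prime used as the field size (an assumption of this example);
# universe elements are assumed to lie in [0, P).
P = (1 << 61) - 1

def sketch(elements, r):
    """Compress a set to one field element: the product of (r - x) mod P."""
    acc = 1
    for x in set(elements):
        acc = acc * ((r - x) % P) % P
    return acc

def referee(sketch_a, sketch_b):
    """The referee sees only the two sketches, never the inputs."""
    return "Equal" if sketch_a == sketch_b else "Not Equal"

# The shared reference string: one random evaluation point, chosen
# independently of the inputs (seeded here only to make the run repeatable).
shared_r = random.Random(2008).randrange(P)

same = referee(sketch([3, 17, 42], shared_r), sketch([42, 3, 17], shared_r))
diff = referee(sketch([3, 17, 42], shared_r), sketch([3, 17, 43], shared_r))
print(same, diff)
```

The sketch has constant size regardless of the size of the sets, which is what makes the independence assumption on the shared string so valuable.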

The data stream model. In the data stream model the input is received as a one-way stream.
Once an element from the stream has been processed it is discarded and cannot be retrieved unless
it is explicitly stored in memory, which is typically small relative to the size of the data stream.
The data stream model captures scenarios in which computations involve either massive data sets
that are stored on sequential magnetic devices (for which one-way access is the most efficient access
method), or on-line data which is continuously generated and not necessarily stored. The main
performance criteria for algorithms in this model are the amount of storage they consume and the
amount of time required for processing each element in the stream. For a more complete description
of this model, its variants and the main results we refer the reader to the surveys by Muthukrishnan
[35] and by Babcock et al. [5].
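To make the one-pass, small-memory constraint concrete, here is a minimal example of a data stream algorithm. It uses the classical Misra-Gries frequent-items summary, which is a standard technique and not part of this paper; the function name `misra_gries` and the parameter `k` are choices made for this example. The algorithm is deterministic, handles insertions only, and stores at most k counters no matter how long the stream is. Any item that occurs more than n/(k+1) times in a stream of length n is guaranteed to hold a counter at the end.

```python
def misra_gries(stream, k):
    """One pass over `stream` using at most k counters."""
    counters = {}
    for item in stream:  # each item is seen once and then discarded
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# The stream is consumed through an iterator, so it cannot be revisited.
heavy = misra_gries(iter([1, 2, 1, 3, 1, 4, 1, 5, 1]), k=2)
print(heavy)
```

The two performance criteria named above are visible directly: the storage is the dictionary of at most k counters, and the per-element time is the body of the loop.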


The adversarial factor. In the standard sketch model described above, it is assumed that the
parties share a random string, which is chosen independently of the inputs held by the parties¹.
In many real-life scenarios, however, it is not at all clear that such an assumption is valid. First,
since the parties are assumed not to communicate with each other, this enforces the introduction
of trust in a third party to set up the random string. In many situations such trust may not
be available, and if the shared string is set up in an adversarial manner there are usually no
guarantees on the behavior of the protocol. That is, there may be “bad” choices of the shared
string that cause the protocol to fail with very high probability. Second, even when a truly random
string is available, this string may be known to an adversary as well (and in advance), and serve as
a crucial tool in attacking the system. For example, an adversary may be able to set the inputs of
the parties after having seen the random string. Thus, when considering computations in a setting
where the inputs of the parties may be adversarially chosen, it is usually not justified to assume
independence between the shared random string and the inputs of the parties². For these reasons
we are interested in exploring the feasibility and efficiency of computations over massive data sets in
hostile environments. In such environments the honest parties do not share any secret information,
and any public information they share is known to the adversary in advance who may then set the
inputs of the parties. Protocols designed in such a model have significant security and robustness
benefits.
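The second concern can be made concrete with a small example. It is a sketch under stated assumptions, not an attack described in the paper: the fingerprint `sketch`, the prime `P`, and the helper `adversarial_inputs` are hypothetical names introduced here. The parties compress a set by evaluating the polynomial with the set's elements as roots at a public point r. If the adversary sees r before choosing the inputs, it can place r itself in both sets, which forces both sketches to zero even though the sets differ.

```python
# Field size for the fingerprint (an assumption of this example).
P = (1 << 61) - 1

def sketch(elements, r):
    """Product of (r - x) mod P over the elements of the set."""
    acc = 1
    for x in set(elements):
        acc = acc * ((r - x) % P) % P
    return acc

def adversarial_inputs(r):
    """Given the public point r, return two different sets with equal sketches."""
    # Including r itself puts a zero factor in both products.
    return {r, 1}, {r, 2}

public_r = 123456789  # any value works once it is known to the adversary
set_a, set_b = adversarial_inputs(public_r)
collision = sketch(set_a, public_r) == sketch(set_b, public_r)
print(set_a != set_b, collision)
```

The same fingerprint errs with probability at most K/P when r is independent of the inputs, so the failure here comes entirely from the adversary's advance knowledge of the public string.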

Sketching in adversarial environments. We consider a model with three participating parties:
two honest parties, Alice and Bob, and an adversarial party³. Computation in this model proceeds
in two phases. In the first phase, referred to as the sketch phase, the adversarial party chooses
the inputs of Alice and Bob. These inputs are sets of elements taken from a large universe U, and
provided to the honest parties in an on-line manner in the form of a sequence of insert and delete
operations. Once an operation from the sequence has been processed it is discarded and cannot be
retrieved unless explicitly stored. This phase defines the input sets $S_A \subseteq U$ and $S_B \subseteq U$ of Alice
and Bob, respectively. During this phase the honest parties are completely isolated in the sense
that (1) they are not allowed to communicate with each other, and (2) the sequence of operations
communicated to each party is hidden from the other party. In addition, we assume that the honest
parties do not share any secret information, and that any public information they share is known
to the adversary in advance. In the second phase, referred to as the interaction phase, Alice and
Bob engage in a protocol in order to compute (or approximate) a pre-determined function of their
input sets.

When designing protocols in the adversarial sketch model we are mainly interested in the
following performance criteria: (1) the amount of storage (i.e., the size of the sketches), (2) the
update time during the sketch phase (i.e., the time required for processing each of the insert and
delete operations), and (3) the communication and computation complexity during the interaction
phase.

¹ In the data stream model, when dealing only with insertions, several deterministic algorithms are known, most
notably those based on the notion of core-sets (see, for example, [2, 6]).

² Typical examples include: (1) Plagiarism detection: two parties wish to compute some similarity measure
between documents. In this case the inputs (i.e., the documents) are chosen by the assumed plagiarizer. (2) Traffic
logs comparison: two internet routers wish to compare their recent traffic logs. The inputs of the routers can be
influenced by any party that can send packets to the routers.

³ For concreteness we focus in this informal discussion on the simplest case where only two honest parties are
participating in the computation. We note that the model naturally generalizes to any number of honest parties.

The most natural question that arises in this setting is to characterize the class of functions
that can be computed or approximated in this model with sublinear sketches and poly-logarithmic
update time, communication and computation⁴. In the standard sketch model a large class of

functions was shown to be computed or approximated with highly compressed sketches whose size

is only poly-logarithmic in the size of the input. Therefore, one can ask the rather general question

of whether the adversarial sketch model “preserves sublinearity and efficiency”. That is, informally:

Is any function, computable in the standard sketch model with highly compressed sketches

and poly-logarithmic update time, also computable in the adversarial sketch model with

sublinear sketches and poly-logarithmic update time, communication and computation?
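The two phases of the adversarial sketch model can be sketched in code with a trivial baseline for equality testing. This is not the paper's protocol: here each party's "sketch" is its entire set, so the storage is linear rather than sublinear, and all names (`Party`, `interaction_phase`, `run`, the prime `P`) are assumptions of this example. The baseline only shows the order of events. The adversary fixes both operation sequences first, and Alice tosses her private coins only in the interaction phase, so the evaluation point is independent of the inputs and no public randomness is needed.

```python
import random

# Field size for the fingerprint (an assumption of this example);
# universe elements are assumed to lie in [0, P).
P = (1 << 61) - 1

class Party:
    """An honest party. The 'sketch' here is the whole set (no compression)."""

    def __init__(self):
        self.state = set()

    def insert(self, x):  # sketch phase: one operation at a time
        self.state.add(x)

    def delete(self, x):
        self.state.discard(x)

    def fingerprint(self, r):
        acc = 1
        for x in self.state:
            acc = acc * ((r - x) % P) % P
        return acc

def interaction_phase(alice, bob, coins):
    # Alice's private coins are tossed only now, after the adversary has
    # committed to both operation sequences.
    r = coins.randrange(P)
    message = (r, alice.fingerprint(r))  # the single message Alice -> Bob
    return "Equal" if bob.fingerprint(message[0]) == message[1] else "Not Equal"

def run(ops_a, ops_b, seed=0):
    alice, bob = Party(), Party()
    for party, ops in ((alice, ops_a), (bob, ops_b)):
        for op, x in ops:
            getattr(party, op)(x)  # dispatch to insert or delete
    return interaction_phase(alice, bob, random.Random(seed))

ops_a = [("insert", 5), ("insert", 9), ("delete", 5)]
ops_b = [("insert", 9)]
ops_c = [("insert", 9), ("insert", 7)]
print(run(ops_a, ops_b), run(ops_a, ops_c))
```

Different operation sequences that lead to the same set always produce Equal, and distinct sets of size at most K are separated except with probability at most K/P over Alice's coins. The communication is two field elements, so the whole difficulty of the model lies in reducing the storage below linear while keeping updates fast.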

1.1 Our Contributions

In this paper we study the two fundamental problems of testing whether two massive data sets are

equal, and approximating the size of their symmetric difference. For these problems we provide

an affirmative answer to the above question. We construct explicit and efficient protocols with

sketches of essentially optimal sublinear size, poly-logarithmic update time during the sketch phase,

and poly-logarithmic communication and computation during the interaction phase. We settle the

complexity, up to logarithmic factors, of these two problems in the adversarial sketch model.

Our main technical contribution, that serves as a building block of our protocols, is an explicit

and deterministic encoding scheme that enjoys two seemingly conflicting properties: incrementality

and high distance. That is, the encoding guarantees that (1) for any set S and element x the

encodings of the sets S∪{x} and S\{x} can be easily computed from the encoding of S by modifying

only a small number of entries, and (2) the encodings of any two distinct sets significantly differ with

respect to a carefully chosen weighted distance. In addition, the scheme enables efficient (linear

time) decoding. We believe that an encoding scheme with these properties can find additional

applications, and may be of independent interest. In what follows we formally state our results⁵.
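The tension between incrementality and high distance can be seen in a toy encoding, which is an illustration only and not the scheme constructed in the paper. The names `encode` and `entries_changed` are introduced for this example. The characteristic vector of a set is perfectly incremental, since inserting or deleting one element modifies exactly one entry. For the same reason it has the worst possible distance: the encodings of S and S∪{x} differ in a single position.

```python
def encode(subset, universe_size):
    """Characteristic vector: entry i is 1 exactly when i is in the set."""
    return [1 if i in subset else 0 for i in range(universe_size)]

def entries_changed(code_a, code_b):
    """Hamming distance between two encodings of equal length."""
    return sum(a != b for a, b in zip(code_a, code_b))

N = 16
S = {2, 5, 11}
before = encode(S, N)
after = encode(S | {7}, N)  # the update S -> S ∪ {x} with x = 7
touched = entries_changed(before, after)
print(touched)
```

Conversely, an encoding in which any two distinct sets differ in many positions must change many of those positions between S and S∪{x} if the distance is plain Hamming distance. The scheme described above measures distance with a weighted metric instead, which is what allows a small number of modified entries to coexist with a large distance.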

Equality testing. An equality testing protocol in the adversarial sketch model is parameterized
by the size N of the universe of elements from which the sets are taken, and by an upper bound K
on the size of the sets to be tested⁶. Our construction provides an explicit protocol, and in addition
a non-constructive proof for the existence of a protocol that enjoys slightly better guarantees⁷. We
prove the following theorem:

Theorem 1.1. In the adversarial sketch model, for every N, K and 0 < δ < 1 there exists a

protocol for testing the equality of two sets of size at most K taken from a universe of size N with

the following properties:

1. Perfect completeness: For any two sequences of insert and delete operations communicated to

the parties that lead to the same set of elements, the parties always output Equal.

⁴ Various relaxations may be interesting as well, for instance, allowing rather high communication complexity.

⁵ Our protocols have the property that, during the interaction phase, the amount of computation is linear in the
amount of communication. Therefore, for simplicity, we omit the computation cost and only state the communication
complexity.

⁶ We note that the upper bound K on the size of the sets only imposes a restriction on the size of the sets at
the end of the sketch phase. During the sketch phase the parties should be able to deal with sets of arbitrary size,
and nevertheless the size of the sketches refers to their maximal size during the sketch phase. A possible adversarial
strategy, for example, is to insert all the N possible elements and then to delete N − K of them.

⁷ The poly-logarithmic gap between the non-explicit and the explicit parameters is due to a poly-logarithmic gap
between the optimal and the known explicit constructions of dispersers (see, for example, [41]). Any improved explicit
construction of dispersers will, in turn, improve our explicit protocols.


2. Soundness: For any two sequences of insert and delete operations communicated to the parties

that do not lead to the same set of elements, the parties output Not Equal with probability⁸

at least 1 − δ.

3. The size of the sketches, the update time during the sketch phase, and the communication

complexity during the interaction phase are described below in Table 1.

                    Non-explicit protocol                                                        Explicit protocol

Size of sketches    $O(\sqrt{K \cdot \log N \cdot \log(1/\delta)})$                              $\sqrt{K \cdot \mathrm{polylog}(N) \cdot \log(1/\delta)}$

Update time         $O(\log K \cdot \log N)$                                                     $\mathrm{polylog}(N)$

Communication       $O((\log^2 K + \log K \cdot \log\log N) \cdot \log(1/\delta))$               $\mathrm{polylog}(N) \cdot \log(1/\delta)$

Table 1: The non-explicit and explicit parameters of the equality testing protocol.

A rather straightforward reduction of computations in the private-coin simultaneous communication

model to computations in the adversarial sketch model (see Section 2) implies that the size

of the sketches in our protocols is essentially optimal (the following theorem is stated for protocols

with constant error).

Theorem 1.2. Any equality testing protocol in the adversarial sketch model requires sketches of

size $\Omega(\sqrt{K \cdot \log(N/K)})$.
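To get a feel for the scale of this bound, the following worked computation reads it as $\Omega(\sqrt{K \cdot \log(N/K)})$ and compares it with the number of bits needed to store a set of size K exactly. The parameter values and the function names are hypothetical, chosen only for illustration, and the constant hidden by the Ω notation is ignored.

```python
import math

def lower_bound_bits(n, k):
    """sqrt(K * log2(N/K)), ignoring the constant hidden by Omega."""
    return math.sqrt(k * math.log2(n / k))

def exact_storage_bits(n, k):
    """Bits needed to store a K-subset of an N-element universe exactly."""
    return math.log2(math.comb(n, k))

# Hypothetical parameters: a universe of 2^32 elements, sets of size 2^10.
N, K = 2**32, 2**10
lb = lower_bound_bits(N, K)
exact = exact_storage_bits(N, K)
print(round(lb), round(exact))
```

For these parameters the lower bound is about 150 bits, while exact storage needs more than 22,000 bits. The bound therefore leaves room for sketches that are roughly the square root of the input size, which is the regime the protocols above achieve up to poly-logarithmic factors.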

Approximating the size of the symmetric difference. We construct a protocol that enables
two parties to approximate the size of the symmetric difference between their two input sets
determined during the sketch phase. We prove the following theorem:

Theorem 1.3. In the adversarial sketch model, for every N, K, 0 < δ < 1 and constant 0 < ρ ≤ 1,

there exists a protocol for approximating the size of the symmetric difference between two sets of

size at most K taken from a universe of size N with the following properties:

1. For any two sequences of insert and delete operations communicated to the parties that lead

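to sets with symmetric difference of size $\Delta_{\mathrm{OPT}}$, the parties output $\Delta_{\mathrm{APX}}$ such that

$\Pr[\Delta_{\mathrm{OPT}} \le \Delta_{\mathrm{APX}} \le (1 + \rho)\Delta_{\mathrm{OPT}}] > 1 - \delta$.

2. Sketches of size $O(\sqrt{K} \cdot \log N \cdot (\log\log K + \log(1/\delta)))$.

3. Update time $O(\log K \cdot \log N)$.

4. Communication complexity $O((\log^2 K + \log K \cdot \log\log N) \cdot (\log\log K + \log(1/\delta)))$.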

As with the equality testing protocol, our construction provides an explicit protocol as well.

The explicit protocol guarantees that $\Pr[\Delta_{\mathrm{OPT}} \le \Delta_{\mathrm{APX}} \le \mathrm{polylog}(N) \cdot \Delta_{\mathrm{OPT}}] > 1 - \delta$, and the size of

sketches, update time and communication complexity match those stated in Theorem 1.3 up to

poly-logarithmic factors.
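The approximation guarantee of Theorem 1.3 is one-sided: in the good event the output never underestimates the true size and overestimates it by a factor of at most 1 + ρ. The small check below spells this out. It only evaluates the inequality for given values and does not implement the protocol; the function names are introduced for this example.

```python
def symmetric_difference_size(set_a, set_b):
    """The exact quantity being approximated: |A \\ B| + |B \\ A|."""
    return len(set(set_a) ^ set(set_b))

def within_guarantee(delta_opt, delta_apx, rho):
    """Check Delta_OPT <= Delta_APX <= (1 + rho) * Delta_OPT."""
    return delta_opt <= delta_apx <= (1 + rho) * delta_opt

delta_opt = symmetric_difference_size({1, 2, 3, 4}, {3, 4, 5, 6, 7})
print(delta_opt)
print(within_guarantee(delta_opt, 6, rho=0.5))  # a mild overestimate
print(within_guarantee(delta_opt, 4, rho=0.5))  # an underestimate
```

With ρ = 0.5 and a true size of 5, any output between 5 and 7.5 satisfies the guarantee, while an output of 4 violates it. For the explicit protocol the factor 1 + ρ is replaced by a poly-logarithmic factor in N.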

⁸ The probability is taken only over the internal coin tosses of the honest parties.
