
Epistemic Privacy

Alexandre Evfimievski, Ronald Fagin, David Woodruff

IBM Almaden Research Center

650 Harry Rd., San Jose, CA, USA

{evfimi, fagin, dpwoodru}@us.ibm.com

ABSTRACT

We present a novel definition of privacy in the framework of offline (retroactive) database query auditing. Given information about the database, a description of sensitive data, and assumptions about users' prior knowledge, our goal is to determine if answering a past user's query could have led to a privacy breach. According to our definition, an audited property A is private, given the disclosure of property B, if no user can gain confidence in A by learning B, subject to prior knowledge constraints. Privacy is not violated if the disclosure of B causes a loss of confidence in A. The new notion of privacy is formalized using the well-known semantics for reasoning about knowledge, where logical properties correspond to sets of possible worlds (databases) that satisfy these properties. Database users are modelled as either possibilistic agents whose knowledge is a set of possible worlds, or as probabilistic agents whose knowledge is a probability distribution on possible worlds.

We analyze the new privacy notion, show its relationship with the conventional approach, and derive criteria that allow the auditor to test privacy efficiently in some important cases. In particular, we prove characterization theorems for the possibilistic case, and study in depth the probabilistic case under the assumption that all database records are considered a-priori independent by the user, as well as under more relaxed (or absent) prior-knowledge assumptions. In the probabilistic case we show that for certain families of distributions there is no efficient algorithm to test whether an audited property A is private given the disclosure of a property B, assuming P ≠ NP. Nevertheless, for many interesting families, such as the family of product distributions, we obtain algorithms that are efficient both in theory and in practice.

Categories and Subject Descriptors: H.2.7 [Database Management]: Database Administration; F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems

General Terms: Algorithms, Security, Theory

Keywords: privacy, disclosure, auditing, query logs, reasoning

about knowledge, supermodularity, Positivstellensatz

1. INTRODUCTION

Today, privacy protection has become a popular and even fash-

ionable area of database research. This situation is, of course, quite

natural, given the importance of privacy in our social life and the

risks we face in the digital world. These risks were highlighted

by numerous recent reports of personal data theft and misappro-

priation, prompting many countries to enact data protection laws.

However, the current state of scientific knowledge still does not al-

low the implementation of a comprehensive privacy solution that

guarantees provable protection. In fact, the notion of privacy itself


has many definitions and interpretations, some focused on theoret-

ical soundness, others on practical usefulness. This paper attempts

to reduce the gap between these two aspects by exploring more

flexible yet sound definitions.

One typical privacy enforcement problem, called query auditing,

is to determine if answering a user’s database query could lead to a

privacy breach. To state the problem more accurately, we assume

that the auditor is given:

• The database at the time of the user’s query, or some partial

knowledge about that database;

• A description of information considered sensitive, often

called the privacy policy or the audit query;

• Assumptions about the user’s prior knowledge of the

database, of the audit query/privacy policy, and of the audi-

tor’s privacy enforcement strategy if it exists;

• The user’s query, or a range of queries.

The auditor wants to check whether answering a given query could

augment the user’s knowledge about some sensitive data, thereby

violating the privacy of that data. This problem has two extensions:

proactive privacy enforcement (also called online auditing [18]),

and retroactive or offline auditing.

In the proactive (online) privacy enforcement scenario, users is-

sue a stream of queries, and the database system decides whether

to answer or to deny each query. The denial, when it occurs, is

also an “answer” to some (implicit) query that depends on the au-

ditor’s privacy enforcement strategy, and therefore it may disclose

sensitive data. The strategy has to be chosen in advance, before

the user’s queries become available. A strategy that protects pri-

vacy for a specified range of queries represents a solution to this

auditing problem. An in-depth discussion of online auditing can be

found in [18, 23] and papers referenced therein.

In the retroactive (offline) scenario, the users issue their queries

and receive the answers; later, an auditor checks if a privacy viola-

tion might have occurred. The audit results are not made available

to the users, so the auditor’s behavior no longer factors into the dis-

closure of data, and this considerably simplifies the problem. This

also allows for more flexibility in defining sensitive information:

while in the proactive case the privacy policy is typically fixed and

open to the users, in the retroactive case the audit query itself may

be sensitive, e.g. based on an actual or suspected privacy breach [1,

22]. Retroactive auditing is the application that motivates this pa-

per, although our framework turns out to be fairly general.

To further illustrate the above, suppose Alice asks Bob for his

HIV status. Assume that Bob never lies and considers "HIV-

positive” to be sensitive information, while “HIV-negative” is for

him OK to disclose. Bob is HIV-negative at the moment; can he

adopt the proactive strategy of answering “I am HIV-negative” as

long as it is true? Unfortunately, this is not a safe strategy, because

if he does become HIV-positive in the future, he will have to deny


further inquiries, and Alice will infer that he contracted HIV. The

safest bet for Bob is to always refuse an answer.¹

For the retroactive scenario, suppose that Bob contracted HIV

in 2006. Alice, Cindy and Mallory legitimately gained access to

Bob’s health records and learned his HIV status, but Alice and

Cindy did it in 2005 and Mallory did in 2007. Bob discovers that

his disease is known to the drug advertisers, and he initiates an au-

dit, specifying “HIV-positive” as the audit query. The audit will

place the suspicion on Mallory, but not on Alice and Cindy.

In legal practice, retroactive law enforcement has been shown to be better suited to the complex needs of our society, although proactive

measures are used too, especially in simple or critical situations.

For example, a valuable item can be protected from theft by lock

and key (a proactive measure) or by the fear of being caught and

jailed (a retroactive measure). If it is simple to fence off the item

and distribute the keys to all authorized users, or if the item has

extraordinary value, then proactive defense is the best option, but

in less clear-cut cases this would be too cumbersome or intrusive.

After all, even an authorized user might steal or lose the item, and

even a stranger sometimes should be able to gain access to it, e.g.

in an emergency. Healthcare [2] is one area where the complexity

of data management is just too high to hope for a fully proactive

solution to privacy. The importance of offline disclosure auditing in healthcare has been recognized by the U.S. President's Information Technology Advisory Committee [26], which recommended that healthcare information systems have the capability to audit who has accessed patient records. We believe in the coexistence and importance of both auditing approaches.

1.1 Privacy Definitions in Query Auditing

The art of encryption and cryptanalysis goes back to antiquity,

but the scientific maturity of privacy theory was made possible only

in modern times by mathematical modeling of the eavesdropper’s

knowledge. One of the first such models was proposed in 1949

by Claude Shannon [29], who introduced the notion of perfect se-

crecy. Shannon suggested to represent the adversarial knowledge

by a probability distribution over possible private data values: prior

distribution before the cryptogram is revealed, and posterior dis-

tribution after the adversary sees the cryptogram (but not the key).

Perfect secrecy corresponds to the situation where the posterior dis-

tribution is identical to the prior, for every possible cryptogram.

This general idea has been later adapted and extended to many pri-

vacy frameworks and problems, including query auditing.

Denote by Ω the set of all possible databases, and by A and B

two properties of these databases; each database ω ∈ Ω either has

or does not have each property. Assume that the actual database

satisfies both A and B. Suppose that property A is sensitive, and

property B is what user Alice has learned by receiving the answer

to her query. Was the privacy of A violated by the disclosure of B?

This depends on what Alice knew before learning B; for example,

if she knew “B ⇒ A” (but did not know A), then B of course

revealed A to her. Miklau and Suciu [21] applied Shannon’s model

to this problem and declared A to be private given B if and only if

P[A|B] = P[A]

(1)

for all probability distributions P over Ω that might describe Al-

ice’s prior knowledge about the database. Unfortunately, if no con-

straints are placed on P, no pair (A,B) of non-trivial properties

will satisfy this privacy definition. Miklau and Suciu considered a

quite limiting, yet popular, constraint: that Alice treats all database

records r ∈ ω independently, i.e. P is a product distribution:

P(ω) = ∏_{r∈ω} P[r] × ∏_{r∉ω} (1 − P[r])

¹If Alice pays Bob for answers, he can balance privacy and profit by tossing a coin and answering "I am HIV-negative" only if the coin falls heads.

Under this constraint, they prove that property A is private given

the disclosure of B if and only if they share no critical records

(Theorem 3.5 in [21]). A database record, real or imaginary, is

called “critical” for A (for B) if its presence or absence in some

database may decide the truthvalue for A (for B). One cansee that,

even with prior knowledge restricted to product distributions, very

few practical queries would get privacy clearance: usually we can

find an imaginary record and a pair of imaginary databases, ωA for A and ωB for B, where inserting the record into ωA (into ωB) flips

the truth value of A (of B). Perfect secrecy appears too demanding

to be practical.

A number of recent papers studied ways to relax condition (1)

and make it approximate. They follow the same principle: for cer-

tain pairs of numerical bounds (ρ1,ρ2), ρ1 < ρ2, require that

P[A] ≤ ρ1 ⇒ P[A|B] ≤ ρ2

where P is a prior knowledge distribution. This idea is behind the definition of ρ1-to-ρ2 privacy breach in [12]; Kenthapadi et al. [18] use a slightly different version as part of their definition:

1 − λ ≤ P[A|B]/P[A] ≤ 1/(1 − λ)

The Sub-Linear Queries (SuLQ) framework developed in [5, 10, 11] has a more sophisticated version with nice theoretical characteristics:

Pr[ log( P[A|B] / (1 − P[A|B]) ) − log( P[A] / (1 − P[A]) ) > ε ] ≤ δ    (2)

While there is no space here to thoroughly review these frame-

works, conceptually they all require that no user can gain much

confidence in the audited property A by learning the disclosed

property B, subject to prior knowledge constraints.

Perhaps surprisingly, however, all papers known to us, in their

proofs if not in their definitions, do not make any distinction be-

tween gaining and losing the confidence in A upon learning B. For

example, the SuLQ results remain in force if the privacy definition

of [5] is changed by placing the absolute value sign “|...|” over the

difference in (2). In some papers [11] the “|...|” appears in the def-

inition explicitly.

It turns out that taking advantage of the gain-vs.-loss distinction

yields a remarkable increase in the flexibility of query auditing.

To bring it into focus, we shall put aside the approximate privacy

relaxations and replace Eq. (1) with inequality

P[A|B] ≤ P[A]    (3)

(3)

That is, we call property A private given the disclosure of property

B when (3) holds for all distributions P that are admissible as a

user’s prior knowledge. One might call this “semiperfect secrecy,”

for it has the same sort of “absolute” form as perfect secrecy. This

and related notions are the subject of this paper.

Let us illustrate its flexibility with a simple example of Alice

(a user) and Bob (a patient). The hospital’s database ω has two

records: r1 = “Bob is HIV-positive” and r2 = “Bob had blood

transfusions.” The sensitive property A is the presence of r1, i.e.

that Bob is HIV-positive. The property B that Alice queries and

learns is “r1 ∈ ω implies r2 ∈ ω,” i.e. that “if Bob is HIV-positive,

then he had blood transfusions.” We make no constraints on Alice’s

prior knowledge distribution, other than a nonzero probability of

the actual database. Could the disclosure of B violate the privacy

of A? Look at the following table of possible worlds:

              r2 ∈ ω        r2 ∉ ω
  r1 ∈ ω      A is true     A is true ✗
  r1 ∉ ω      A is false    A is false


For Alice, learning B has the effect of ruling out the cell marked with a ✗, while leaving the other cells untouched. Whatever the cells' prior probabilities are, the odds of A can only go down: P[A|B] ≤ P[A]. Thus, A is private with respect to B, even though A and B share a critical record r1, and regardless of any possible dependence among the records.²
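As a quick sanity check, the following Python sketch (our illustration, not part of the paper) enumerates the four possible worlds, draws many random priors, and confirms that conditioning on B can only decrease the probability of A:

```python
import random

# Worlds are pairs (r1, r2): r1 = "Bob is HIV-positive", r2 = "Bob had transfusions".
worlds = [(r1, r2) for r1 in (0, 1) for r2 in (0, 1)]
A = {w for w in worlds if w[0] == 1}                       # Bob is HIV-positive
B = {w for w in worlds if not (w[0] == 1 and w[1] == 0)}   # "r1 implies r2"

for _ in range(10000):
    weights = [random.random() for _ in worlds]
    total = sum(weights)
    P = {w: x / total for w, x in zip(worlds, weights)}    # random prior
    pA = sum(P[w] for w in A)
    pB = sum(P[w] for w in B)
    pAB = sum(P[w] for w in A & B)
    # Posterior must not exceed prior: P[A|B] <= P[A].
    assert pAB / pB <= pA + 1e-12

print("P[A|B] <= P[A] held for every sampled prior")
```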

1.2 Summary of Results

This paper studies a notion of database privacy that makes it

illegal for users to gain confidence about sensitive facts, yet allows

arbitrary confidence loss. We begin in Sections 2 and 3 by introduc-

ing two novel privacy frameworks that implement the above con-

cept for two different knowledge representations: possibilistic and

probabilistic. We outline some properties of our privacy definitions

that are relevant to the problem of testing privacy, and give neces-

sary and sufficient conditions for privacy with no restrictions on the

user’s prior knowledge.

Section 4 delves deeper into the possibilistic model. For cer-

tain important cases, notably when the constraints on a user’s prior

knowledge are intersection-closed (i.e. not violated by the collu-

sion of users), we give necessary and sufficient criteria for testing

possibilistic privacy, which reduce the complexity of this problem.

Sections 5 and 6 focus on the more complex probabilistic model,

over the set {0,1}ⁿ of Boolean vectors that represent subsets of

database records. Section 5 studies two probabilistic prior knowl-

edge constraints: bit-wise independence (product distributions) and

log-supermodularity. The bit-wise independence constraint was

used also in [21] by Miklau and Suciu, so our work can be viewed

as an extension of theirs. Log-supermodularity is chosen to pro-

vide a “middle ground” between bit-wise independence and the

unconstrained prior knowledge.

We give simple combinatorial necessary criteria and sufficient criteria for privacy under the log-supermodular and the product distribution constraints.

supermodular and the product distribution constraints.

In Section 6, we study more general families Π of distributions

over {0,1}ⁿ that can be described by the intersection of a finite

number of polynomial inequalities in a finite number of real-valued

variables. We prove that even for certain very restricted Π, de-

ciding whether a set B ⊆ {0,1}ⁿ violates the privacy of a set A ⊆ {0,1}ⁿ with respect to distributions in Π cannot be done in

polynomial time, unless P = NP.

We overcome this negative result in two ways. First, using some

deep results from algebraic geometry, we show that in certain inter-

esting cases, such as when Π is the family of product distributions,

there are provably efficient algorithms for deciding privacy. Sec-

ond, we describe the sum-of-squares heuristic for deciding privacy

for any Π, which has been implemented and works remarkably

well in practice.


2. WORLDS AND AGENTS

Epistemology, the study of knowledge, has a long and honorable

tradition in philosophy, starting with the early Greek philosophers.

Philosophers were concerned with questions such as “What does it

mean to say that someone knows something?” In the 1950’s and

1960’s [17, 19, 33] the focus shifted more to developing an epis-

temic logic, a logic of knowledge, and trying to capture the inherent

properties of knowledge. Here there is a set Ω of possible worlds,

one of which is the “real world” ω∗. An agent’s knowledge is a set

S ⊆ Ω of worlds that the agent considers possible. Since we are

modeling knowledge rather than belief, we require that ω∗∈ S.

If F is a (possible) fact, and A ⊆ Ω is the set of possible worlds

where F is true, then we say that the agent knows F iff S ⊆ A.

More recently, researchers in such diverse fields as economics,

linguistics, artificial intelligence, and theoretical computer science

²Note that if Bob proactively tells Alice "If I am HIV-positive, then I had

blood transfusions,” a privacy breach of A may occur, because Alice may

learn more than just B.

have become interested in reasoning about knowledge [13]. The

focus of attention has shifted to pragmatic concerns about the re-

lationship between knowledge and action. That is our focus: the

effect of an action, such as the disclosure of certain information, on

the knowledge of an agent.

Worlds

Let Ω be a finite set of all possible databases. We shall

call a database ω ∈ Ω a world, and the entire Ω the set of all

possible worlds. The actual world, denoted by ω∗, represents the

real database. Every property of the database, or assertion about

its contents, can be formulated as “ω∗∈ A” where A ⊆ Ω is the

set of all databases that satisfy the property. A subset A ⊆ Ω that

contains ω∗ shall be called a knowledge set.

Agents

We shall think of database users as agents who know

something about the worlds in Ω and who try to figure out which

ω ∈ Ω is the actual world ω∗. An agent’s knowledge can be mod-

elled in different ways; we shall consider two approaches. In a

possibilistic agent, knowledge is represented by a set S ⊆ Ω that

contains exactly all the worlds this agent considers possible. In par-

ticular, ω∗∈ S. Here every world is either possible or not, with no

ranking or score assigned. In a probabilistic agent, knowledge is represented by a probability distribution P : Ω → R⁺ that assigns a nonnegative weight P(ω) to every world. We denote the sum Σ_{ω∈A} P(ω) by P[A], requiring that P[Ω] = 1 and P(ω∗) > 0. We say that a possibilistic agent with knowledge S knows a property A ⊆ Ω when S ⊆ A. We say that A is possible for this agent when S ∩ A ≠ ∅, i.e. when the agent does not know Ω − A. For a probabilistic agent with distribution P, to know A means to have P[A] = 1, and to consider A possible means to have P[A] > 0.

A function Q that maps Ω to another set shall be called a query; if its range is {0,1} then Q is a Boolean query. For a given actual world ω∗, each query Q corresponds to the knowledge set associated with the query's "actual" output: {ω ∈ Ω | Q(ω) = Q(ω∗)}.

The Auditor

There is a special "meta-agent" called the auditor whose task is to analyse the queries disclosed to the users and determine which of these disclosures could breach privacy. The auditor may or may not have complete information about the actual world ω∗. For example, if the query disclosure occurred several years ago, the record update logs may provide only a partial description of the database state at that moment. Even more importantly, the auditor does not know what the user's knowledge of the database was at the disclosure time. We characterize the auditor's knowledge by specifying which pairs of a database ω and the user's knowledge S (or P) the auditor considers possible. Let us formally define the auditor's knowledge about a user:

Definition 2.1. (Possibilistic case) A possibilistic knowledge world is a pair (ω,S), where ω is a world and S is a knowledge set, which satisfies ω ∈ S ⊆ Ω. The set of all possibilistic knowledge worlds shall be denoted as

Ωposs := {(ω,S) | ω ∈ S ⊆ Ω}.

Ωposs can be viewed as an extension of Ω. For a given user whose knowledge is S∗ ⊆ Ω, the pair (ω∗,S∗) ∈ Ωposs is called the actual knowledge world. The auditor's knowledge about the user is defined as a non-empty set K ⊆ Ωposs of knowledge worlds, which must include the actual knowledge world. We refer to K as a second-level knowledge set.

Our knowledge worlds are similar to the 2-worlds of [14], except that the 2-worlds of [14] would deal not only with the knowledge that the user has of the world, but also with the knowledge that the auditor has of the world. Also, our second-level knowledge sets are similar to the 3-worlds of [14], except that the 3-worlds of [14] would deal not only with the knowledge that the auditor has about the user's knowledge of the world, but also with the knowledge that the user has about the auditor's knowledge of the world.


Definition 2.2. (Probabilistic case) A probabilistic knowledge world is a pair (ω,P) where P is a probability distribution over Ω such that P(ω) > 0. The set of all probabilistic knowledge worlds shall be denoted as

Ωprob := {(ω,P) | P is a distribution, P(ω) > 0}.

The actual knowledge world (ω∗,P∗) ∈ Ωprob and the auditor's second-level knowledge set K ⊆ Ωprob are defined analogously to the possibilistic case.

Remark 2.3. The requirement of ω ∈ S for every pair (ω,S) ∈ Ωposs and of P(ω) > 0 for every pair (ω,P) ∈ Ωprob represents our assumption that every agent considers the actual world possible. All pairs that violate this assumption are excluded as inconsistent. Note that a probabilistic pair (ω,P) is consistent iff the possibilistic pair (ω, supp(P)) is consistent, where supp(P) := {ω | P(ω) > 0}.

Remark 2.4. In practice, it may be computationally infeasible to precisely characterize the auditor's second-level knowledge and to use this precisely characterized knowledge in the privacy definitions. Instead, the auditor makes assumptions about the database and the user's knowledge by placing constraints on the possible pairs (ω,S) or (ω,P). These assumptions and constraints are also represented by a second-level knowledge set, which must contain the auditor's precise knowledge set as a subset. From now on, when we talk about the auditor's knowledge set, we mean a superset of the actual knowledge set, unless stated otherwise.

Definitions 2.1 and 2.2 allow us to consider an auditor whose

assumptions about the user’s knowledge depend on the contents of

the database. For example, the auditor may assume that, if the hos-

pital database contains record “Bob’s doctor is Alice,” then Alice

knows Bob’s HIV status, but if there is no such record, then Alice

may or may not know it. However, in many situations we can sep-

arate the auditor’s knowledge about the database from the auditor’s

assumptions about the user. We do so by specifying two sets:

1. A non-empty set C ⊆ Ω that consists of all databases the

auditor considers possible, with ω∗∈ C;

2. A family Σ of subsets of Ω and/or a family Π of probability

distributions over Ω. The possibilistic agent’s knowledge has

to belong to Σ, the probabilistic agent’s knowledge has to

belong to Π.

If the auditor knows the actual database exactly, e.g. by recon-

structing its state from the update logs, then C = {ω∗}; if the

auditor has no information about the database or is unwilling to

take advantage of it, then C = Ω. Some choices for Σ and Π will

be discussed in the subsequent sections.

When we say that the auditor’s knowledge is represented by C

and Σ described above, we mean that all knowledge worlds (ω,S)

with ω ∈ C and S ∈ Σ, and none other, are considered possible

by the auditor. However, in most cases the auditor’s second-level

knowledge set cannot be the Cartesian product C × Σ, because it

contains inconsistent (ω,S) pairs (see Remark 2.3). The same is

true in the probabilistic case, for C and Π. Let us then define a

product operation that excludes all inconsistent pairs:

Definition 2.5. The product of a set C ⊆ Ω and a family Σ of subsets of Ω (a family Π of probability distributions over Ω) is a second-level knowledge set C ⊗ Σ (C ⊗ Π) defined by

C ⊗ Σ := {(ω,S) ∈ C×Σ | ω ∈ S} = (C×Σ) ∩ Ωposs
C ⊗ Π := {(ω,P) ∈ C×Π | P(ω) > 0} = (C×Π) ∩ Ωprob

We call the pair (C,Σ) or (C,Π) consistent if their product C ⊗ Σ or C ⊗ Π is non-empty, because ∅ is not a valid second-level knowledge set.

3. PRIVACY OF KNOWLEDGE

This section introduces the definition of privacy for the possi-

bilistic and the probabilistic knowledge models. Let A,B ⊆ Ω

be two arbitrary non-empty subsets of Ω; as a shorthand, write

¯A = Ω − A and AB = A ∩ B. Sets A and B correspond to two

Boolean queries on the database ω∗; e.g. query A returns “true” iff

ω∗∈ A and “false” otherwise.

We shall study the following question: When could the disclo-

sure of B violate the privacy of A? In our model, a positive result

of query A is considered private and needs protection, whereas a

negative result (that asserts ¯A) is not protected. Neither the user

nor the auditor are assumed to know if A is true, and A may ac-

tually be false. On the other hand, B represents the disclosed fact,

and therefore B has to be true. The auditor knows that B is true;

the user transitions from not knowing B to knowing B.

Conceptually, we say that property A is private, given the disclo-

sure of property B, if the user could not gain confidence in A by

learning B. Below we shall make this notion precise for the two

knowledge models, possibilistic and probabilistic. From now on,

we shall use pronoun “he” for the user and “she” for the auditor.

3.1 Possibilistic Privacy

Let us suppose first that the auditor knows everything: the actual database ω∗ such that ω∗ ∈ B, and the actual knowledge set S∗ of the user at the time of the disclosure. In the possibilistic model, the user may have only two "grades of confidence" in property A: he either knows A (S∗ ⊆ A), or he does not (S∗ ⊈ A). The user gains confidence iff he does not know A before learning B (i.e. S∗ ⊈ A) and knows A after learning B (i.e. S∗ ∩ B ⊆ A). Therefore, the privacy of A is preserved iff ¬(S∗ ⊈ A & S∗ ∩ B ⊆ A), or equivalently, iff

S∗ ∩ B ⊆ A ⇒ S∗ ⊆ A.    (4)

Now, suppose that the auditor does not know ω∗and S∗pre-

cisely, but has a second-level knowledge set K ⊆ Ωposs such

that (ω∗,S∗) ∈ K. Then the auditor makes sure that A is pri-

vate given B by checking condition (4) for all pairs in K. Before

doing so, the auditor must discard from K all pairs (ω,S) such that

ω / ∈ B, because they are inconsistent with the disclosure of B. We

arrive at the following possibilistic privacy definition:

Definition 3.1. Set A ⊆ Ω is called K-private given the disclosure of set B ⊆ Ω, for K ⊆ Ωposs, iff

∀(ω,S) ∈ K : (ω ∈ B & S ∩ B ⊆ A) ⇒ S ⊆ A.    (5)

We denote this predicate by SafeK(A,B).

Remark 3.2. It is easy to see from (5) that SafeK(A,B) and K′ ⊆ K imply SafeK′(A,B). Therefore, the auditor may assume less than she actually knows, i.e. consider more knowledge worlds possible, and still catch all privacy violations, at the expense of restricting more queries.
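When Ω is small, Definition 3.1 can be executed literally. Below is a minimal brute-force checker (our illustration; the toy instance is borrowed from Remark 4.2 later in the paper):

```python
def safe_poss(K, A, B):
    """Test condition (5): for every (w, S) in K with w in B and
    S ∩ B ⊆ A, we must already have S ⊆ A."""
    for w, S in K:
        if w in B and (S & B) <= A and not S <= A:
            return False
    return True

# Toy instance: Omega = {1,2,3}, the user may know nothing (S = Omega).
Omega = frozenset({1, 2, 3})
K = {(w, Omega) for w in Omega}          # K = Omega ⊗ {Omega}
A = frozenset({3})
B1, B2 = frozenset({1, 3}), frozenset({2, 3})
print(safe_poss(K, A, B1), safe_poss(K, A, B2))   # True True
print(safe_poss(K, A, B1 & B2))                   # False: {3} reveals A
```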

When the auditor wants to separate her knowledge about the database from her assumptions about the user's knowledge, she represents her second-level knowledge set K as a product C ⊗ Σ, where C ⊆ Ω and Σ is a family of subsets of Ω. In this case we shall use the term "(C,Σ)-private" and the notation SafeC,Σ(A,B), which is defined as SafeC⊗Σ(A,B); let us also use P(Ω) to denote the power set of Ω.

PROPOSITION 3.3. For a consistent pair (C,Σ) such that C ⊆ Ω and Σ ⊆ P(Ω), the privacy predicate SafeC,Σ(A,B) can be equivalently defined as follows: (denoting S ∩ B ∩ C as SBC)

∀S ∈ Σ : (SBC ≠ ∅ & SB ⊆ A) ⇒ S ⊆ A.    (6)


3.2 Probabilistic Privacy

Once again, suppose first that the auditor knows the actual

database ω∗∈ B and the actual probability distribution P∗that

represents the user’s knowledge prior to the disclosure. As opposed

to Section 3.1, in the probabilistic model the user has a continuum

of "grades of confidence" in A, measured by P∗[A]. The user gains confidence iff his prior probability of A before learning B, which is P∗[A], is strictly smaller than his posterior probability of A after B is disclosed, which is P∗[A|B]. Therefore, the privacy of A is preserved iff

P∗[A|B] ≤ P∗[A].    (7)

The conditional probability P∗[A|B] is well-defined since P∗[B] ≥ P∗(ω∗) > 0.

When the auditor does not know ω∗and P∗, but has a second-

level knowledge set K ⊆ Ωprob such that (ω∗,P∗) ∈ K, she has

to check inequality (7) for all possible pairs (ω,P) in K. Before

doing so, she must discard all pairs (ω,P) such that ω / ∈ B. We

obtain the following probabilistic privacy definition:

Definition 3.4. Set A ⊆ Ω is called K-private given the disclo-

sure of set B ⊆ Ω, for K ⊆ Ωprob, iff

∀(ω,P) ∈ K : ω ∈ B ⇒ P[A|B] ≤ P[A].    (8)

As before, we denote this predicate by SafeK(A,B).

Remark 3.5. In the probabilistic case, too, SafeK(A,B) and

K′ ⊆ K imply SafeK′(A,B). Thus, Remark 3.2 applies here.

When the auditor’s knowledge can be represented as a product

C ⊗ Π for some C ⊆ Ω and some family Π of probability dis-

tributions over Ω, we shall use the term “(C,Π)-private” and the

notation SafeC,Π(A,B), which is defined as SafeC⊗Π(A,B). In

this case the following proposition can be used:


PROPOSITION 3.6. For a consistent pair (C,Π) where C ⊆ Ω

and Π is a family of distributions over Ω, the privacy predicate

SafeC,Π(A,B) can be equivalently defined as follows:

∀P ∈ Π : P[BC] > 0 ⇒ P[AB] ≤ P[A]P[B].    (9)

In fact, the definition of privacy given by (9) can be further sim-

plified, for many families Π that occur in practice:

Definition 3.7. We shall call a family Π ω-liftable for ω ∈ Ω when ∀P ∈ Π such that P(ω) = 0 it satisfies the condition

∀ε > 0 ∃P′ ∈ Π : P′(ω) > 0 & ||P − P′||∞ < ε.    (10)

Family Π is called S-liftable for a set S ⊆ Ω iff Π is ω-liftable for ∀ω ∈ S. The norm ||P − P′||∞ := max_{ω∈Ω} |P(ω) − P′(ω)|.

PROPOSITION 3.8. For a consistent pair (C,Π) such that family Π is C-liftable, and given BC ≠ ∅ (since ω∗ ∈ BC), predicate SafeC,Π(A,B) is equivalent to SafeΠ(A,B) defined as

SafeΠ(A,B)  ⟺(def)  ∀P ∈ Π : P[AB] ≤ P[A]P[B].

3.3 Knowledge Acquisition

An agent (a database user) modifies his knowledge when he receives a disclosed query result. The disclosed property is a knowledge set B ⊆ Ω, telling the agent that every world in Ω − B is impossible. We model the agent's acquisition of B as follows. A possibilistic agent with prior knowledge S ⊆ Ω, upon receiving B such that S ∩ B ≠ ∅ (because ω∗ ∈ S ∩ B), ends up with posterior knowledge S ∩ B. A probabilistic agent with prior distribution P : Ω → R⁺, upon receiving B such that P[B] ≥ P(ω∗) > 0, ends up with posterior distribution P(·|B) defined by

P(ω|B) = P(ω)/P[B] for ω ∈ B;  P(ω|B) = 0 for ω ∈ Ω − B.    (11)

The acquisition of B1 followed by B2 is equivalent to the acquisition of B1B2 = B1 ∩ B2. When the auditor's second-level knowledge set K represents her assumption about the user's knowledge, rather than her knowledge of the user's knowledge (see Remark 2.4), she may want to require that K remains a valid assumption after each disclosure. This property is formalized below:

Definition 3.9. Let K be a second-level knowledge set, which may be possibilistic (K ⊆ Ωposs) or probabilistic (K ⊆ Ωprob). A set B ⊆ Ω is called K-preserving when for all (ω,S) ∈ K or (ω,P) ∈ K such that ω ∈ B we have (ω, S ∩ B) ∈ K or (ω, P(·|B)) ∈ K, respectively.

Suppose that knowledge sets B1 and B2 are individually safe to disclose, while protecting the privacy of A, to an agent whose knowledge satisfies the constraints defined by K. If, after B1 is disclosed, the updated agent's knowledge still satisfies the constraints, then it is safe to disclose B2 too. Thus, it is safe to disclose both sets at once, as long as at least one of them preserves the constraints:

PROPOSITION 3.10. For every second-level knowledge set K, possibilistic or probabilistic, we have:

1. B1 and B2 are K-preserving ⇒ B1 ∩ B2 is K-preserving;

2. If SafeK(A,B1) and SafeK(A,B2) and if at least one of B1, B2 is K-preserving, then SafeK(A, B1 ∩ B2).
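Both acquisition rules are mechanical to implement; the sketch below (ours) mirrors the possibilistic intersection rule and formula (11):

```python
def update_possibilistic(S, B):
    # Posterior knowledge is S ∩ B; the disclosure must be consistent.
    posterior = S & B
    assert posterior, "inconsistent disclosure: S ∩ B must be non-empty"
    return posterior

def update_probabilistic(P, B):
    # Condition P on B as in (11): renormalize inside B, zero outside.
    pB = sum(p for w, p in P.items() if w in B)
    assert pB > 0, "inconsistent disclosure: P[B] must be positive"
    return {w: (p / pB if w in B else 0.0) for w, p in P.items()}
```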

3.4 Unrestricted Prior Knowledge

What is the characterization of privacy when the auditor knows nothing? More formally, which knowledge sets A and B satisfy K-privacy for K = Ωposs = Ω ⊗ P(Ω) and for K = Ωprob = Ω ⊗ Pprob(Ω), where Pprob(Ω) is the set of all probability distributions over Ω? Also, what is the answer to this question if the auditor has complete information about the actual world ω∗, but knows nothing about the user's knowledge, i.e. for K = {ω∗} ⊗ P(Ω) and for K = {ω∗} ⊗ Pprob(Ω)? Here is a theorem that answers these questions:

THEOREM 3.11. For all sets A,B ⊆ Ω and for all ω∗ ∈ B the following four conditions are equivalent:

1. Either A ∩ B = ∅, or A ∪ B = Ω;

2. SafeK(A,B) for K = Ωposs;

3. SafeK(A,B) for K = Ωprob;

4. SafeK(A,B) for K = {ω∗} ⊗ Pprob(Ω).

Also, the following two conditions are equivalent (again ω∗ ∈ B):

1. A ∩ B = ∅, or A ∪ B = Ω, or ω∗ ∈ B − A;

2. SafeK(A,B) for K = {ω∗} ⊗ P(Ω).

Remark 3.12. In the auditing practice, the interesting case is ω∗ ∈ A ∩ B, i.e. when the protected and the disclosed properties are both true. In this case A ∩ B ≠ ∅ and ω∗ ∉ B − A. Unconditional privacy can thus be tested by checking whether A ∪ B = Ω, i.e. whether "A or B" is always true.
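With unrestricted prior knowledge, Theorem 3.11 thus reduces auditing to a constant number of set operations; a one-line sketch (ours) of condition 1:

```python
def unconditionally_private(A, B, Omega):
    # Theorem 3.11, condition 1: A ∩ B = ∅ or A ∪ B = Omega.
    return not (A & B) or (A | B) == Omega
```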

4. POSSIBILISTIC CASE

In this section, we shall focus exclusively on the possibilistic

case; thus K ⊆ Ωposs. Proposition 4.1 below gives a necessary

and sufficient condition for K-preserving sets B to satisfy the pri-

vacy predicate SafeK(A,B), for a given and fixed set A. It asso-

ciates every world ω ∈ A with a “safety margin” β(ω) ⊆ Ω−A

which depends only on ω, A and K. Given B, the condition ver-

ifies whether every ω ∈ A occurs in B together with its “safety

margin,” or does not occur in B at all. The “safety margin” en-

sures that this ω will not reveal A to the agent, no matter what prior

knowledge S ∈ π2(K) the agent might have. (By πi we denote

the projection operation.)


PROPOSITION 4.1. Let K ⊆ Ωposs be an arbitrary second-level knowledge set, and let A ⊆ Ω. There exists a function β: A → P(Ω−A) such that ∀B ⊆ Ω

(∀ω ∈ AB : β(ω) ⊆ B) ⇒ SafeK(A,B),    (12)

and if B is K-preserving, then the converse holds:

SafeK(A,B) ⇒ (∀ω ∈ AB : β(ω) ⊆ B).    (13)

Remark 4.2. In Proposition 4.1, the condition of B being K-preserving is essential for the converse implication (13). Indeed, let Ω = {1,2,3}, K = Ω ⊗ {Ω}, and A = {3}. Then both B1 = {1,3} and B2 = {2,3} protect the K-privacy of A, yet B1 ∩ B2 = {3} does not. So, there is no suitable value for β(3). However, see Corollary 4.14 for more on this subject.

The characterization in Proposition 4.1 could be quite useful for auditing a lot of properties B1,B2,...,BN disclosed over a period of time, using the same audit query A. Given A, the auditor would compute the mapping β once, and use it to test every Bi. This comment applies to Section 4.1 as well.

4.1 Intersection-Closed Knowledge

Motivation. When two or more possibilistic agents collude, i.e. join forces in attacking protected information, their knowledge sets intersect: they jointly consider a world possible if and only if none of them has ruled it out. Therefore, if the auditor wants to account for potential collusions, she must consider knowledge world (ω, S1 ∩ S2) possible whenever she considers both (ω,S1) and (ω,S2) possible. This motivates the following definition:

Definition 4.3. A second-level knowledge set K ⊆ Ωposs is

intersection-closed, or ∩-closed for short, iff ∀(ω,S1) ∈ K and

∀(ω,S2) ∈ K we have (ω, S1∩ S2) ∈ K. Note that we intersect

the user's knowledge sets S1 and S2 only when they are

paired with the same world ω.

One way to obtain a second-level knowledge set K ⊆ Ωposs

that is ∩-closed is by taking an ∩-closed family Σ of subsets of Ω

(such that ∀S1,S2 ∈ Σ: S1∩S2 ∈ Σ) and computing the product

K = C ⊗ Σ with some knowledge set C.

Intervals.

When the auditor’s knowledge is ∩-closed, the notion

of an “interval” between two worlds becomes central in character-

izing the privacy relation:

Definition 4.4. Let K ⊆ Ωposs be ∩-closed, and let ω1,ω2 ∈ Ω be two worlds such that

ω1 ∈ π1(K),  ω2 ∈ ∪{S | (ω1,S) ∈ K}.    (14)

The K-interval from ω1 to ω2, denoted by IK(ω1,ω2), is the smallest set S such that (ω1,S) ∈ K and ω2 ∈ S, or equivalently:

IK(ω1,ω2) := ∩{S | (ω1,S) ∈ K, ω2 ∈ S}.

If the worlds ω1,ω2 do not satisfy conditions (14), we shall say that interval IK(ω1,ω2) does not exist.

The following proposition shows that we need to know only the intervals in order to check whether or not SafeK(A,B) holds:

PROPOSITION 4.5. For an ∩-closed set K ⊆ Ωposs and for all A,B ⊆ Ω, we have SafeK(A,B) if and only if

∀IK(ω1,ω2) : ω1 ∈ AB & ω2 ∉ A ⇒ IK(ω1,ω2) ∩ (B − A) ≠ ∅.    (15)

Figure 1: An example of an ∩-closed K ⊆ Ωposs where the worlds are the pixels inside the 14 × 7 rectangle (such as ω1, ω2 and ω′2), and the permitted user's knowledge sets are the integer sub-rectangles (i.e. composed of whole squares). Set ¯A is the complement of the privacy-sensitive knowledge set. See Example 4.9 for details.

Remark 4.6. As implied by Proposition 4.5, there is no need to store the entire ∩-closed second-level knowledge set K (which could require |Ω| · 2^|Ω| bits of data) in order to test the possibilistic privacy. It is sufficient to store one set IK(ω1,ω2) ⊆ Ω, or the fact of its non-existence, for each pair (ω1,ω2) ∈ Ω × Ω, i.e. at most |Ω|³ bits of data.
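When K is small enough to enumerate explicitly, this suggests a direct algorithm: materialize the intervals of Definition 4.4 and apply Proposition 4.5. A brute-force sketch (ours), assuming K is given as a collection of (ω, S) pairs:

```python
from itertools import product

def interval(K, w1, w2):
    """I_K(w1, w2): the intersection of all S with (w1, S) in K and
    w2 in S, or None if no such S exists (the interval does not exist)."""
    sets = [S for w, S in K if w == w1 and w2 in S]
    if not sets:
        return None
    I = set(sets[0])
    for S in sets[1:]:
        I &= S
    return I     # for ∩-closed K, this intersection is itself in K

def safe_via_intervals(K, A, B, Omega):
    # Proposition 4.5: Safe_K(A,B) iff every existing interval from a
    # world in AB to a world outside A intersects B − A.
    for w1, w2 in product(A & B, Omega - A):
        I = interval(K, w1, w2)
        if I is not None and not (I & (B - A)):
            return False
    return True
```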

Minimal intervals.

In fact, in Proposition 4.5 we do not even

have to check all intervals; it is enough to consider just “minimal”

intervals defined as follows:

Definition 4.7. For an ∩-closed second-level knowledge set K ⊆ Ωposs, for a world ω1 ∈ Ω and for a set X ⊆ Ω not containing ω1, an interval IK(ω1,ω2) is called a minimal K-interval from ω1 to X iff ω2 ∈ X and

∀ω′2 ∈ X ∩ IK(ω1,ω2) : IK(ω1,ω′2) = IK(ω1,ω2).

PROPOSITION 4.8. For an ∩-closed set K ⊆ Ωposs and for ∀A,B ⊆ Ω, we have SafeK(A,B) if and only if the formula (15) holds over all intervals IK(ω1,ω2) that are minimal from a world ω1 ∈ AB to the set Ω−A.

Example 4.9. Let Ω be an area of the plane that is bounded by

a rectangle and discretized into pixels to ensure finiteness (the area

within the 14×7 rectangle on Figure 1). Let the worlds be the pix-

els. Consider an auditor who does not know the actual database ω∗

and who assumes that the user’s prior knowledge set S ∈ Σ is an

integer rectangle, i.e. a rectangle whose four corners have integer

coordinates (corresponding to the vertical and horizontal lines in

the picture). The family Σ of integer rectangles is ∩-closed, and so

is the auditor’s second-level knowledge set K = Ω ⊗ Σ.

Given ω1,ω2 ∈ Ω, the interval IK(ω1,ω2) is the smallest integer rectangle that contains both ω1 and ω2. For ω1 and ω2 in Figure 1, the interval IK(ω1,ω2) is the light-grey rectangle from point (1,1) to point (4,4); for ω1 and ω′2, the interval IK(ω1,ω′2) is the rectangle from point (1,1) to point (9,3).

The interval IK(ω1,ω2) shown on the picture is one of the three minimal intervals from ω1 to set ¯A (the area bounded by the ellipse). The other two minimal intervals are the rectangles (1,1)−(5,3) and (1,1)−(6,2). Every set S such that (ω1,S) ∈ K and S ⊈ A, e.g. the interval IK(ω1,ω′2), must contain at least one of the three minimal intervals, implying Proposition 4.8 for the case of the actual world ω∗ = ω1.

Interval-induced partitions of ¯A. Let us have a closer look at the minimal K-intervals from a given world ω1 ∈ A to the set


¯A = Ω−A. For every ω2 ∈ ¯A, the interval IK(ω1,ω2), if it exists, is either minimal or not; if it is not minimal, then ω2 cannot belong to any minimal interval from ω1 to ¯A. Now, take some pair ω2, ω′2 ∈ ¯A such that both IK(ω1,ω2) and IK(ω1,ω′2) are minimal. There are two possible situations:

1. IK(ω1,ω2) = IK(ω1,ω′2), or

2. IK(ω1,ω2) ∩ IK(ω1,ω′2) ∩ ¯A = ∅.

Indeed, if ∃ω′′2 ∈ IK(ω1,ω2) ∩ IK(ω1,ω′2) ∩ ¯A, then by Definition 4.7 the interval IK(ω1,ω′′2) equals both of the minimal intervals, making them equal. We have thus shown the following

PROPOSITION 4.10. Given an ∩-closed set K ⊆ Ωposs, a set A ⊆ Ω, and a world ω1 ∈ A, the minimal K-intervals from ω1 to ¯A partition set ¯A into disjoint equivalence classes

¯A = D1 ∪ D2 ∪ ... ∪ Dm ∪ D′,

where two worlds ω2, ω′2 ∈ ¯A belong to the same class Di iff they both belong to the same minimal interval, or (class D′) when they both do not belong to any minimal interval.

Definition 4.11. In the assumptions and in the notation of

Proposition 4.10, denote

ΔK(¯ A,ω1) := {D1,D2,...,Dm}.

In other words, ΔK(¯A,ω1) is the disjoint collection of all sets formed by intersecting ¯A with the minimal intervals from ω1 to ¯A.

COROLLARY 4.12. Given an ∩-closed set K ⊆ Ωposs, for all

A,B ⊆ Ω we have SafeK(A,B) if and only if

∀ω1 ∈ AB, ∀Di ∈ ΔK(¯A,ω1) : B ∩ Di ≠ ∅.    (16)

As Figure 1 illustrates for Example 4.9, the three minimal intervals from ω1 to ¯A formed by integer rectangles (1,1)−(4,4), (1,1)−(5,3) and (1,1)−(6,2) are disjoint inside ¯A. Their intersections with ¯A, shown hatched in Figure 1, constitute the collection ΔK(¯A,ω1). A disclosed set B is private, assuming ω∗ = ω1, iff B intersects each of these three intervals inside ¯A.

The case of all-singleton ΔK's. If set K satisfies the property defined next, privacy testing is simplified still further:

Definition 4.13. An ∩-closed set K ⊆ Ωposs has tight intervals iff for every K-interval IK(ω1,ω2) we have

∀ω′2 ∈ IK(ω1,ω2) : ω′2 ≠ ω2 ⇒ IK(ω1,ω′2) ⊊ IK(ω1,ω2).

When K has tight intervals, every minimal interval IK(ω1,ω2) from ω1 ∈ A to ¯A has exactly one of its elements in ¯A, namely ω2: ¯A ∩ IK(ω1,ω2) = {ω2}. Then all equivalence classes Di in ΔK(¯A,ω1) are singletons, and Corollary 4.12 takes the form of Proposition 4.1:

COROLLARY 4.14. Let K ⊆ Ωposs be an ∩-closed set that has tight intervals, let A ⊆ Ω. Then ∃β: A → P(Ω−A) given by

∀ω1 ∈ A : β(ω1) := ∪ ΔK(¯A,ω1)

such that ∀B ⊆ Ω

SafeK(A,B) ⇔ (∀ω1 ∈ AB : β(ω1) ⊆ B).

Having tight intervals is essential for Corollary 4.14 to hold; see

Remark 4.2 for a counterexample where an ∩-closed K does not

have tight intervals.

5. MODULARITY ASSUMPTIONS FOR PROBABILISTIC KNOWLEDGE

In the previous section we clarified some general properties of

possibilistic knowledge; now we turn to the more complex prob-

abilistic case. Rather than studying arbitrary probabilistic knowl-

edge families, here we shall focus on a few specific, yet important,

families of distributions. Later, in Section 6, we present more so-

phisticated approaches that extend beyond these families.

From now on, we assume that Ω = {0,1}ⁿ for some fixed n. Let ω1 ∧ ω2 (ω1 ∨ ω2, ω1 ⊕ ω2) be the bit-wise "AND" ("OR", "XOR"), and define the partial order ω1 ⪯ ω2 to mean "∀i = 1...n: ω1[i] = 1 ⇒ ω2[i] = 1." A set S ⊆ Ω shall be called an up-set (a down-set) when ∀ω1 ∈ S, ∀ω2 ⪰ ω1 (∀ω2 ⪯ ω1) we have ω2 ∈ S.

Definition 5.1. A probability distribution P over Ω is called log-supermodular (log-submodular)³ when the following holds:

∀ω1,ω2 ∈ Ω : P(ω1)P(ω2) ≤ (≥) P(ω1 ∧ ω2)P(ω1 ∨ ω2)

The family of all log-supermodular distributions shall be denoted by Π+m, the family of all log-submodular distributions by Π−m.

A distribution P is called a product distribution if it makes every coordinate independent. Every product distribution corresponds to a vector (p1,...,pn) of Bernoulli probabilities, each pi ∈ [0,1], such that

∀ω ∈ {0,1}ⁿ : P(ω) = ∏_{i=1..n} pi^ω[i] · (1 − pi)^(1−ω[i])    (17)

The family of all product distributions shall be denoted by Π0m. It is easy to show that Π0m = Π−m ∩ Π+m [20]. In fact, P is a product distribution if and only if

∀ω1,ω2 ∈ Ω : P(ω1)P(ω2) = P(ω1 ∧ ω2)P(ω1 ∨ ω2)    (18)
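For an explicitly stored distribution over {0,1}ⁿ, with worlds encoded as n-bit integers so that bit-wise AND and OR realize ∧ and ∨, Definition 5.1 and identity (18) can be checked pointwise. A small sketch (ours):

```python
def is_log_supermodular(P, n):
    # P maps n-bit integers to probabilities; check Definition 5.1.
    return all(P[w1] * P[w2] <= P[w1 & w2] * P[w1 | w2] + 1e-12
               for w1 in range(2 ** n) for w2 in range(2 ** n))

def is_product(P, n):
    # Identity (18): equality in place of the inequality above.
    return all(abs(P[w1] * P[w2] - P[w1 & w2] * P[w1 | w2]) < 1e-9
               for w1 in range(2 ** n) for w2 in range(2 ** n))
```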

Supermodular and submodular functions occur often in mathe-

matics and have been extensively studied [15, 20]. Our goal in con-

sidering these assumptions was to substantially relax bit-wise inde-

pendence while staying away from the unconstrained case. Besides

that, the log-supermodular assumption (as implied by Theorem 5.3

below) describes situations where no negative correlations are per-

mitted between positive events—something we might expect from

knowledge about, say, HIV incidence among humans.

PROPOSITION 5.2 (Π+m SAFETY: NECESSARY CRITERION). For all A,B ⊆ Ω = {0,1}ⁿ, we have:

SafeΠ+m(A,B) ⇒ ∀ω1 ∈ AB, ∀ω2 ∈ ¯A¯B :
(ω1∧ω2 ∈ A−B and ω1∨ω2 ∈ B−A)  or  (ω1∧ω2 ∈ B−A and ω1∨ω2 ∈ A−B)    (19)

Our sufficient criterion for Π+m-safety has a very similar form, and relies on the following well-known theorem [3] (see also [6], §19):

THEOREM 5.3 (FOUR FUNCTIONS THEOREM). Let L be a distributive lattice, and let α,β,γ,δ : L → R⁺. For all A,B ⊆ L denote f[A] = Σ_{a∈A} f(a), A ∨ B = {a ∨ b | a ∈ A, b ∈ B}, and A ∧ B = {a ∧ b | a ∈ A, b ∈ B}. Then the inequality

α[A] · β[B] ≤ γ[A ∨ B] · δ[A ∧ B]

holds for all subsets A,B ⊆ L if and only if it holds for one-element subsets, i.e. iff

α(a) · β(b) ≤ γ(a ∨ b) · δ(a ∧ b)

for all elements a,b ∈ L.

³The "log-" means that supermodularity is multiplicative, not additive. The subscript "m" in Π−m, Π+m etc. means "modular."


PROPOSITION 5.4 (Π+m SAFETY: SUFFICIENT CRITERION). For all A,B ⊆ Ω = {0,1}ⁿ, either one of the two conditions below is sufficient to establish SafeΠ+m(A,B):

• AB ∧ ¯A¯B ⊆ A−B and AB ∨ ¯A¯B ⊆ B−A;

• AB ∨ ¯A¯B ⊆ A−B and AB ∧ ¯A¯B ⊆ B−A.

COROLLARY 5.5. If A is an up-set and B is a down-set (or vice versa), then SafeΠ+m(A,B).

Remark 5.6. Thus, if the user's prior knowledge is assumed to be in Π+m, a "no" answer to a monotone Boolean query always preserves the privacy of a "yes" answer to another monotone Boolean query. Roughly speaking, it is OK to disclose a negative fact while protecting a positive fact. This observation is especially helpful when A and B are given by query language expressions, whose monotonicity is often obvious.

5.1 Product Distributions

In this section we shall study the problem of checking the privacy relation SafeΠ0m(A,B) for sets A,B ⊆ Ω = {0,1}ⁿ over the family Π0m of product distributions. The independence relation that holds iff P[A]P[B] = P[AB] for all P ∈ Π0m, and which we denote by A ⊥Π0m B, has been studied by Miklau and Suciu in [21], who proved the following necessary and sufficient criterion:

THEOREM 5.7 (MIKLAU & SUCIU). For all A,B ⊆ Ω, A ⊥Π0m B if and only if sets A and B "share no critical coordinates," i.e. when coordinates 1,2,...,n can be rearranged so that only ω[1],ω[2],...,ω[k] determine if ω ∈ A, and only ω[k+1],ω[k+2],...,ω[k′], k′ ≤ n, determine if ω ∈ B.

Since A ⊥Π0m B implies SafeΠ0m(A,B), the Miklau-Suciu criterion is a sufficient criterion for our notion of privacy. It is not a necessary one, even for n = 2: we have SafeΠ0m(X1, ¯X1 ∪ X2) but not X1 ⊥Π0m (¯X1 ∪ X2), where Xi = {ω ∈ Ω | ω[i] = 1}.

Another sufficient criterion is given by Proposition 5.4, if we note that Π0m ⊂ Π+m; it implies SafeΠ0m(A,B) whenever A is an up-set and B is a down-set, or vice versa (Corollary 5.5). A little more generally, SafeΠ0m(A,B) holds if there exists a mask vector z ∈ Ω such that z ⊕ A is an up-set and z ⊕ B is a down-set. Let us call this criterion the monotonicity criterion.

It turns out that both the Miklau-Suciu and the monotonicity criteria are special cases of another simple yet surprisingly strong sufficient criterion for SafeΠ0m(A,B). This sufficient criterion shall be called the cancellation criterion, because its verification is equivalent to cancelling identical monomial terms in the algebraic expansion for the difference

P[A¯B]·P[¯AB] − P[AB]·P[¯A¯B] = P[A]·P[B] − P[AB],

where P is a product distribution written as in (17). In order to formulate the criterion in combinatorial (rather than algebraic) terms, we need the following definition:

Definition 5.8. The pairwise matching function Match(u,v) maps a pair (u,v) of vectors from Ω = {0,1}ⁿ to a single match-vector w = Match(u,v) in {0,1,∗}ⁿ as follows:

∀i = 1...n : w[i] = u[i] if u[i] = v[i];  w[i] = ∗ if u[i] ≠ v[i].

For example, pair (01011, 01101) gets mapped into 01∗∗1. We say that v ∈ Ω refines a match-vector w when v can be obtained from w by replacing its every star with a 0 or a 1. For every match-vector w, define the following two sets:

Circ(w) := {(u,v) ∈ Ω×Ω | Match(u,v) = w};
Box(w) := {v ∈ Ω | v refines w}.

Now we are ready to state the cancellation criterion, which is a sufficient criterion for SafeΠ0m(A,B), and also state a necessary criterion of a similar form, for comparison:

PROPOSITION 5.9 (CANCELLATION CRITERION). For all A,B ⊆ Ω, in order to establish SafeΠ0m(A,B) it is sufficient to verify the following ∀w ∈ {0,1,∗}ⁿ:

|A¯B × ¯AB ∩ Circ(w)| ≥ |AB × ¯A¯B ∩ Circ(w)|.

PROPOSITION 5.10 (A NECESSARY CRITERION). For all A,B ⊆ Ω, if SafeΠ0m(A,B) holds, then ∀w ∈ {0,1,∗}ⁿ:

|A¯B ∩ Box(w)| · |¯AB ∩ Box(w)| ≥ |AB ∩ Box(w)| · |¯A¯B ∩ Box(w)|.

We hope that the combinatorial simplicity of the criterion given by Proposition 5.9 will allow highly scalable implementations that apply in real-life database auditing scenarios, where sets A and B are given via expressions in a query language. The theorem below justifies our interest in the cancellation criterion:

THEOREM 5.11. If sets A,B satisfy the Miklau-Suciu criterion or the monotonicity criterion, they also satisfy the cancellation criterion.

Remark 5.12. The cancellation criterion is only sufficient, but not necessary. Here is a pair of sets that satisfies SafeΠ0m(A,B) and does not satisfy the criterion: A = {011, 100, 110, 111} and B = {010, 101, 110, 111}. Specifically, for these sets we have |A¯B × ¯AB ∩ Circ(∗∗∗)| = 0 and |AB × ¯A¯B ∩ Circ(∗∗∗)| = 2.
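The cancellation criterion is straightforward to implement verbatim: bucket pairs of vectors by their match-vector and compare the counts per bucket. The sketch below (ours) reproduces the counts quoted in Remark 5.12:

```python
from collections import Counter
from itertools import product

def match(u, v):
    # Match(u,v): keep agreeing bits, star out disagreements.
    return tuple(a if a == b else '*' for a, b in zip(u, v))

def cancellation_holds(A, B, Omega):
    notA, notB = Omega - A, Omega - B
    neg = Counter(match(u, v) for u, v in product(A & notB, notA & B))
    pos = Counter(match(u, v) for u, v in product(A & B, notA & notB))
    # Proposition 5.9: for every w, the A¬B × ¬AB count must be at
    # least the AB × ¬A¬B count, so every negative monomial cancels.
    return all(neg[w] >= pos[w] for w in set(neg) | set(pos))

Omega = {tuple(map(int, f"{i:03b}")) for i in range(8)}
A = {(0,1,1), (1,0,0), (1,1,0), (1,1,1)}
B = {(0,1,0), (1,0,1), (1,1,0), (1,1,1)}
print(cancellation_holds(A, B, Omega))  # False: at w = (*,*,*) we get 0 < 2
```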

6. GENERAL ALGEBRAIC APPROACHES

We use techniques from multivariate polynomial optimization to

test safety with respect to certain families Π of prior distributions

on an agent’s knowledge. Recall that a set A ⊆ Ω is Π-safe given

B ⊆ Ω when for all distributions P ∈ Π, we have P[AB] ≤ P[A] · P[B]. As in some previous sections, we identify the set Ω of possible worlds with the hypercube {0,1}ⁿ.

For each x ∈ {0,1}ⁿ, we create variables px ∈ [0,1]. We consider those families Π containing distributions (px)_{x∈{0,1}ⁿ} which can be described by the intersection of a finite number r of polynomial inequalities:

α1((px)_{x∈{0,1}ⁿ}) ≥ 0, ..., αr((px)_{x∈{0,1}ⁿ}) ≥ 0,  Σ_{x∈{0,1}ⁿ} px = 1,  ∀x: px ≥ 0.

We call such a family Π algebraic. For example, if we had the family of log-submodular distributions, then for all x,y ∈ {0,1}ⁿ, we would have the constraint αx,y = px py − p(x∧y) p(x∨y) ≥ 0. For the family of log-supermodular distributions, we would instead have αx,y = p(x∧y) p(x∨y) − px py ≥ 0. Finally, for the family of product distributions, we would have both px py − p(x∧y) p(x∨y) ≥ 0 and p(x∧y) p(x∨y) − px py ≥ 0.

For sets A and B, and a family of distributions Π, we define the set K(A,B,Π) of distributions (px)_{x∈{0,1}ⁿ} to be:

{ (px) :  α1((px)) ≥ 0, ..., αr((px)) ≥ 0,  Σ_{x∈{0,1}ⁿ} px = 1,  ∀x: px ≥ 0,  Σ_{w∈AB} pw > (Σ_{x∈A} px) · (Σ_{y∈B} py) }.

The following proposition is an equivalent algebraic formulation of the fact that in order for SafeΠ(A,B) to hold, there cannot be a single distribution P ∈ Π for which P[AB] > P[A] · P[B].


PROPOSITION 6.1. SafeΠ(A,B) iff the set K(A,B,Π) is

empty.

We are interested in algorithms that decide emptiness of K(A,B,Π) in time polynomial or nearly polynomial in N := 2ⁿ. Note that N does not need to be the number of possible worlds, but rather only the potentially much smaller number of possible relevant worlds in the desired application. For example, if the agent executes a combination of PROJECT and SELECT queries in SQL, he may be left only with a subset S of possible records with a small number of attributes and values for those attributes. In this case, the number N of possible relevant worlds could be very small, and algorithms for testing safety of additional queries on S which run in time polynomial or quasi-polynomial in N would be efficient.

As the following theorem shows, even when the number N of possible relevant worlds is small, we may need to restrict the class of distributions Π that we consider in order to efficiently test safety.

THEOREM 6.2. If P ≠ NP, there is an algebraic Π for which r = poly(N), each αi has degree at most 2, and for which deciding SafeΠ(A,B) cannot be done in poly(N) time.

PROOF. (sketch) The main idea is a reduction from a restricted version of the decision problem of MAX-CUT. We carefully choose constraints defining the family Π so that, given a graph G on t vertices, we can encode G into sets A,B ⊆ {0,1}ⁿ so that the constraints defining Π together with the constraint P[AB] > P[A]·P[B] define a non-empty set K(A,B,Π) iff the maximum cut size in G is sufficiently large. We need to suitably restrict the decision version of MAX-CUT so that this is possible. Here we require N = poly(t). We defer the details of the proof to the full paper.

Despite this negative result, for certain interesting families Π we

obtain efficient algorithms, as we now discuss.

6.1 Specific Distributions

We first obtain a necessary and sufficient condition for A,B ⊆ {0,1}ⁿ to be safe with respect to the family Π of product distributions by providing a deterministic algorithm. Its running time is N^O(lg lg N), which is essentially polynomial for all practical purposes. The key observation is that while K(A,B,Π) is N = 2ⁿ-dimensional for general families of distributions, for product distributions it can be embedded into Rⁿ.

Indeed, it is easy to see that K(A,B,Π) can be defined in variables p1,...,pn ∈ R constrained by pi(1−pi) ≥ 0, and for which P[AB] > P[A]·P[B], where P(ω) = ∏_{i=1..n} pi^ω[i] · (1−pi)^(1−ω[i]) for all ω ∈ {0,1}ⁿ. We can write this with n variables and n + 1 inequalities. We apply the following simplified form of Theorem 3 of Basu, Pollack, and Roy [4]:
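The embedding into Rⁿ also invites a purely numeric heuristic, complementary to the exact algorithm of [4]: maximize P[AB] − P[A]·P[B] over p ∈ [0,1]ⁿ by multi-start local search and report unsafety if a positive gap is found. A rough sketch (ours) using scipy; since local optimization can miss violations, a negative answer is only evidence, not a proof of safety:

```python
import numpy as np
from scipy.optimize import minimize

def prod_prob(p, S):
    # P[S] under the product distribution with Bernoulli vector p.
    return sum(np.prod([pi if bit else 1.0 - pi for pi, bit in zip(p, w)])
               for w in S)

def violating_product_prior_found(A, B, n, restarts=20):
    def neg_gap(p):  # minimize -(P[AB] - P[A]·P[B])
        return -(prod_prob(p, A & B) - prod_prob(p, A) * prod_prob(p, B))
    best = 0.0
    for _ in range(restarts):
        res = minimize(neg_gap, np.random.rand(n), bounds=[(0.0, 1.0)] * n)
        best = max(best, -res.fun)
    return best > 1e-9   # True: a product prior with P[AB] > P[A]P[B] exists
```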

THEOREM 6.3. Given a set K = {β1,...,βr} of r polynomials each of degree at most d in s variables with coefficients in R, the problem of deciding whether there exist X1,...,Xs ∈ R for which β1(X1,...,Xs) ≥ 0, ..., βr(X1,...,Xs) ≥ 0, can be solved deterministically with τ(rd)^O(s) bit operations, where τ is the number of bits needed to describe a coefficient in β1,...,βr.

We apply this theorem to the set K = K(A,B,Π). From the program above it is easy to see that τ, r, d, and s are all linear in n, and so emptiness (and hence safety) for product distributions can be decided in n^O(n) = N^O(lg lg N) time.

The algorithm of Basu, Pollack, and Roy uses sophisticated ideas from algebraic geometry over R, and we cannot do it justice here. The general approach taken by such algorithms is to reduce a system of polynomial inequalities to a system of polynomial equalities by introducing slack variables, and then to combine the multivariate polynomial equalities p_i(x) = 0 into a single equality q(x) def= ∑_i p_i^2(x) = 0. One finds the critical points of q(x), that is, the set V_C of common zeros of its partial derivatives over the complex field C. By perturbing q(x) and applying Bézout's Theorem, one can show that |V_C| is finite. Various approaches are used to find the subset V_R of V_C of real-valued points. Since V_R is finite, once it is found q is evaluated on each of its elements and the minimum value is taken. The main step is finding V_R, and approaches based on Gröbner bases, resultant theory, and homotopy theory exist (see [25]). The algorithm of [4] may be practical. Indeed, a similar algorithm of Canny was implemented [7].
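As a toy illustration of this reduction (our own sketch, with a made-up one-constraint instance): the inequality x − 2 ≥ 0 becomes the equality x − 2 − s^2 = 0 via a slack variable s, the resulting equalities are merged into a single q, and the critical points of q are the common zeros of its partial derivatives.

    import sympy as sp

    x, s = sp.symbols('x s', real=True)

    # Slack variable: the inequality x - 2 >= 0 becomes the equality p1 = 0.
    p1 = x - 2 - s**2
    # Several equalities p_i = 0 would be merged into q = sum_i p_i**2;
    # here there is just one.
    q = p1**2

    # Critical points of q: common zeros of its partial derivatives.
    grad = [sp.diff(q, v) for v in (x, s)]
    print(sp.solve(grad, [x, s], dict=True))
    # The zero set is the whole curve x = s**2 + 2; it becomes a finite
    # set only after the perturbation step described in the text.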

This approach generalizes to other algebraic families Π described by poly(n) constraints and O(n) variables. For instance, a family of distributions for which p_x = p_y whenever the Hamming weights of x and y are equal is described by n + 1 variables.
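Concretely (a sketch with our own naming), such a Hamming-symmetric distribution is determined by parameters q_0,...,q_n, where every world of weight k receives probability q_k:

    from math import comb

    def is_hamming_symmetric_dist(q):
        # q[0..n] defines a distribution on {0,1}^n in which every world
        # of Hamming weight k has probability q[k]; validity requires
        # q[k] >= 0 and sum_k C(n,k) * q[k] == 1.
        n = len(q) - 1
        mass = sum(comb(n, k) * q[k] for k in range(n + 1))
        return all(v >= 0 for v in q) and abs(mass - 1.0) < 1e-12

    # n = 3: the uniform distribution gives every world probability 1/8.
    print(is_hamming_symmetric_dist([0.125] * 4))  # True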

Even when the family Π of distributions requires N variables to describe, in certain cases we can obtain a polynomial-time algorithm for testing safety with respect to Π. Indeed, if the constraints α_i defining Π have degree at most 2 and there are only a constant number r of them, an algorithm in [16] shows how to decide emptiness of K(A,B,Π) in N^{O(r)} time. This algorithm makes black-box use of the earlier algorithm of Basu, Pollack, and Roy [4]. As an optimization, we note that if there are multiple linear equality constraints L_i(X1,...,Xs) = 0, it is helpful to combine them into the single quadratic constraint ∑_i L_i^2 = 0. This is because the running time is exponential in the number of constraints.

6.2 Heuristics

For most families of distributions we will have to settle for a

heuristic or an approximation for testing safety. If the program

describing K(A,B,Π) is multilinear (e.g., one can show this is

the case for log-submodular and log-supermodular distributions),

there are heuristics such as branch-and-bound or cutting-plane

techniques. See page 2 of [9].

Here we describe arguably the most practical heuristic, the sum-of-squares heuristic, introduced in [30, 31, 24], which works even for systems that are not multilinear. This heuristic was implemented with great success in [25].

The problem of minimizing a degree-d multivariate polynomial f over a set K ⊆ R^s is equivalent to finding the maximum γ ∈ R for which f(x) − γ ≥ 0 for all x ∈ K. Let P^d_+(K) be the set of all polynomials in R[x1,...,xs] of degree at most d which are non-negative on every point in K. Thus, our problem is to find the maximum γ ∈ R for which f(x) − γ ∈ P^d_+(K).

It is unknown how to optimize over P^d_+(K) efficiently, and so the following indirect route is taken. Define the set Σ^2:

    Σ^2 def= { f(x) ∈ R[x1,...,xs] | ∃ t, g1(x),...,gt(x) ∈ R[x1,...,xs] s.t. f(x) = ∑_{i=1}^{t} g_i(x)^2 }.

Notice that Σ^2 is a subset of the non-negative polynomials, as every sum of squares of polynomials is non-negative. It turns out that Σ^2 is in fact a strict subset of the non-negative polynomials, as shown non-constructively by Hilbert, and constructively by Motzkin, who provided the polynomial

    M(x,y,z) = x^4 y^2 + x^2 y^4 + z^6 − 3 x^2 y^2 z^2.

Motzkin showed that M(x,y,z) is non-negative on R^3 (non-negativity follows from the arithmetic-geometric mean inequality applied to the terms x^4 y^2, x^2 y^4, and z^6), yet inexpressible as a sum of squares of polynomials. It turns out that every non-negative polynomial can be written as a sum of squares of rational functions (functions of the form g_i(x)/h_i(x) for polynomials g_i and h_i); this was Hilbert's 17th problem, solved by Artin in 1927. While Σ^2 fails to capture all non-negative polynomials, the following proposition is a compelling reason for studying it. The proposition is proven using semidefinite programming.

PROPOSITION 6.4. For f ∈ R[x1,...,xs] of bounded degree, the test "f(x) ∈ Σ^2" can be done in poly(s) time.


Let Σ^{2,d} be those f(x) ∈ Σ^2 of degree at most d. Then Σ^{2,d} ⊆ P^d_+(R^s). To minimize f(x) over R^s, we find the largest λ ∈ R for which f(x) − λ ∈ Σ^{2,d}, via a binary search on λ and the proposition above. The value λ is a lower bound on f(x), and in practice it almost always agrees with the true minimum of f [25].
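To illustrate the semidefinite programming behind Proposition 6.4, here is a sketch on a toy instance of our own choosing, using the cvxpy modeling library: f ∈ Σ^2 iff f = m^T Q m for some positive semidefinite Gram matrix Q over a monomial basis m, and matching the coefficients of f yields linear constraints on Q.

    import cvxpy as cp

    # Is f(x,y) = x^4 + 2x^2y^2 + y^4 + x^2 + y^2 a sum of squares?
    # Take the monomial basis m = (1, x, y, x^2, x*y, y^2); f is SOS iff
    # f = m^T Q m with Q symmetric PSD. Equating the coefficient of each
    # monomial of m^T Q m with that of f gives the constraints below.
    Q = cp.Variable((6, 6), symmetric=True)
    cons = [
        Q >> 0,
        Q[0, 0] == 0,                    # 1
        2 * Q[0, 1] == 0,                # x
        2 * Q[0, 2] == 0,                # y
        Q[1, 1] + 2 * Q[0, 3] == 1,      # x^2
        2 * Q[1, 2] + 2 * Q[0, 4] == 0,  # x*y
        Q[2, 2] + 2 * Q[0, 5] == 1,      # y^2
        2 * Q[1, 3] == 0,                # x^3
        2 * Q[1, 4] + 2 * Q[2, 3] == 0,  # x^2*y
        2 * Q[1, 5] + 2 * Q[2, 4] == 0,  # x*y^2
        2 * Q[2, 5] == 0,                # y^3
        Q[3, 3] == 1,                    # x^4
        2 * Q[3, 4] == 0,                # x^3*y
        Q[4, 4] + 2 * Q[3, 5] == 2,      # x^2*y^2
        2 * Q[4, 5] == 0,                # x*y^3
        Q[5, 5] == 1,                    # y^4
    ]
    prob = cp.Problem(cp.Minimize(0), cons)
    prob.solve()
    print(prob.status)  # 'optimal' certifies f is a sum of squares;
                        # indeed f = (x^2 + y^2)^2 + x^2 + y^2.

The binary search on λ described above would simply replace the constant-term constraint by Q[0, 0] == -lam and test feasibility at each step.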

To minimize f(x) over a set K constrained by polynomials, we

need a few more tools. We could reduce the problem to minimizing

a single polynomial, as mentioned in Section 6.1, but the following

may work better in practice. We follow the presentation in [8].

Definition 6.5. The Algebraic Cone generated by elements β1,...,βt ∈ R[x1,...,xs] is the set

    A(β1,...,βt) def= { f ∈ R[x1,...,xs] | f = η + ∑_{I⊆[t]} η_I ∏_{i∈I} β_i },

where η and the η_I are in Σ^2, and [t] = {1,2,...,t}.

Thus, the algebraic cone can be thought of as the set of all affine combinations of all possible products of the polynomials β1,...,βt, where the coefficients of the affine combination are taken from Σ^2.

Definition 6.6. The Multiplicative Monoid M(β1,...,βt) generated by β1,...,βt ∈ R[x1,...,xs] is the set of finite products of the β_i, including the empty product, which we set to 1.

The key result is a simplified form of the Positivstellensatz [32]:

THEOREM 6.7. Given polynomials {f1,...,f_{t1}} and {g1,...,g_{t2}} in R[x1,...,xs], the set

    K def= { x ∈ R^s : f_i(x) ≥ 0, g_j(x) = 0, ∀ i ∈ [t1], j ∈ [t2] }

is empty iff there exist F ∈ A(f1,...,f_{t1}) and G ∈ M(g1,...,g_{t2}) for which F + G^2 is the zero polynomial.
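As a toy illustration (our own example, not from the text): the set K = { x ∈ R : x − 2 ≥ 0, 1 − x ≥ 0 } is empty, and a certificate is F = 1·(x − 2) + 1·(1 − x) = −1, which lies in A(x − 2, 1 − x) because the coefficients 1 are squares, together with G = 1, the empty product in the generator-free monoid; indeed F + G^2 = −1 + 1 = 0.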

Thus, for a set K described by f_i and g_j of the form above, we consider K' def= K ∩ { x ∈ R^s | γ − f(x) ≥ 0, f(x) − γ = 0 }. K' is empty iff f(x) > γ for all x ∈ K.

Heuristics implemented in practice work by choosing a degree bound D, generating all G ∈ M(f − γ, g1,...,g_{t2}) of degree at most D (there are at most t_2^D such G), and checking whether there is an F ∈ A(γ − f, f1,...,f_{t1}) for which F + G^2 = 0, via semidefinite programming. This is efficient for constant D, which usually suffices in practice. Better algorithms for special cases are based on alternative forms of the Positivstellensatz; see [27, 28].
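A minimal sketch of the enumeration step (our own code; the generators and degree bound are made up): it lists the elements of M(β1,...,βt) of total degree at most D, each of which would then be handed, together with the cone-membership test, to a semidefinite feasibility solver as above.

    import itertools
    import sympy as sp

    def monoid_up_to_degree(gens, symbols, D):
        # All finite products of gens (the multiplicative monoid),
        # including the empty product 1, of total degree at most D.
        # Since each generator has degree >= 1, products of more than
        # D generators can be skipped.
        elems = {sp.Integer(1)}
        for k in range(1, D + 1):
            for combo in itertools.combinations_with_replacement(gens, k):
                g = sp.expand(sp.prod(combo))
                if sp.Poly(g, *symbols).total_degree() <= D:
                    elems.add(g)
        return elems

    x, y = sp.symbols('x y')
    # Monoid generated by {x - 1, y}, truncated at degree D = 2:
    # 1, x - 1, y, (x - 1)^2, y*(x - 1), y^2 (in expanded form).
    for g in monoid_up_to_degree([x - 1, y], (x, y), 2):
        print(g)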

7. CONCLUSION

We presented a novel approach to privacy where only gaining confidence in a sensitive fact is illegal, while losing confidence is allowed. We showed that this relaxation is significant and permits many more queries than well-known approaches do. In exchange, it gave us an opportunity to strengthen prior-knowledge assumptions beyond current standards. Our hope is that work in this direction will help bridge the gap between theoretical soundness and practical usefulness of privacy frameworks.

One possible future goal is to obtain a better understanding of the families of sets and distributions that arise in practice, and to understand whether they admit efficient privacy tests. Another goal is to apply the new frameworks to online (proactive) auditing, which will require modeling a user's knowledge about the auditor's query-answering strategy.

Acknowledgements: We thank Kenneth Clarkson for bringing the Four Functions Theorem to our attention.

8. REFERENCES

[1] R. Agrawal, R. J. Bayardo, C. Faloutsos, J. Kiernan, R. Rantzau, and R. Srikant. Auditing compliance with a Hippocratic database. In Proc. VLDB, pages 516–527, 2004.

[2] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Hippocratic databases.

In Proc. VLDB, pages 143–154, 2002.


[3] R. Ahlswede and D. E. Daykin. An inequality for the weights of two

families of sets, their unions and intersections. Z. Wahrschein. und

Verw. Gebiete, 43:183–185, 1978.

[4] S. Basu, R. Pollack, and M.-F. Roy. On the combinatorial and

algebraic complexity of quantifier elimination. J. ACM,

43(6):1002–1045, 1996.

[5] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy:

The SuLQ framework. In Proc. PODS, pages 128–138, 2005.

[6] B. Bollobás. Combinatorics. Cambridge Univ. Press, 1986.

[7] J. Canny. Improved algorithms for sign determination and existential

quantifier elimination. Computer Journal, 36(5):409–418, 1993.

[8] C. Caramanis. Non-convex optimization via real algebraic geometry,

2001. http://web.mit.edu/~cmcaram/www/pubs/

nonconvex_opt_review.pdf.

[9] C. P. de Campos and F. G. Cozman. Computing lower and upper

expectations under epistemic independence. In Proc. 4th Intl. Symp.

on Imprecise Probabilities and Their Apps., 2005.

[10] I. Dinur and K. Nissim. Revealing information while preserving

privacy. In Proc. PODS, pages 202–210, 2003.

[11] C. Dwork and K. Nissim. Privacy-preserving datamining on

vertically partitioned databases. In Proc. CRYPTO, pages 528–544,

2004.

[12] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches

in privacy preserving data mining. In Proc. PODS, pages 211–222,

2003.

[13] R. Fagin, J. Y. Halpern, Y. Moses, and M. Y. Vardi. Reasoning About

Knowledge. The MIT Press, 1995. Paperback edition appeared in

2001.

[14] R. Fagin, J. Y. Halpern, and M. Y. Vardi. A model-theoretic analysis

of knowledge. J. ACM, 38(2):382–428, 1991.

[15] S. Fujishige. Submodular Functions and Optimization, volume 58 of

Annals of Discrete Mathematics. Elsevier, 2nd edition, 2005.

[16] D. Grigoriev, E. de Klerk, and D. V. Pasechnik. Finding optimum

subject to few quadratic constraints in polynomial time. In Proc.

Conf. on Effective Methods in Algebraic Geometry (MEGA), 2003.

[17] J. Hintikka. Knowledge and Belief. Cornell University Press, 1962.

[18] K. Kenthapadi, N. Mishra, and K. Nissim. Simulatable auditing. In

Proc. PODS, pages 118–127, 2005.

[19] S. Kripke. A semantical analysis of modal logic I: normal modal

propositional calculi. Zeitschrift für Mathematische Logik und

Grundlagen der Mathematik, 9:67–96, 1963. Announced in J. of

Symbolic Logic 24, 1959, p. 323.

[20] L. Lovász. Submodular functions and convexity. In A. Bachem,

M. Grötschel, and B. Korte, editors, Mathematical Programming –

The State of the Art, pages 235–257. Springer-Verlag, 1983.

[21] G. Miklau and D. Suciu. A formal analysis of information disclosure

in data exchange. In Proc. SIGMOD, pages 575–586, 2004.

[22] R. Motwani, S. U. Nabar, and D. Thomas. Auditing SQL queries. In

Proc. ICDE, 2008. to appear.

[23] S. U. Nabar, B. Marthi, K. Kenthapadi, N. Mishra, and R. Motwani.

Towards robustness in query auditing. In Proc. VLDB, pages

151–162, 2006.

[24] P. A. Parrilo. Structured semidefinite programs and semialgebraic

geometry methods in robustness and optimization, 2000. Ph.D.

Thesis, California Institute of Technology.

[25] P. A. Parrilo and B. Sturmfels. Minimizing polynomial functions. In

Algorithmic and Quantitative Aspects of Real Algebraic Geometry in

Mathematics and Computer Science, pages 83–100, 2001.

[26] President’s Information Technology Advisory Committee.

Revolutionizing health care through information technology, 2004.

[27] M. Putinar. Positive polynomials on compact semi-algebraic sets.

Indiana University Math Journal, 42(3), 1993.

[28] K. Schmüdgen. The K-moment problem for compact semi-algebraic sets. Mathematische Annalen, 289:203–206, 1991.

[29] C. E. Shannon. Communication theory of secrecy systems. Bell

System Technical Journal, 28-4:656–715, 1949.

[30] N. Z. Shor. Class of global minimum bounds of polynomial

functions. Cybernetics, 6:731–734, 1987.

[31] N. Z. Shor and P. I. Stetsyuk. The use of a modification of the

r-algorithm for finding the global minimum of polynomial functions.

Cybernetics and Systems Analysis, 33:482–497, 1997.

[32] G. Stengle. A Nullstellensatz and a Positivstellensatz in semialgebraic geometry. Mathematische Annalen, 207:87–97, 1974.

[33] G. H. v. Wright. An Essay in Modal Logic. North-Holland, 1951.
