Curie: Policy-based Secure Data Exchange
Z. Berkay Celik
SIIS Laboratory, Department of CSE
The Pennsylvania State University
zbc102@cse.psu.edu
Abbas Acar, Hidayet Aksu
CPS Security Lab, Department of ECE
Florida International University
aacar001,haksu@u.edu
Ryan Sheatsley
SIIS Laboratory, Department of CSE
The Pennsylvania State University
rms5643@cse.psu.edu
Patrick McDaniel
SIIS Laboratory, Department of CSE
The Pennsylvania State University
mcdaniel@cse.psu.edu
A. Selcuk Uluagac
CPS Security Lab, Department of ECE
Florida International University
suluagac@u.edu
ABSTRACT
Data sharing among partners—users, companies, organizations—is crucial for the advancement of collaborative machine learning in many domains such as healthcare, finance, and security. Sharing through secure computation and other means allows these partners to perform privacy-preserving computations on their private data in controlled ways. However, in reality, there exist complex relationships among members (partners). Politics, regulations, interest, trust, and data demands and needs prevent members from sharing their complete data. Thus, there is a need for a mechanism to meet these conflicting relationships on data sharing. This paper presents Curie¹, an approach to exchange data among members who have complex relationships. A novel policy language, CPL, that allows members to define the specifications of data exchange requirements is introduced. With CPL, members can easily assert who and what to exchange through their local policies and negotiate a global sharing agreement. The agreement is implemented in a distributed privacy-preserving model that guarantees sharing among members will comply with the policy as negotiated. The use of Curie is validated through an example healthcare application built on recently introduced secure multi-party computation and differential privacy frameworks, and policy and performance trade-offs are explored.
CCS CONCEPTS
• Information systems → Data exchange; • Security and privacy → Economics of security and privacy.
KEYWORDS
Collaborative learning; policy language; secure data exchange
1 INTRODUCTION
Inter-organizational data sharing is crucial to the advancement of many domains including security, health care, and finance. Previous works have shown the benefit of data sharing within distributed, collaborative, and federated learning [5, 12, 37].
¹Our paper is named after Marie Curie, a physicist and chemist who conducted pioneering research in health care and won the Nobel Prize twice.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CODASPY '19, March 25–27, 2019, Richardson, TX, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6099-9/19/03...$15.00
https://doi.org/10.1145/3292006.3300042
Figure 1: An illustration of the data exchange requirements of countries learning a predictive model on their shared data. Arrows show the data requirements of countries.
Privacy-preserving machine learning offers data sharing among multiple members while avoiding the risks of disclosing the sensitive data (e.g., healthcare records, personally identifiable information) [14]. For example, secure multiparty computation enables multiple members, each with its training dataset, to collaboratively learn a shared predictive model without revealing their datasets [31]. These approaches solve the privacy concerns of members during model computation, yet do not consider the complex relationships such as regulations, competitive advantage, data sovereignty, and jurisdiction among members on private data sharing. Members want to be able to articulate and enforce their conflicting requirements on data sharing.
To illustrate such complex data sharing requirements, consider health care organizations that collaborate on a joint prediction model for the diagnosis of patients experiencing blood clots (see Figure 1). Members wish to dictate their needs through their legal and political limitations as follows: U.S.1 is able to share its complete data with nation-wide members (U.S.2) [3, 23], yet it is obliged to share the data of patients deployed in NATO countries with NATO members (UK) [17]. However, U.S.1 wishes to acquire all patient data from other countries. UK is able to share and acquire complete data from NATO members, yet it desires to acquire only data of certain race groups from U.S.1 to increase its data diversity. RU wishes to share and acquire complete data from all members, yet members limit their data share to Russian citizens who live in their countries. Such complex data sharing requirements also commonly occur today in non-healthcare systems [28, 38]. For instance, the National Security Agency has varying restrictions on how human intelligence is shared with other countries; financial companies share data based on trust and competition with each other.
This paper presents a policy-based data exchange approach,
called Curie, that allows secure data exchange among members
that have such complex relationships. Members specify their requirements on data exchange using a policy language (CPL). The requirements defined with the use of CPL form the local data exchange policies of members. Local policies are defined separately for data sharing and data acquisition. This property allows asymmetric relations on data exchange. For example, a member does not necessarily have to acquire the data that the other members dictate to share. By using these two policies, members specify statements of whom to share with/acquire from and what to share/acquire. The statements are defined using conditional and selection expressions. Selections allow members to filter data and limit the data to be exchanged, whereas conditional expressions allow members to define logical statements. Another advanced property of CPL is predefined data-dependent conditionals for calculating statistical metrics between members' data. For instance, members can define a conditional to compute the intersection size of data columns without disclosing their data. This allows members to define content-dependent conditional data exchange in their policies.
Once members have defined their local policies, they negotiate a sharing agreement. The guarantee provided by Curie is that all data exchanged among members will respect the agreement. The agreement is executed in a multi-party privacy-preserving prediction model enhanced with optional differential privacy guarantees.
In this work, we make the following contributions:
• We introduce Curie, an approach for secure data exchange among members that have complex relationships. Curie includes the CPL policy language, which allows members to define complex specifications of data exchange requirements, negotiate an agreement, and execute agreements in a multi-party predictive model that respects the negotiated policy.
• We validate Curie through a real healthcare application used to prescribe warfarin dosage. A privacy-preserving joint dose model among medical institutions is compiled with the use of various data exchange policies while protecting the privacy of members' healthcare records. We show Curie incurs low overhead and that policies are effective at improving the dose accuracy of medical institutions.
We begin in the next section by defining the analysis task and outlining the security and attacker models.
2 PROBLEM SCOPE AND ATTACKER MODEL
Problem Scope. We introduce the Curie Policy Language (CPL) to express the data exchange requirements of distributed members. Unlike the programming languages used for writing secure multi-party computation (MPC) [24, 33] and the frameworks designed for privacy-preserving machine learning (ML) [7, 14, 29, 31, 32], CPL is a policy language in a Backus Normal Form (BNF) notation to express the conflicting relationships of members on data sharing. Members can express data exchange requirements using conditionals, selections, and secure pairwise data-dependent statistics. Curie then enforces the policy agreements in a shared predictive model through an MPC protocol that ensures members comply with the policies as negotiated.
We integrate Curie into 24 medical institutions. Without the deployment of Curie, institutions compute the warfarin dosage of a patient using a model computed on their local patient records. Curie allows institutions to construct various consortia wherein each member defines a data exchange policy for other members via CPL.

Figure 2: Curie data exchange process in a collaborative learning setting (consortium and local policies; policy negotiations; computations on shared data). The dashed boxes show data remains confidential.
This enables institutions to acquire patient records based on regulations, as well as the records that they need to improve the accuracy of their dose predictions. Curie implements a privacy-preserving dose model through homomorphic encryption (HE) to enforce the policy agreements of the members. We note that a centralized party in HE cannot provide a privacy-preserving model on negotiated data [39]. However, Curie implements a novel protocol that allows institutions to perform local computations by aggregating the intermediate results of the dose model. Additionally, Curie implements an optional differential privacy (DP) mechanism that allows institutions to compute a differentially private secure dose model. DP guarantees that the released dose model does not leak information about a targeted individual (i.e., a patient) with high confidence.
Threat Model. We consider a semi-honest adversary model. That is, members in a consortium run the protocol exactly as specified, yet they try to learn as much as possible about the dataset inputs of the other members from their views of the protocol. Additionally, we consider a non-adaptive adversary wherein members cannot modify the inputs of their dataset once the protocol on shared data is initiated.
3 ORGANIZATIONAL DATA EXCHANGE
Depicted in Figure 2, Curie includes two independent parts: policy management and multiparty secure computation.
Policy Management. We define a consortium as a group made up of two or more members (individuals, companies, or governments) (a). Members of a consortium aim to compute a predictive model m over their confidential data in a secure manner. For instance, data may be curated from the medical history of patients or the financial reports of companies with the objective of building an ML model. Moreover, each member wants to enforce a set of local constraints toward other consortium members to control how and with whom they share their confidential data. These constraints define a member's interest, trust, regulations, and data demands, and they also impact the accuracy of a model m. Thus, there is a need for connecting the data needs of members to the privacy-preserving models. In Curie, each member of a consortium defines a local policy (b). The local policy of a member dictates the requirements of data exchange as follows:
(1) The member wishes to specify with whom to share and acquire data (partnership requirement).
(2) The member wishes to define what data to share and acquire (sharing and acquisition requirement).
In this, the member wishes to refine its sharing and acquisition requirements to express the following:
(1) The member wishes to dictate a set of conditions to restrict data sharing and select which data to be acquired (conditional selective share and acquisition); and
(2) The member wishes to dictate conditionals based on the other member's data (data-dependent conditionals).
The policies of members need not be (nor are likely to be) symmetric. A local policy is defined with requirements for sharing and acquisition tailored to each partner member in the consortium, thus allowing each pairwise sharing to be unique. Here, the local policies are used to negotiate pairwise sharing within the consortium. To illustrate how members negotiate an agreement, consider the consortium of three members in Figure 3.
Figure 3: An example consortium of three members.
Each member initiates pairwise policy negotiations with other members to reconcile contradictions between acquisition and share policies (c). A member starts the negotiation by sending a request message including the acquisition policy defined for a member. When a member receives the acquisition policy, it reconciles the received acquisition policy with its share policy specified for that member. Three negotiation outcomes are possible: the acquisition policy is entirely satisfied, partially satisfied with the intersection of the acquisition and share policies, or is an empty set. A member completes its negotiations after all of its acquisition policies for interested parties are negotiated.
Computations on Negotiated Data. Once members negotiate their policies (d), Curie provides a multiparty data exchange device using secure multi-party computation techniques enhanced with (optional) differential privacy guarantees. This device ensures data and individual privacy. The guarantee provided by Curie is that all computations among members will respect their policies.
To ensure data privacy, Curie includes cryptographic primitives such as homomorphic encryption (HE) and garbled circuits from the secure multi-party computation literature that allow members to perform computations on negotiated data with no disclosed data from any single member. At the end of the secure computation, all of the parties obtain a final predictive model based on their policy negotiations. To ensure the privacy of the individuals in the dataset on which the final model is computed, Curie integrates differential privacy (DP). DP protects against an attacker who tries to extract a particular individual's data in the dataset from the final model computed at the end of the secure computation protocol.
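To make the role of DP concrete, the following is a minimal sketch of output perturbation, adding Laplace noise to the released model coefficients. This is a generic illustration of the idea, not the mechanism detailed in Appendix B; the sensitivity and epsilon values are hypothetical placeholders.

# A generic illustration of differential privacy via output perturbation,
# not the DP mechanism of Appendix B: Laplace noise with scale
# sensitivity/epsilon is added to the coefficients before release.
import numpy as np

def dp_release(coefficients: np.ndarray, sensitivity: float, epsilon: float) -> np.ndarray:
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon, size=coefficients.shape)
    return coefficients + noise

beta = np.array([0.8, -1.2, 0.4])                      # placeholder model coefficients
print(dp_release(beta, sensitivity=0.1, epsilon=1.0))  # noisy coefficients for release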
4 CURIE POLICY DESCRIPTION LANGUAGE
We now illustrate the format and semantics of the Curie Policy Language (CPL). A BNF description of CPL is presented in Appendix A.
Turning to the example consortium in Figure 3 established with three members, each member defines its requirements for other members on a dataset having the columns age, race, genotype, and weight (see Table 1). The criteria defined by members are used throughout to construct their local policies.
Consortium member: M1
  M2: desires to acquire the complete data of users who are older than 25
  M2: shares its complete data
  M3: desires to acquire Asian users such that the Jaccard similarity of its age column and M3's age column is greater than 0.3
  M3: shares its complete data
Consortium member: M2
  M1: desires to acquire complete data
  M1: limits its share to EU and NATO citizen users if M1 is both a NATO and EU member and located in North America; otherwise, it shares only White users
  M3: desires to acquire complete data if M3 is a NATO member
  M3: shares its complete data
Consortium member: M3
  M1: desires to acquire the complete data of users having genotype 'A/A'
  M1: shares complete data if the intersection size of its and M1's genotype columns is less than 10; otherwise, it shares data of users who weigh more than 150 pounds
  M2: desires to acquire complete data
  M2: shares complete data if M2 is an EU member and its data size is greater than 1K
Table 1: An example of members' data exchange requirements.
Share and Acquisition Clauses. Curie policies are collections of clauses. The collection of clauses for partners defines the local policy of a member. The clauses allow each member to dictate a member-specific policy for each other member. Clauses have the following structure:

clause tag : members : conditionals :: selections;

Clause tags are reference names for policy entries. Share and acquire are two reserved tags. These clauses are comprised of three parts. The first part, members, defines a list of members with whom to share and acquire. This can be a single member or a comma-separated list of members. An empty member entry matches all members. The second part, conditionals, is a list of conditions controlling when this clause will be executed. A condition is a Boolean function which expresses whether the share or acquire is allowed or not. For instance, a member may define a condition where the data size is greater than a specific value. Only if all conditions listed in conditionals are true is the clause executed. The last part, selections, states what to share or acquire. It can be a list of filters on a member's data. For instance, a member may define a filter on a column of a dataset to limit acquisition to a subset of the dataset. More complex selections can be assigned using member-defined sub-clauses. A sub-clause has the following structure:

tag : conditionals :: selections;

where tag is the name of the sub-clause; conditionals is, as explained above, a list of conditions stating whether this clause will be executed; and selections is a list of filters or a reference to a new sub-clause. Complex data selection can be addressed with nested sub-clauses. CPL allows members to define multiple clauses. For instance, a member may share a distinct subset of data under different conditions. CPL evaluates multiple clauses in top-down order. When the conditionals of a clause evaluate to false, evaluation moves to the next clause until a clause is matched or the end of the policy file is reached.
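To illustrate this evaluation order, the sketch below is a minimal Python rendering of top-down clause matching (our illustration; the clause and record structures are hypothetical, not Curie's implementation):

# A minimal sketch (not the authors' implementation) of CPL's top-down
# clause evaluation: the first clause whose conditionals all hold is applied.
from dataclasses import dataclass
from typing import Callable, List, Optional

Record = dict  # one data row, e.g., {"age": 30, "race": "Asian"}

@dataclass
class Clause:
    tag: str                                    # e.g., "share" or "acquire"
    conditionals: List[Callable[[], bool]]      # all must be true to match
    selections: List[Callable[[Record], bool]]  # filters applied to the data

def evaluate_policy(clauses: List[Clause], data: List[Record]) -> Optional[List[Record]]:
    for clause in clauses:                      # top-down order
        if all(cond() for cond in clause.conditionals):
            selected = data
            for keep in clause.selections:      # apply each filter in turn
                selected = [r for r in selected if keep(r)]
            return selected                     # first matching clause wins
    return None                                 # no clause matched

For instance, a clause with conditional lambda: True and selection lambda r: r["age"] > 25 reproduces M1's acquisition criterion from Table 1.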
Conditionals and Selections. We present the use of conditionals and selections through example policies; their format and semantics are detailed below. Consider an example of two members, M1 and M2, within a consortium. They define their local policies as:

@M1  acquire : M2 : :: s1;
     share   : M2 : :: ;
@M2  acquire : M1 : :: ;
     share   : M1 : c1, c2 :: fine-select;
     fine-select : c3 :: s2;
     fine-select : :: s3;
where c1, c2, and c3 are conditionals; s1, s2, and s3 are selections; and fine-select is a tag defined by M2.
The acquire clause of M1 states that data is requested from M2 after it applies the s1 selection (e.g., age > 25) to its data. In contrast, its share clause allows a complete share of its data if M2 requests it. On the other hand, the acquisition clause of M2 dictates requesting complete data from M1. However, M2 allows data sharing if the acquisition clause issued by M1 holds the c1 and c2 conditions (e.g., M1 is both a NATO and EU member). Then, M2 delegates selection to the member-defined fine-select sub-clauses. fine-select states that if the request satisfies the c3 condition (located in North America), then the request is met with the data that is selected by the s2 selection (e.g., limiting the share of its data to citizens of NATO and EU member countries). Otherwise, it shares the data specified by selection s3 (White users).
CPL supports selections through lters. A lter contains zero or
more operations over data inputs describing the share and acquisi-
tion criteria to be enforced. Operations are dened as keywords or
symbols such as
<
,
>
,
=
,
in
,
lik e
, and so on. Selections and lters
are dened in CPL as follows:
selections::= <filters> | <tag>
<filters> ::= <filter> [‘,’ <filters>]
<filter> ::= <var> <operation> <value> | ‘’
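As an illustration of the filter grammar, the following sketch (ours; only a few of the operations are shown) applies a single "<var> <operation> <value>" filter to a list of records:

# A small sketch (not Curie's parser) of applying one CPL <filter> of the
# form "<var> <operation> <value>" to a list of records.
import operator

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

def apply_filter(records, filter_str):
    var, op, value = filter_str.split()       # e.g., "age > 25"
    try:
        value = float(value)                  # numeric comparison when possible
    except ValueError:
        pass                                  # keep strings, e.g., race = Asian
    return [r for r in records if OPS[op](r[var], value)]

patients = [{"age": 30, "race": "Asian"}, {"age": 22, "race": "White"}]
print(apply_filter(patients, "age > 25"))     # [{'age': 30, 'race': 'Asian'}]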
Selections are executed when conditionals evaluate to true. Conditionals can be consortium- and dataset-specific. For instance, a member may require other members to be in a particular country or in an alliance such as NATO, and to have a dataset size greater than a particular value. Such conditionals do not require any data exchange between members to be evaluated. However, members may want to incorporate a relation between their data and other members' data into their policies, as detailed next.
Data-dependent Conditionals. A member's decision on whether to share or to acquire data can depend on another member's data. Simply put, one example of a data-dependent conditional among two members could be whether the intersection size of two sets (e.g., a specific column of a dataset) is not too high. Given such knowledge, a member can make a conditional decision about the share or acquisition of that data. For instance, consider a list of private IP addresses used for blacklisting domains. If a member knows that the intersection size is close to zero, then the member may dictate an acquire clause to request complete features from that member based on IP addresses [18].
CPL defines an evaluate keyword for data-dependent conditionals through functions on data. Data-dependent conditionals take the following form:

conditionals ::= <var> '=' <value> [',' <conditionals>]
               | 'evaluate' '(' <data_ref> ',' <alg_arg> ',' <thshold_arg> ')' [',' <conditionals>] | ''
A member that uses the data-dependent conditionals denes a
reference data (data_ref) required for a such computation, an algo-
rithm (alg_arg) and a threshold (thshold_arg) that is compared with
the output of the computation. CPL includes four algorithms for
data-dependent conditionals (see Table 2). To be brief, intersection
size measures the size of the overlap between two sets; Jaccard index
is a statistic measure of similarity between sets; Pearson correlation
is a statistical measure of how much two sets are linearly depen-
dent; and Cosine similarity is a measure of similarity between two
vectors. Each algorithm is based on a dierent assumption about
the underlying reference data. However, central to all of them is to
privately (without leaking any sensitive data) measure a relation
Pairwise alg. Output Private protocol Proof
Intersection size |Di∩ Dj|Intersection cardinality [11]
Jaccard index (|DiDj|)/(| Di∪ Dj|)Jaccard similarity [6]
Pearson correlation (COV (Di,Dj) )/(σDiσDj)Garbled circuits [25]
Cosine similarity (DiDj)/(Di∥ ∥ Dj)Garbled circuits [25]
Table 2: CPL data-dependent conditional algorithms. Two members
of a consortium use the conditionals to compute the pairwise sta-
tistics. The members then use the output of the algorithm to deter-
mine whether to acquire or share data from another party. (Diand
Djare the inputs of a dataset, and σis std. deviation).
between two members’ data to oer an eective data exchange. We
note that these algorithms are found to be eective in capturing
input relations in datasets [18, 19].
Data-dependent conditionals are implemented through private protocols (as defined in Table 2). These protocols are implemented with the cryptographic tools of garbled circuits and private functions. The protocols preserve the confidentiality of data. That is, each member gets the output indicated in Table 2 without revealing its sensitive data in plain text. After the private protocol terminates, the output of the algorithm is compared with a threshold value set by the requester. If the output is below the threshold value, the conditional evaluates to true. Returning to the above example, M3 joins the consortium, and M1 and M2 extend their local policies for M3 (M3 likewise defines its own):

@M1  acquire : M3 : evaluate(local data, 'Jaccard', 0.3) :: race=Asian;
     share   : M3 : :: ;
@M2  acquire : M3 : M3 in $NATO :: ;
     share   : M3 : :: ;
@M3  acquire : M1 : :: Genotype = 'A/A';
     share   : M1 : evaluate(local data, 'intersection size', 10) :: ;
     share   : M1 : :: weight>150;
     acquire : M2 : :: ;
     share   : M2 : M2 in $EU, size(data)>1K :: ;
The acquire clause of M1 defines a data-dependent conditional for M3. It defines a Jaccard measure on its local data through the evaluate keyword and sets its threshold value equal to 0.3. M3 agrees to share its local data with M1 if the intersection size of its local data is less than 10. Otherwise, it consults the next share clause defined for M1, which states that individuals with a weight greater than 150 pounds will be shared. All other share and acquire clauses are trivial. Members agree to share and acquire complete data based on data size (data size > 1K), alliance membership (e.g., NATO or EU member), and inputs (e.g., genotype).
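For reference, the plaintext versions of the four pairwise statistics in Table 2 can be sketched as follows. These illustrate only what the private protocols compute; in Curie the values are obtained via private set intersection cardinality or garbled circuits, never by exchanging raw columns like this.

# Plaintext reference versions of the four pairwise statistics in Table 2.
import math

def intersection_size(di, dj):
    return len(set(di) & set(dj))

def jaccard_index(di, dj):
    si, sj = set(di), set(dj)
    return len(si & sj) / len(si | sj)

def pearson_correlation(di, dj):
    n = len(di)
    mi, mj = sum(di) / n, sum(dj) / n
    cov = sum((a - mi) * (b - mj) for a, b in zip(di, dj)) / n
    sd_i = math.sqrt(sum((a - mi) ** 2 for a in di) / n)
    sd_j = math.sqrt(sum((b - mj) ** 2 for b in dj) / n)
    return cov / (sd_i * sd_j)

def cosine_similarity(di, dj):
    dot = sum(a * b for a, b in zip(di, dj))
    return dot / (math.sqrt(sum(a * a for a in di)) * math.sqrt(sum(b * b for b in dj)))

# Threshold semantics from this section: the conditional is true when the
# protocol output falls below the requester's threshold.
def evaluate(output: float, threshold: float) -> bool:
    return output < threshold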
Putting the pieces together, CPL allows members to independently define a data exchange policy with share and acquire clauses. The policies are dictated through conditionals and selections. This allows members to dictate policies in complex and asymmetric relationships. As defined in Section 3, CPL enables members to dictate partnership, share, acquisition, and data-dependent conditionals.
Policy Negotiation and Conicts.
Data exchange between mem-
bers is governed by matching share and acquire clauses in each
member’s respective policies. Both share and acquire clauses state
conditions and selections on the data exchanged. Consider two ex-
ample local policies with a share clause @
m2
(
share
:
m1
:
c1
::
s1
)
and matching acquire clause @
m1
(
acquir e
:
m2
:
c2
:
s2
). Curie’s
negotiation algorithm respects both autonomy of the data owner
and the needs of the requester. It conservatively negotiates share
Policy ID Consortium Name Policy Denition Acquisition Policy Share Policy
P.1 Single Source Each member uses its local patient dataset to learn warfarin dose model. ✗ ✗
P.2 Nation-wide Members in the same country establish a consortium based on state and country laws. ✓ ✓
P.3 Regional Members in the same continent establish a consortium. ✓ ✓
P.4 NATO-EU NATO and EU members establish a consortium independently based on their mutual agreements. ✓ ✓
P.5 Global Members exchange their complete data to build the warfarin dose model. ✓ ✓
Table 3: Consortia constructed among members. Acquisition and share policies of members for each consortium are studied in Section 6.
It conservatively negotiates share and acquire clauses such that it returns the intersection of the respective data sets in the resulting policy assignment. The resolved policy in this example is share : m1 : c1 ∧ c2 :: s1 ∩ s2, which states that the data exchange from m2 to m1 is subject to both the c1 and c2 conditionals, and the resulting sharing has the s1 and s2 selections on m2's data. This authoritative negotiation makes sure no member's data is shared beyond its explicit intent, regardless of how the other members' policies are defined. This is because a negotiation fulfilling the criteria for each clause is based on the union of the logical expressions defined in the two policies. Each member runs the negotiation algorithm for the members found in its member list. After all members terminate their negotiations, the negotiated policy is enforced in computations.
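A minimal sketch of this conservative resolution rule (our illustration, with conditionals and selections reduced to string sets) is:

# Sketch of Curie's conservative pairwise negotiation: the resolved clause
# conjoins both members' conditionals and intersects their selections, so no
# data flows beyond the owner's intent.
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class NegotiationClause:
    conditionals: FrozenSet[str]  # e.g., frozenset({"m1 in NATO"})
    selections: FrozenSet[str]    # e.g., frozenset({"age > 25"})

def negotiate(share: NegotiationClause, acquire: NegotiationClause) -> NegotiationClause:
    # Resolved policy: share : m1 : c1 AND c2 :: s1 INTERSECT s2.
    # Applying both filter sets yields the intersection of the data sets,
    # so the resolved clause carries the union of the filter expressions.
    return NegotiationClause(
        conditionals=share.conditionals | acquire.conditionals,  # both must hold
        selections=share.selections | acquire.selections,        # both filters apply
    )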
5 DEPLOYMENT OF CURIE
To validate Curie in a real application, we integrated Curie into 24 medical institutions. Each institution wants to compute a warfarin dose model on the distributed dataset without disclosing the patient health-care records. Without the deployment of Curie, institutions compute the warfarin dosage of a patient using a model computed on their local patient data. Curie first enables institutions to negotiate their data exchange requirements through CPL. In this, Curie allows members to construct various consortia wherein each member defines a data exchange policy for other members. The next step is to compute a privacy-preserving dose model such that each party does not learn any information about the patient records of other medical institutions and respects the negotiated policy. Curie implements a secure dose protocol through homomorphic encryption (HE) to enforce the policy agreements of the members. We next present the deployment of Curie to institutions (Section 5.1) and the integration of policy agreements in the warfarin dose model (Section 5.2).
5.1 Deployment Setup
Warfarin, known by the brand name Coumadin, is a widely prescribed (over 20 million times each year in the United States) anticoagulant medication. It is mainly used to treat (or prevent) blood clots (thrombosis) in veins or arteries. Taking high-dose warfarin causes thin blood, which may result in intracranial and extracranial bleeding. Taking low doses causes thick blood, which may result in embolism and stroke. Current clinical practices suggest a fixed initial dose of 5 or 10 mg/day. Patients regularly have a blood test to check how long it takes for blood to clot (the international normalized ratio (INR)). Based on the INR, subsequent doses are adjusted to maintain the patient's INR at the desired level. Therefore, it is important to predict the proper warfarin dose for patients.
Consortium Members. 24 medical institutions from nine countries and four continents individually collected the largest patient data for predicting personalized warfarin dose (see Appendix D for details of the members involved in the study).
Figure 4: Secure dose algorithm protocol. The initiating member Pi generates an HE key pair (Ki, Kpi), initializes random values Oi and Vi, and sends E(Oi)Ki and E(Vi)Ki to the next member in the ring. After policy reconciliation, each subsequent member Pj computes its local statistics Oj = Xj^T Xj and Vj = Xj^T Yj and homomorphically adds E(Oj)Ki and E(Vj)Ki before passing the message on. In the final phase, Pi receives Σ E(Oi)Ki and Σ E(Vi)Ki from Pn, decrypts them with Kpi, replaces its random values with its true statistics Ot_i = Xi^T Xi and Vt_i = Xi^T Yi (O = O - Oi + Ot_i, V = V - Vi + Vt_i), and computes the global parameters η = O^{-1} V from the negotiated data.
Members collect 68 inputs from patients' genotypic, demographic, and background information, yet a long study concluded that eight inputs are sufficient for proper prescriptions [26].
Warfarin Dose Prediction Model. To determine the proper personalized warfarin dosage, a long line of work concluded with an algorithm based on an ordinary linear regression model [26]. The model is a function f : X → Y that aims at predicting targets of warfarin dose y ∈ Y given a set of patient inputs x ∈ X. We represent the patient dataset of each member as Di = {(xi, yi)}_{i=1}^{n}, and a loss function ℓ : Y × Y → [0, ∞). The loss function penalizes deviations between true doses and predictions. Learning is then searching for a dose model f minimizing the average loss:

L(D, f) = (1/n) Σ_{i=1}^{n} ℓ(f(xi), yi).    (1)

The dose model reduces to minimizing the average loss L(D, f) with respect to the parameters of the model f. The model is linear, i.e., f(x) = αx + β, and the loss function is the squared loss ℓ(f(x), y) = (f(x) − y)². The dose model gives results as good as or better than other more complex numerical methods and outperforms the fixed-dose approach² [26]. We re-implemented the algorithm in Python by direct translation from the authors' implementation and found that the accuracy of our implementation has no statistically significant difference.
²The model has been released online at http://www.warfarindosing.org to help doctors and other clinicians predict the ideal dose of warfarin.
Consortia and Member Policies. We define consortia among medical institutions in which they state partnerships for data exchange. Table 3 summarizes the consortia. The consortia are defined based on statutes and regulations between members, and regional and national partnerships are studied based on their countries [3, 17, 23, 34]. For example, the NATO allied medical support doctrine allows strategic relationships that are otherwise not obtainable by non-NATO members. Each member in a consortium exchanges data with
other members based on its CPL policy. Various acquisition and share policies of CPL are studied via conditionals and selections in Section 6. We note that policy construction is a subjective enterprise. Depending on the nature and constraints of a given environment, any number of policies is appropriate. Such is the promise of policy-defined behavior: alternate interpretations leading to other application requirements can be addressed through CPL.
5.2 Privacy-preserving Dose Prediction Model
The computation of the local dose model of a medical institution is straightforward: a member calculates the dose model through Equation 2 with the use of patient data collected locally. To implement a privacy-preserving dose model among consortia members of medical institutions, we define the dose prediction formula stated in Equation 1 in matrix form by minimizing with maximum likelihood estimation:

β = (X^T X)^{-1} X^T Y,    (2)

where X is the input matrix, Y is the dose matrix, and β is the vector of coefficients of the dose model.
Curie allows members to collaboratively learn a dose model without disclosing their patient records and guarantees that data sharing complies with the policy as negotiated. As shown in Equation 3, each member translates its negotiated data into neutral input matrices [41]. In particular, the patient samples to be exchanged by each member are computed as input matrices X1, . . . , Xn and dose matrices Y1, . . . , Yn. The transformation defines each member's local statistics Oi = Xi^T Xi and Vi = Xi^T Yi. The local statistics are the output of the negotiation of each member in a consortium. The aggregation of the local statistics corresponds to a negotiated dataset, which is the exact amount that a member negotiates to obtain from other members in a consortium. Curie constructs the dose algorithm of the negotiated dataset as a concatenation of the members' local statistics as follows:

X^T X = [X1 | . . . | Xn]^T [X1 | . . . | Xn] = Σ_{i=1}^{n} Xi^T Xi = Σ_{i=1}^{n} Oi = O
X^T Y = [X1 | . . . | Xn]^T [Y1 | . . . | Yn] = Σ_{i=1}^{n} Xi^T Yi = Σ_{i=1}^{n} Vi = V    (3)
In Equation 3, a member computes model coecients using the
sum of other members local statistics. The local statistics includes
m×m
constant matrices where
m
is the number inputs (independent
of number of dataset size). Using this observation, a party computes
the coecients of the negotiated dataset:
η(ne дot i at e d )=(XX)1XY=O1V(4)
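A short sketch illustrating Equations 3 and 4 (our illustration, on synthetic data): summing the members' local statistics reproduces the coefficients of a model fit on the stacked data.

# Sketch of Equations 3-4: each member contributes only its local statistics
# O_i = X_i^T X_i and V_i = X_i^T Y_i; their sums recover the coefficients of
# a model trained on the concatenated (negotiated) dataset.
import numpy as np

def local_statistics(X_i, Y_i):
    return X_i.T @ X_i, X_i.T @ Y_i

def coefficients_from_statistics(stats):
    O = sum(O_i for O_i, _ in stats)   # O = sum_i O_i (m x m)
    V = sum(V_i for _, V_i in stats)   # V = sum_i V_i
    return np.linalg.solve(O, V)       # eta = O^{-1} V (Equation 4)

rng = np.random.default_rng(1)
members = [(rng.normal(size=(50, 8)), rng.normal(size=(50, 1))) for _ in range(3)]
eta = coefficients_from_statistics([local_statistics(X, Y) for X, Y in members])
# Equivalent to fitting on the stacked data of all three members.
X_all = np.vstack([X for X, _ in members]); Y_all = np.vstack([Y for _, Y in members])
print(np.allclose(eta, np.linalg.solve(X_all.T @ X_all, X_all.T @ Y_all)))  # True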
In Equation 4, while the accuracy objective of the dose model is guaranteed using the coefficients obtained from the sum of local statistics, the exchange of clear statistics among parties may leak information about members' data. A member can infer knowledge about the distribution of each input of other members from the matrices Oi and Vi [14]. Furthermore, an adversary may sniff data traffic to control and modify exchanged messages. To solve these problems, we use homomorphic encryption (HE), which allows computation on ciphertexts [2]. HE allows members to perform the computation of a joint function without requiring additional communication beyond the data exchange. We note that HE itself cannot preserve the confidentiality of data from multiple parties in centralized settings [40]. However, Curie implements a distributed privacy-preserving multi-party dose model, as shown in Figure 4.
To illustrate, we consider an example session of n members authorized for data exchange in a consortium. In this example, a ring topology is used for secure group communication (i.e., Pi talks to Pi+1, and similarly Pn talks back to P1). P1 initially generates a pair of encryption keys using the homomorphic cryptosystem and broadcasts the public key to the members in its member list. P1 then generates random values O1 and V1 and encrypts them as E(O1)K1 and E(V1)K1 using its public key K1. It starts the session by sending them to the next member in the ring. When the next member receives the encrypted message, it adds its local V and O matrices, through homomorphic addition, to the output of its policy reconciliation for P1 and passes the result to the next member. The remaining members take similar steps. The secure computation executes one round per member, in which the computation for that member visits the other members. This allows Curie to enforce HE on the data shared with a particular member in each round and does not suffer the insecurities associated with centralized HE constructions [40].
At the nal stage of the protocol,
P1
receives the sum statistics
of
Oi
and
V
i
from
Pn
.
P1
decrypts the sum of the statistics using
its private key and then subtracts the initial random values of
V
i
,
Oi
and adds its true values used for computation of the local dose
model coecients. The nal result
O
and
V
is the coecients of the
dose model that respects
P1
’s policy negotiations. Other consortium
members similarly start the protocol and compute the coecients.
We present the security analysis of the dose protocol in Appendix C,
and show its dierentially-private extension in Appendix B.
6 EVALUATION
This section details the operation of Curie through policies. We show how flexible data exchange policies are implemented and operated. We focus on the following questions:
(1) What are the performance trade-offs in configuring CPL?
(2) Can members reliably use Curie to integrate various policies?
(3) Do members improve the accuracy of dose predictions with the use of CPL?
The answers to the first two questions are addressed in Section 6.1, and the last question is answered in Section 6.2. As detailed throughout, Curie allows 50 members to compute the privacy-preserving model using 5K data samples with 40 inputs in less than a minute. We also show how an algorithm with flexible data exchange policies can improve, often substantially, the accuracy of the warfarin dose model.
Experimental Setup. The experiments were performed on a cluster of machines with 32 GB of maximum memory and a 16-core Intel Xeon CPU at 1.90 GHz, where we use one core to get a lower-bound estimate. Each member is simulated in a server that stores its data. The secure computation protocols of Curie are implemented using the open-source HElib library [4]. We set the security parameter of HElib to 128 bits. The multiplication level is optimized per member to increase the number of allowed homomorphic operations without decryption failure and to reduce the computation time.
We validate the accuracy of the dose model in the various consortia defined in Table 3, with members defining different data exchange policies. The dataset used in our experiments contains 5700 patient records from 21 members. The dose model accuracy of each member
Figure 5: CPL negotiation cost. Total messages for policy agreement (log scale) for consortia of up to 25 members, shown for the P.2 (U.S., UK), P.3 (N. America, Europe, Asia), P.4 (NATO, EU), and P.5 (Global) consortia. Each member defines an asymmetric share and acquisition policy for other members. The number of members in the warfarin consortia is marked with red circles.
Figure 6: CPL selections and data-dependent conditional costs. Negotiation time in seconds for consortia of up to 25 members using the intersection size, Jaccard distance, Pearson correlation, and cosine similarity algorithms. All consortia members agree on a policy including a different data-dependent conditional and selections over one input having 200 samples.
Figure 7: CPL performance on the privacy-preserving and differentially private protocol: (a) time vs. number of members (5000 data samples, without key generation); (b) time vs. number of members (5000 data samples, with key generation); (c) time vs. number of data samples (20 members, without key generation); (d) time vs. number of data samples (20 members, with key generation), for input sizes of 8, 16, 24, 32, and 40. All members define an asymmetric share and acquisition policy through selections and conditionals. The agreements of CPL policies between consortia members are studied with different numbers of consortia members, data samples, and input sizes. (Std. dev. of ten runs is ±3.6 and ±0.3 sec. with and without homomorphic key generation, respectively.)
is validated with the Mean Absolute Percentage Error (MAPE). MAPE measures how far, in percentage terms, the predicted dosages are from the true dosages. Lower values indicate a better quality of treatment.
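For reference, MAPE can be computed as follows (a minimal sketch with hypothetical dose values):

# MAPE as used for dose-model validation: the mean percentage deviation of
# predicted dosages from the clinically deduced ones (lower is better).
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

print(mape(np.array([35.0, 42.0]), np.array([28.0, 45.5])))  # ~14.17 (percent)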
6.1 Performance Evaluation
We present the costs associated with the various Curie mechanisms. We illustrate the cost of CPL in policy negotiations, in the use of data-dependent conditionals, and in the dose algorithm.
6.1.1 CPL Benchmarks. Our first set of experiments characterizes the policy construction and negotiation costs. Various consortia and policies are instrumented to analyze the overhead in the number of messages and the time required to compute the CPL selections and data-dependent conditionals. All costs not specific to the policies (e.g., network latency) are excluded from the measurements. The benchmark results are summarized in Figures 5 and 6 and discussed below.
Figure 5 shows the number of messages required for policy construction for different consortia sizes. The number of members in the warfarin study is also labeled. For instance, the NATO consortium has 13 members: ten members from the U.S. and three from the UK. The experiments illustrate the upper-bound results wherein each member defines a different share and acquisition policy for other members (i.e., asymmetric relations). In this, each member sends an acquisition policy request to consortium members. After a member gets the acquisition request, it reconciles it with its share policy, and a negotiation output message is returned. Varying the number of selections and conditionals dictated by the members does not require any additional messages. For instance, the acquisition request of a member includes arguments when conditionals are defined (e.g., reference data and a threshold value for data-dependent conditionals such as pairwise Jaccard distance), and the result is returned with the negotiation output message. However, the use of selections and data-dependent conditionals brings additional processing cost, as detailed next.
Figure 6 shows the costs associated with the use of CPL selections and data-dependent conditionals. All members dictate data-dependent conditionals and selections on a single input. The members' input size for the data-dependent conditional computations is set to 200 real values, which is the average number of inputs found in members' datasets. Since selections and conditionals reconcile contradictions between acquisition and share policies, they do not require any additional computation overhead and yield a processing time of milliseconds. However, the time associated with the varying data-dependent conditionals depends on the protocol of the associated secure pairwise algorithm. In our experiments, cosine similarity and intersection size exhibited shorter computation times than Pearson correlation and Jaccard distance. Overall, we found that 25 members compute the metrics in less than 18 seconds. Note that the results serve as an upper bound in which all members define a set of selections and a data-dependent conditional on one input.
6.1.2 Dose Model Benchmarks. Our second series of experiments characterizes the impact of CPL on the average time of computing the privacy-preserving dose model with varying numbers of members and dataset sizes.
Figure 8: The implication of policies on model accuracy. Errors (%) are validated in various consortia through data exchange policies: (a) P.1 single source, no consortium (members treat their own patients; U.S. (M1), U.S. (M2), Brazil, UK, Israel, Taiwan, S. Korea); (b) cross-border, no consortium (members treat other members' patients); (c) P.2 nation-wide (U.S.) consortium (a single U.S. member (M1) vs. the nation-wide consortium treating patients); (d) P.3 regional consortium (same region vs. other regions); (e) P.4 NATO-EU consortium (members vs. non-members); (f) P.5 global consortium. In (c)-(f), the local acquisition policies of members comply with the sharing policy within a consortium (i.e., members acquire the complete data of the consortia members). Std. devs. of errors are within 5%, if not illustrated.
Though the warfarin study includes eight inputs, evaluations are repeated with input sizes of 8, 16, 24, 32, and 40 over various dataset sample sizes for completeness. The input and sample size together represent the total dataset shared with a member as a result of the policy agreements. Our experiments show that 80% of the computation overhead is attributed to HE key generation. The cost of differential privacy takes microseconds, as members can apply the (optional) differentially private algorithm at the end of the secure dose protocol. Computations are instrumented to classify the overheads incurred by key generation, encryption, decryption, and evaluation. We next present the costs with and without key generation to study the impact of the number of members and the data size.
Figure 7 (a-b) presents the computation cost with a varying number of members. Each member's dataset includes 5000 data samples, acquired as a result of the policy negotiations. Figure 7 (a) presents the total computation time excluding HE key generation. There is a linear increase in time with the growing number of members. This is the fundamental cost of encryption and evaluation operations, dominated by matrix encryption and addition. To profile the key generation cost, we conducted similar experiments in Figure 7 (b). The cost increases with input size because of the key generation overhead. The increase is quadratic, as the number of slots (plaintext elements) is set to the square of the input size so as not to lose any data during input conversion. It is important to note that this cost is independent of the number of members because a member generates the key only once in a computation of a consortium. We note that the time overhead of key generation is not a limiting factor, as members may generate keys before a consortium is established.
In Figure 7 (c-d), we show the costs associated with different numbers of data samples. The number of members in a consortium is set to 20. Similar to the previous experiments, key generation dominates the computation costs. Our experiments also reported no relationship between the cost and the number of samples. That is, even though the size of the data samples increases, the overhead is amortized over the operations on the local statistics of the computations (which are square matrices of the input size in the warfarin dataset); thus, the time of computing the dose algorithm converges to the number of dataset inputs. This explains the similar trends observed in the plots.
6.2 Eectiveness of Policies
We validate the performance of privacy-preserving dose model
quantitatively and qualitatively. For the warfarin study, these are
translated to the following questions: How do policies impact the
accuracy of members’ warfarin dose prediction? (Section 6.2.1), and
Does policies help to prevent the adverse impacts of dose errors on
patient health? (Section 6.2.2).
6.2.1 Implications of CPL on Model Accuracy. In our first set of experiments, we validate how well a member prescribes warfarin doses for its local patients and the patients of consortium members without using CPL. These results are used as a baseline for comparison across the varying consortia and data exchange policies throughout. Figure 8 (a) sought to identify the local algorithm errors (P.1). The errors differ significantly between countries and between members of the same country (depicted as M1 and M2 in the U.S.). The low results are due to having homogeneous data; all the inputs in these countries have similar traits. For instance, similar age and ethnicity found in a dataset produce over-fitted computation results for its local patients. These findings are validated with the use of local algorithms for the treatment of other countries' patients. As illustrated in Figure 8 (b), the dose errors are significantly high for particular countries' patients. The results indicate that improvements in the dose predictions for local patients and members' patients lie in the creation of data exchange policies that increase patient diversity.
The next experiments measure the impact of CPL in the nation-wide (P.2), regional (P.3), NATO-EU (P.4), and global (P.5) consortia. Each member creates a local acquisition policy to acquire the complete data of consortia members (i.e., the acquisition policy of a consortium member complies with the share policy of the requested member). We make three major observations. First, varying partnerships yield different dose accuracies. For instance, members of the nation-wide consortium get better dose accuracy than their local results. This result is validated through nation-wide consortia
Member | Agreement of policy negotiations
U.S. | {(Race="Asian") ∧ (EVALUATE(age)) ∧ (height < 160) ∧ (weight < 65) ∧ (CYP2C9 IN (*2/*2, *2/*3)) ∧ (Amiodarone="Y") ∧ (Enzyme="Y")}
Brazil | {(Race="Asian") ∧ (height < 165) ∧ (CYP2C9 IN (*2/*2, *2/*3)) ∧ EVALUATE(Amiodarone) ∧ (Enzyme="Y")}
UK | {(Race≠"White") ∧ (age BETWEEN 20-29 AND > 80) ∧ (height < 165) ∧ (60 < weight < 100) ∧ EVALUATE(CYP2C9) ∧ (Amiodarone="Y") ∧ (Enzyme="Y")}
Israel | {(Race≠"White") ∧ (height < 160 cm) ∧ (weight < 60) ∧ (CYP2C9=*3/*3) ∧ (Amiodarone="Y") ∧ (Enzyme Inducer="Y")}
Taiwan | {(Race=All) ∧ (age BETWEEN 20-29) ∧ (height > 170) ∧ (weight > 65) ∧ (CYP2C9 IN (*1/*2, *2/*2, *2/*3, *3/*3)) ∧ (VKORC1="G/G") ∧ (Amiodarone="Y") ∧ (Enzyme="Y")}
S. Korea | {(Race=All) ∧ (age BETWEEN 20-29) ∧ (height > 165) ∧ (weight > 60) ∧ (CYP2C9 IN (*1/*2, *2/*2, *2/*3, *3/*3)) ∧ (VKORC1="G/G") ∧ (Amiodarone="Y") ∧ (Enzyme="Y")}
Table 4: An exploration of CPL policies in the global consortium (illustrated in plain language): Each member defines an asymmetric local policy based on its data diversity. The agreement of share and acquisition policies is depicted as a policy clause in a single row. The agreement result of each member for other members is not presented for brevity.
and a single member (M1) in the United States (see Figure 8 (c)). Second, supporting the previous findings, all regional (excluding Asia) and NATO-EU policies decrease the error both for the treatment of their own patients and for the other countries' patients (see Figure 8 (d-e)). However, the Asia consortium results in unexpected dose errors for the treatment of other regions' patients. This is because the nation-wide, regional, and NATO-EU policies include patient populations having different characteristics; thus, the data obtained through policy negotiations generalizes better to the dosages. In contrast, the Asia collaboration lacks large enough White and Black groups. Third, the global consortium results in higher dose errors when evaluated for particular countries such as Brazil and Taiwan (see Figure 8 (f)). To conclude, while CPL is effective in reducing the dose error of a member, the results highlight the need for the systematic use of CPL through selections and conditionals to obtain better results.
In these experiments, each member dictates a dierent acquisi-
tion policy based on its racial groups. Members aim at having an
ideal patient population uniformity. To do so, each member denes
a local acquisition policy and negotiates it with other members.
Each member sets its share policy to conditionals of being in the
same consortium and data size greater than 200; thus, the policy of
each member is asymmetric. Table 4 shows the simplied notation
of the policy agreements in the global consortium. For instance, a
member having a small number of white patients denes selections
to solely acquire that group and a member having large enough pa-
tients for all genotypes sets data-dependent conditionals to obtain
patient inputs that are not similar in its data samples (e.g., acquires
dierent genotypes). Figure 9 presents a subset of results on dose
errors per patient race. The errors of the other races yield similar for
each member. The results without CPL conditionals and selections
are plotted as a dashed line for comparison. We nd that members
can improve the dose accuracy with the use of policies. We note
that the use of dierent data-dependent conditionals dened in
evaluate does not result in statistically signicant accuracy gain.
6.2.2 Implications of CPL on Patient Health. We examine the impact of the dose errors found in the previous section to better quantify the effectiveness of policies on patient health.
To identify the adverse effects of warfarin, we use a clinical study to evaluate the clinical relevance of prediction errors [9] and a medical guide to identify the consequences of over- and under-prescriptions [16]. We define errors that are inside and outside of the warfarin safety window, and the under- or over-prescriptions. We consider weekly errors for each patient because using weekly values eliminates the errors posed by the initial (daily) dose. A weekly dose is in the safety window if the estimated dose falls within 20% of its corresponding clinically deduced value [26, 27]. Deviations falling outside of the safety window are under- or over-prescriptions and cause health-related risks.
Figure 9: Dose accuracy of members using the CPL policies defined in Table 4, per patient race (Taiwan (Asian), S. Korea (Asian), S. Korea (White), Brazil (Asian), U.S. (Asian), UK (Asian), Israel (Black)). Members construct a model per race after they reconcile the policies. The dashed line is the average error found without the use of conditionals and selections in policies.
Consortium | U | SW | O | Selections | Conditionals
Single Source | 37.7% | 43.4% | 18.8% | ✗ | ✗
Nation-wide | 18.9% | 52.3% | 28.8% | ✓ | ✓
NATO | 19.3% | 51.5% | 29.2% | ✓ | ✓
Regional | 19% | 51.3% | 29.7% | ✓ | ✓
Global | 21.2% | 46.8% | 32% | ✓ | ✓
Table 5: Impact of policies on health-related risks. Results are for global consortium patients using the policy agreement of a member located in the U.S. The member uses the policy defined in Table 4. (U: under-prescription, SW: safety window, O: over-prescription.)
Table 5 presents the percentage of patients falling in the safety window and of over- and under-prescriptions under varying policies of a member. We find that the use of CPL increases the number of patients in the safety window. For instance, a member covers 43.4% of patients using its local data (single source model) and increases the percentage of patients in the safety window with varying consortia and policies; for instance, it reaches 52.3% in the nation-wide consortium. We conclude that CPL might be useful in preventing errors that introduce health-related risks.
7 LIMITATIONS AND DISCUSSION
One requirement for correctly interpreting CPL policies is a shared schema that solves compatibility issues among members. For instance, members may interpret data columns (e.g., column names and types) differently or may lack information about consortium members (e.g., the membership status of an alliance). CPL implements a shared schema describing column names, their types, and explanations of data fields as well as consortium-specific
information. Members can negotiate the schema similarly to the policy negotiations and revise it based on the schema of a negotiation initiator.
CPL provides a set of data-dependent statistical functions (e.g., cosine similarity) to compute pairwise statistics among members' local data. However, there might be a need for other functions that help members decide their data exchange policies. For example, data exchange among finance companies may require calculating the similarity between data distributions. Future work will investigate the integration of different data-dependent statistics into CPL.
Lastly, we did not focus on the reasons why policies impact the prediction success of the dose algorithm and its adverse outcomes on patient health over time. While our evaluation results showed that members could express both complex relations and constraints on data exchange through CPL policies, members need to establish true partnerships to improve the prediction model accuracy. While this explanation matches both our intuition and the experimental results, a further domain-specific formal analysis is needed. We plan to pursue this in future work.
8 RELATED WORK
Policy has been used in several contexts as a vehicle for representing configuration of secure groups [30], network management [35], threat mitigation [18], access control [13], and data retrieval systems [15]. These approaches define a schema for their target problem and do not consider the challenges in secure data exchange. In contrast, Curie defines a formal policy language to dictate the data exchange requirements of members and enforces the agreement in collaborative ML settings.
On the other hand, secure computation on sensitive proprietary data has recently attracted attention. Federated learning [20, 37], anonymization [14], multi-site statistical models [10], secure multi-party computation [28], and secure and differentially-private multi-party computation [1] have started to shed light on this issue. Such techniques have been used for both the training and classification phases in deep learning [36], clustering [22], and decision trees [8]. To allow programmers to develop such applications, secure computation programming frameworks and languages have been designed for general purposes [7, 14, 24, 32, 33]. However, these approaches do not consider complex relationships among members and assume members share all of their data or nothing. We view our efforts in this paper as complementary to much of this work. CPL can be integrated into these frameworks to establish partnerships and manage data exchange policies before a computation starts.
9 CONCLUSIONS
We presented Curie, which provides a novel policy language called CPL to define the specifications of data exchange requirements securely for use in collaborative learning settings. Members can assert who and what to exchange separately for data sharing and data acquisition policies. This allows members to efficiently dictate their policies in complex and asymmetric relationships through selections, conditionals, and pairwise data-dependent statistics. We validated Curie in an example real-world healthcare application through varying policies of consortia members. A secure multi-party and (optionally) differentially-private model is implemented to illustrate the policy/performance trade-offs. Curie allowed 50 different members to efficiently compute a privacy-preserving model using 5K data samples with 40 inputs in less than a minute. We also showed how an algorithm with effective use of data exchange policies could improve the accuracy of the dose prediction model.
Future work will investigate the use of Curie in other collaborative learning settings, exploring different statistics for data-dependent conditionals, and will explore its performance trade-offs by integrating it into other off-the-shelf secure computation frameworks.
ACKNOWLEDGMENT
Research was sponsored by the Army Research Laboratory and
was accomplished under Cooperative Agreement Number W911NF-
13-2-0045 (ARL Cyber Security CRA). This work is also partially
supported by US National Science Foundation (NSF) under the grant
numbers NSF-CNS-1718116 and NSF-CAREER-CNS-1453647. The
statements made herein are solely the responsibility of the authors.
The views and conclusions contained in this document are those
of the authors and should not be interpreted as representing the
ocial policies, either expressed or implied, of the Army Research
Laboratory or the U.S. Government. The U.S. Government is autho-
rized to reproduce and distribute reprints for Government purposes
notwithstanding any copyright notation here on.
REFERENCES
[1] Abbas Acar et al. 2017. Achieving Secure and Differentially Private Computations in Multiparty Settings. In IEEE Privacy-Aware Computing (PAC).
[2] Abbas Acar, Hidayet Aksu, A. Selcuk Uluagac, and Mauro Conti. 2017. A Survey on Homomorphic Encryption Schemes: Theory and Implementation. CoRR abs/1704.03578 (2017). arXiv:1704.03578 http://arxiv.org/abs/1704.03578
[3] American Recovery and Reinvestment Act of 2009. 2017. https://en.wikipedia.org/wiki/American_Recovery_and_Reinvestment_Act_of_2009. [Online; accessed 01-June-2018].
[4] An Implementation of Homomorphic Encryption. 2017. https://github.com/shaih/HElib. [Online; accessed 01-January-2017].
[5] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. 2018. Large Scale Distributed Neural Network Training through Online Distillation. arXiv preprint arXiv:1804.03235 (2018).
[6] Carlo Blundo et al. 2013. EsPRESSo: Efficient Privacy-preserving Evaluation of Sample Set Similarity. In Data Privacy Management Security.
[7] Dan Bogdanov et al. 2016. Rmind: A Tool for Cryptographically Secure Statistical Analysis. IEEE Transactions on Dependable and Secure Computing (2016).
[8] Raphael Bost, Raluca Ada Popa, Stephen Tu, and Shafi Goldwasser. 2015. Machine Learning Classification over Encrypted Data. In NDSS.
[9] Z. Berkay Celik, David Lopez-Paz, and Patrick McDaniel. 2016. Patient-Driven Privacy Control through Generalized Distillation. IEEE Symposium on Privacy-Aware Computing (2016).
[10] Fida K Dankar. 2015. Privacy Preserving Linear Regression on Distributed Databases. Transactions on Data Privacy (2015).
[11] Emiliano De Cristofaro, Paolo Gasti, and Gene Tsudik. 2012. Fast and Private Computation of Cardinality of Set Intersection and Union. In Cryptology and Network Security.
[12] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. 2012. Large Scale Distributed Deep Networks. In NIPS.
[13] Li Duan, Yang Zhang, Chen, et al. 2016. Automated Policy Combination for Secure Data Sharing in Cross-Organizational Collaborations. IEEE Access (2016).
[14] Khaled El Emam et al. 2013. A Secure Distributed Logistic Regression Protocol for the Detection of Rare Adverse Drug Events. American Medical Informatics (2013).
[15] Eslam Elnikety et al. 2016. Thoth: Comprehensive Policy Compliance in Data Retrieval Systems. In USENIX Security.
[16] U.S. Food and Drug Administration. 2017. Medication Guide, Coumadin (Warfarin Sodium). http://www.fda.gov. [Online; accessed 01-June-2018].
[17] NATO Standard Allied Joint Doctrine for Medical Support. 2017. http://www.nato.int. [Online; accessed 01-June-2018].
[18] Julien Freudiger, Emiliano De Cristofaro, and Alejandro E Brito. 2015. Controlled Data Sharing for Collaborative Predictive Blacklisting. In DIMVA.
[19] Roberto Garrido-Pelaz et al. 2016. Shall We Collaborate?: A Model to Analyse the Benefits of Information Sharing. In ACM Workshop on Information Sharing and Collaborative Security.
[20] Robin C Geyer, Tassilo Klein, and Moin Nabi. 2017. Differentially Private Federated Learning: A Client Level Perspective. arXiv preprint arXiv:1712.07557 (2017).
[21] Oded Goldreich. 2009. Foundations of Cryptography: Basic Applications. Cambridge University Press.
[22] Thore Graepel, Kristin Lauter, and Michael Naehrig. 2012. ML Confidential: Machine Learning on Encrypted Data. In Information Security and Cryptology.
[23] Health Information Technology for Economic and Clinical Health Act. 2017. https://en.wikipedia.org. [Online; accessed 01-June-2018].
[24] Wilko Henecka et al. 2010. TASTY: Tool for Automating Secure Two-party Computations. In ACM CCS.
[25] Yan Huang et al. 2011. Faster Secure Two-Party Computation Using Garbled Circuits. In USENIX Security Symposium.
[26] International Warfarin Pharmacogenetics Consortium. 2009. Estimation of the Warfarin Dose with Clinical and Pharmacogenetic Data. The New England Journal of Medicine (2009).
[27] Stephen E Kimmel et al. 2013. A Pharmacogenetic versus a Clinical Algorithm for Warfarin Dosing. New England Journal of Medicine (2013).
[28] Yehuda Lindell and Benny Pinkas. 2009. Secure Multiparty Computation for Privacy-preserving Data Mining. Journal of Privacy and Confidentiality (2009).
[29] Chang Liu et al. 2015. ObliVM: A Programming Framework for Secure Computation. In Security and Privacy.
[30] Patrick McDaniel and Atul Prakash. 2006. Methods and Limitations of Security Policy Reconciliation. ACM TISSEC (2006).
[31] Payman Mohassel and Yupeng Zhang. 2017. SecureML: A System for Scalable Privacy-preserving Machine Learning. In Security and Privacy (SP).
[32] Olga Ohrimenko et al. 2016. Oblivious Multi-Party Machine Learning on Trusted Processors. In USENIX Security Symposium.
[33] Aseem Rastogi et al. 2014. Wysteria: A Programming Language for Generic, Mixed-mode Multiparty Computations. In IEEE Security and Privacy (SP).
[34] European Commission Report. 2017. Overview of the National Laws on Electronic Health Records in the EU Member States. http://ec.europa.eu. [Online; accessed 01-June-2018].
[35] Ana C Riekstin et al. 2016. Orchestration of Energy Efficiency Capabilities in Networks. Journal of Network and Computer Applications (2016).
[36] Reza Shokri et al. 2015. Privacy-preserving Deep Learning. In ACM CCS.
[37] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. 2017. Federated Multi-Task Learning. In NIPS.
[38] Daniel J Solove and Paul M Schwartz. 2015. Information Privacy Law. Aspen.
[39] Marten Van Dijk and Ari Juels. 2010. On the Impossibility of Cryptography Alone for Privacy-preserving Cloud Computing. HotSec (2010).
[40] Marten Van Dijk and Ari Juels. 2010. On the Impossibility of Cryptography Alone for Privacy-preserving Cloud Computing. In USENIX Hot Topics in Security.
[41] Fang-Jing Wu, Yu-Fen Kao, et al. 2011. From Wireless Sensor Networks Towards Cyber Physical Systems. Pervasive and Mobile Computing (2011).
[42] Xi Wu et al. 2015. Revisiting Differentially Private Regression: Lessons from Learning Theory and their Consequences. arXiv:1512.06388 (2015).
[43] Jun Zhang et al. 2012. Functional Mechanism: Regression Analysis under Differential Privacy. VLDB (2012).
A CURIE POLICY LANGUAGE
This section presents the Backus-Naur Form of the Curie data exchange policy language.
curie_policy ::= ⟨statements⟩
statements ::= ⟨statement⟩ ';' [⟨statements⟩]
statement ::= ⟨share_clause⟩
            | ⟨acquire_clause⟩
            | ⟨attribute⟩
            | ⟨sub_clause⟩

; share clauses defined as follows:
share_clause ::= 'share' ':' [⟨members⟩] ':' [⟨conditionals⟩] '::' ⟨selections⟩

; acquisition clauses defined as follows:
acquire_clause ::= 'acquire' ':' [⟨members⟩] ':' [⟨conditionals⟩] '::' ⟨selections⟩

; attributes are defined as follows:
attribute ::= ⟨identifier⟩ ':=' ⟨value⟩
            | ⟨identifier⟩ ':=' ⟨value_list⟩

; user-defined sub-clauses defined as follows:
sub_clause ::= ⟨tag⟩ ':' [⟨conditionals⟩] '::' ⟨selections⟩

; conditionals, including data-dependent functions, defined as follows:
conditionals ::= ⟨var⟩ '=' ⟨value⟩ [',' ⟨conditionals⟩]
               | 'evaluate' '(' ⟨data_ref⟩ ',' ⟨alg_arg⟩ ',' ⟨threshold_arg⟩ ')' [',' ⟨conditionals⟩]
               | ''
selections ::= ⟨filters⟩
             | ⟨tag⟩
filters ::= ⟨filter⟩ [',' ⟨filters⟩]
filter ::= ⟨var⟩ ⟨operation⟩ ⟨value⟩ | ''
data_ref ::= '&' ⟨identifier⟩
alg_arg ::= ⟨algorithms⟩
algorithms ::= 'Intersection size'
             | 'Jaccard index'
             | 'Pearson correlation'
             | 'Cosine similarity'
threshold_arg ::= ⟨floating_point_number⟩
operation ::= '=' | '<' | '>' | '!=' | 'in' | ''
value_list ::= '{' ⟨value⟩ '}' [',' ⟨value_list⟩]
members ::= ⟨member⟩ [',' ⟨members⟩]
member ::= ⟨identifier⟩ | ''

; for completeness, trivial items defined as follows:
identifier ::= ⟨word⟩
var ::= '$' ⟨identifier⟩
value ::= ⟨string⟩
tag ::= ⟨word⟩
string ::= '"' ⟨stringchars⟩ '"'
stringchars ::= ⟨stringletter⟩ [⟨stringchars⟩]
stringletter ::= 0x10 | 0x13 | 0x20 | ... | 0x7F
word ::= ⟨char⟩ [⟨word⟩]
char ::= ⟨letter⟩ | ⟨digit⟩
letter ::= 'A' | 'B' | ... | 'Z' | 'a' | 'b' | ... | 'z' | 0x80 | 0x81 | ... | 0xFF
digit ::= '0' | '1' | ... | '9'
floating_point_number ::= ⟨decimal_number⟩ '.' [⟨decimal_number⟩]
decimal_number ::= ⟨digit⟩ [⟨decimal_number⟩]
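To illustrate the grammar, the following hypothetical policy exercises the main productions; all identifiers and values (version, whites, hospitalB, hospitalC, &patients, $consortium, $race, $age) are illustrative, not part of the language:

    version := "1.0";
    whites : :: $race = "white";
    share : hospitalB, hospitalC : $consortium = "nato" :: whites;
    acquire : hospitalB : evaluate(&patients, Jaccard index, 0.5) :: $age > "50";

The first statement is an attribute, the second a tagged sub-clause that is reused as the selection of the share clause, and the last an acquisition clause combining a data-dependent conditional with a filter selection.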
B DIFFERENTIALLY-PRIVATE DOSE ALGORITHM
We presented how members compute a privacy-preserving dose model on negotiated data through their policies. In this section, we consider individual privacy, which allows a member to guarantee no information leakage about a targeted individual (i.e., a patient) involved in the computation. Specifically, while members compute a secure dose model using the data obtained as a result of their policy negotiations, they also ensure that an adversary cannot infer whether any particular individual is included in the computations that build the dose algorithm. In the warfarin study, this corresponds to a differentially-private secure dose algorithm on shared data.
To implement a differentially-private secure algorithm, we use the functional mechanism technique [42, 43]. The technique accepts a dataset (D), an objective function (f_D(η)), and a privacy budget (ϵ) as input and returns ϵ-differentially-private coefficients η̄ of an algorithm. The intuition behind the functional mechanism is perturbing the objective function of the optimization problem. Perturbation includes both sensitivity analysis and Laplacian noise insertion, as opposed to perturbing the results via differentially-private synthetic data generation.
To inter-operate the functional mechanism with the secure dose protocol, members convert each column from [min, max] to [-1, 1] before negotiation starts. This preprocessing ensures that sufficient noise is added to the objective function on negotiated data. Then, members proceed with the protocol. At the final stage of the secure algorithm protocol, a member gets the clear statistics O = X^T X and V = X^T Y and the input dimension d, which is the size of O or V. These statistics are the exact quantities that are minimized in the objective of the functional mechanism [43]. Using these statistics, a member may (optionally) compute an ϵ-differentially-private secure algorithm without any additional data exchange or computational overhead.
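A minimal sketch of this step follows; it is ours, not the paper's implementation. It perturbs O and V with Laplace noise scaled by the linear-regression sensitivity derived in [43] and minimizes the perturbed objective, assuming inputs were already rescaled to [-1, 1] as described above:

    import numpy as np

    def dp_dose_coefficients(O, V, eps, rng=None):
        # Functional mechanism sketch: noise the clear statistics O = X^T X
        # and V = X^T Y, then minimize the perturbed quadratic objective
        rng = rng or np.random.default_rng()
        d = O.shape[0]
        delta = 2.0 * (1 + 2 * d + d ** 2)  # sensitivity for linear regression [43]
        O_noisy = O + rng.laplace(0.0, delta / eps, size=O.shape)
        V_noisy = V + rng.laplace(0.0, delta / eps, size=V.shape)
        O_noisy = (O_noisy + O_noisy.T) / 2.0  # keep the perturbed objective symmetric
        # The coefficients minimizing eta^T O eta - 2 V^T eta satisfy O eta = V
        eta, *_ = np.linalg.lstsq(O_noisy, V_noisy, rcond=None)
        return eta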
[Figure 1 plot: error (%) versus privacy budget ϵ for the Global, NATO, and Single Source consortia, each with DP and Non-DP variants.]
Figure 1: Non-private secure algorithm (Non-DP) vs. differentially-private secure algorithm (DP) performance of a member in the U.S. measured against various policies depicted in Figure 8.
Dierential Privacy Results.
To protect individual privacy in
secure dose algorithm, members may compute the dierentially-
private secure algorithm on their negotiated data. This section
presents the results of using the dierential-private secure algo-
rithm (DP) instead of using secure dose algorithm (Non-DP). To
establish a baseline performance, we constructed non-private se-
cure algorithms of a member. We then build the dierential-private
secure algorithm for dierent privacy budgets (
ϵ
= 0.25, 1, 5, 20, 50
and 100). Finally, we compare the results of two algorithms through
dierent policies of a member. Figure 1 shows the results of a mem-
ber in the U.S. that applies both algorithms to predict the dosage.
The algorithms are constructed for the single source, NATO, and
global consortia. In this, the member dictates acquisition policy
for complete data and other members complies with their share
policy. The average error over 100 distinct model for each budget
value is reported. The use of DP degrades the accuracy as the
ϵ
value increases. For instance, the accuracy improvement obtained
through NATO policy over single source degrades with the privacy
budget less than or equal to 20. We note that other consortia and
policies with use of selections and conditionals show similar eect
on the dose accuracy.
C ANALYSIS OF THE DOSE ALGORITHM
We present the security and privacy guarantees of the dose algorithm provided to all members through the exchange of encrypted integrated statistics (the O_i = X^T X and V_i = X^T Y matrices). Since all data exchange among parties is encrypted through the use of HE, the security of the algorithm against any adversary outside the authorized parties rests on the underlying HE cryptosystem.
An adversary not involving the session initiator. Assume for now that the session initiator does not collude with other parties. Loosely speaking, since all computations are performed on the encrypted data, none of the parties learns anything about the other parties' inputs. We consider a party P_{i+1} in Figure 4. The party P_{i+1} has the public key K generated by the session initiator and the encryption of the local statistics of the previous parties, M_i = (E(O_i)_K, E(V_i)_K). Its input is (V_{i+1}, O_{i+1}) and its output is M_{i+1} = (E(O_i + O_{i+1}), E(V_i + V_{i+1})). A simulator S selects random values for its own inputs (V'_{i+1}, O'_{i+1}) and encrypts them using the public key published by the session initiator. Then, the simulator S performs the homomorphic operation on the received message M_i and outputs M'_{i+1} = (E(O_i + O'_{i+1})_K, E(V_i + V'_{i+1})_K). Here, we assume the underlying HE is semantically secure. Therefore, the output of the simulator M'_{i+1} is computationally indistinguishable from the output of the real execution of the protocol, M_{i+1}, for every input pair. Therefore, using the definition in [21], the protocol privately computes the function in the presence of one semi-honest corrupted party. The extension to multiple corrupted semi-honest adversaries is straightforward, as the only difference is the view of a subset of parties holding many encrypted messages. Since the semantic security of the underlying HE holds for any pair of these many encrypted messages, no information leaks about the corresponding plaintexts.
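As a concrete illustration of one aggregation hop, the sketch below uses the python-paillier library for additively homomorphic encryption; the paper's implementation builds on HElib [4], so the library, names, and key size here are illustrative assumptions only:

    import numpy as np
    from phe import paillier  # python-paillier: additively homomorphic

    public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

    def encrypt_stats(stats, pub):
        # Session initiator encrypts each entry of its flattened local
        # statistics (e.g., O_1 = X^T X) under its public key K
        return [pub.encrypt(float(v)) for v in stats]

    def aggregate_hop(M_prev, local_stats):
        # Party P_{i+1} homomorphically adds its plaintext statistics to the
        # running encrypted sums; it never decrypts and learns nothing
        return [c + float(v) for c, v in zip(M_prev, local_stats)]

    M = encrypt_stats(np.random.rand(4), public_key)  # initiator's statistics
    M = aggregate_hop(M, np.random.rand(4))           # party P_2's contribution
    totals = [private_key.decrypt(c) for c in M]      # only the initiator decrypts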
Adversary involving the session initiator. We consider the case when the session initiator is corrupted. The corrupted parties, including the session initiator, can infer the input of an honest party if the predecessor (previous party) and successor (next party) of the honest party are both corrupted. We consider the possible cases for data leakage: (1) 2-party: The session initiator is corrupted and the other party is honest. In this case, the predecessor and successor of the honest party are both the corrupted session initiator. Therefore, the input of the honest party is learned by the corrupted party. (2) 3-party: A corrupted session initiator is either the predecessor or the successor; thus, it can learn the input of one of the honest parties only if another party is corrupted. (3) n-party (n > 3): To learn an honest party's input, at least two parties must be corrupted and placed immediately before and after the honest party.
While the individual raw data of members does not leak, the risk of inappropriate disclosures from local summary statistics exists in some extreme cases [14]. Consider the exchange of the plain matrix V_i = X^T Y among two parties; a party may use extreme values found in V_i to identify particular patients. For instance, in the dose algorithm, taking inducers such as Rifadin and Dilantin could indicate high dose prescriptions. If the values of V_i are high, then a party may infer a patient that takes enzyme inducers and the presence of high-dosage warfarin intake. Similarly, the exchange of O_i = X^T X may leak information about the number of observations and reveal the number of 0s or 1s in a column. For instance, for the former, the first entry of the matrix X^T X gives the total number of patients; for the latter, (X^T X)_{j,j} gives the number of 1s in column j. This type of information lets a party infer knowledge, particularly when binary inputs (e.g., use of the medicine) are used.
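A toy computation makes this leakage concrete; the values below are fabricated purely for illustration, and the first column is assumed to be the intercept:

    import numpy as np

    X = np.array([[1, 1, 0],   # intercept, takes Rifadin, takes Dilantin
                  [1, 0, 1],
                  [1, 1, 1],
                  [1, 0, 0]])
    O = X.T @ X
    print(O[0, 0])  # 4 -> total number of patients
    print(O[1, 1])  # 2 -> number of patients taking Rifadin
    print(O[2, 2])  # 2 -> number of patients taking Dilantin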
D CURIE DEPLOYMENT DETAILS
We use a dataset collected by the International Warfarin Pharmacogenetics Consortium (IWPC), to date the most comprehensive database containing patient data, collected from 24 medical institutions from 9 countries [26]. The dataset does not include the names of the medical institutions, yet a separate ethnicity dataset is provided for identifying the genomic impacts of the algorithm. We use the race (reported by patients) and race categories (defined by the Office of Management and Budget) to predict the country of a patient.³ For instance, we consider a medical institution with a high number of patients of Japanese race to be located in Japan. We use subsets of patient records that have no missing inputs for accurate evaluation. We split the dataset into two cohorts: the training cohort is used to learn the algorithm, and the validation cohort is used to assign doses to new patients based on the consortia and data exchange policies.
³ The authors indicated via personal communication that they cannot provide the exact names of the institutions due to privacy concerns.