Curie: Policy-based Secure Data Exchange
Z. Berkay Celik
SIIS Laboratory, Department of CSE
The Pennsylvania State University
zbc102@cse.psu.edu
Abbas Acar, Hidayet Aksu
CPS Security Lab, Department of ECE
Florida International University
aacar001,haksu@u.edu
Ryan Sheatsley
SIIS Laboratory, Department of CSE
The Pennsylvania State University
rms5643@cse.psu.edu
Patrick McDaniel
SIIS Laboratory, Department of CSE
The Pennsylvania State University
mcdaniel@cse.psu.edu
A. Selcuk Uluagac
CPS Security Lab, Department of ECE
Florida International University
suluagac@u.edu
ABSTRACT
Data sharing among partners—users, companies, organizations—is crucial for the advancement of collaborative machine learning in many domains such as healthcare, finance, and security. Sharing through secure computation and other means allows these partners to perform privacy-preserving computations on their private data in controlled ways. However, in reality, there exist complex relationships among members (partners). Politics, regulations, interest, trust, and data demands and needs prevent members from sharing their complete data. Thus, there is a need for a mechanism to meet these conflicting relationships on data sharing. This paper presents Curie¹, an approach to exchange data among members who have complex relationships. A novel policy language, CPL, that allows members to define the specifications of data exchange requirements is introduced. With CPL, members can easily assert who and what to exchange through their local policies and negotiate a global sharing agreement. The agreement is implemented in a distributed privacy-preserving model that guarantees sharing among members will comply with the policy as negotiated. The use of Curie is validated through an example healthcare application built on recently introduced secure multi-party computation and differential privacy frameworks, and policy and performance trade-offs are explored.
CCS CONCEPTS
• Information systems → Data exchange; • Security and privacy → Economics of security and privacy.
KEYWORDS
Collaborative learning; policy language; secure data exchange
1 INTRODUCTION
Inter-organizational data sharing is crucial to the advancement of many domains including security, health care, and finance. Previous works have shown the benefit of data sharing within distributed,
¹ Our paper is named after Marie Curie, a physicist and chemist who conducted pioneering research in health care and won the Nobel Prize twice.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CODASPY ’19, March 25–27, 2019, Richardson, TX, USA
ACM ISBN 978-1-4503-6099-9/19/03...$15.00
https://doi.org/10.1145/3292006.3300042

Figure 1: An illustration of data exchange requirements of countries (U.S.1, U.S.2, UK, RU) learning a predictive model on their shared data. Arrows show the data requirements of countries.

collaborative, and federated learning [5, 12, 37]. Privacy-preserving machine learning offers data sharing among multiple members while avoiding the risks of disclosing the sensitive data (e.g., healthcare records, personally identifiable information) [14]. For example, secure multiparty computation enables multiple members, each with its training dataset, to collaboratively learn a shared predictive model without revealing their datasets [31]. These approaches solve the privacy concerns of members during model computation, yet do not consider the complex relationships such as regulations, competitive advantage, data sovereignty, and jurisdiction among members on private data sharing. Members want to be able to articulate and enforce their conflicting requirements on data sharing.

To illustrate such complex data sharing requirements, consider health care organizations that collaborate for a joint prediction model of diagnosis of patients experiencing blood clots (see Figure 1). Members wish to dictate their needs through their legal and political limitations as follows: U.S.1 is able to share its complete data with nation-wide members (U.S.2) [3, 23], yet it is obliged to share the data of patients deployed in NATO countries with NATO members (UK) [17]. However, U.S.1 wishes to acquire all patient data from other countries. UK is able to share and acquire complete data from NATO members, yet it desires to acquire only data of certain race groups from U.S.1 to increase its data diversity. RU wishes to share and acquire complete data from all members, yet members limit their data share to Russian citizens who live in their countries.
Such complex data sharing requirements also commonly occur today in non-healthcare systems [28, 38]. For instance, the National Security Agency has varying restrictions on how human intelligence is shared with other countries; financial companies share data based on trust and competition among each other.

This paper presents a policy-based data exchange approach, called Curie, that allows secure data exchange among members that have such complex relationships. Members specify their requirements on data exchange using a policy language (CPL). The requirements defined with the use of CPL form the local data exchange policies of members. Local policies are defined separately for data sharing and data acquisition. This property allows asymmetric relations on data exchange. For example, a member does not necessarily have to acquire the data that the other members dictate to share. By using these two policies, members specify statements of who to share/acquire and what to share/acquire. The statements are defined using conditional and selection expressions. Selections allow members to filter data and limit the data to be exchanged, whereas conditional expressions allow members to define logical statements. Another advanced property of CPL is predefined data-dependent conditionals for calculating statistical metrics between members' data. For instance, members can define a conditional to compute the intersection size of data columns without disclosing their data. This allows members to define content-dependent conditional data exchange in their policies. Once members have defined their local policies, they negotiate a sharing agreement. The guarantee provided by Curie is that all data exchanged among members will respect the agreement.
The agreement is executed in a multi-party privacy-preserving prediction model enhanced with optional differential privacy guarantees. In this work, we make the following contributions:
- We introduce Curie, an approach for secure data exchange among members that have complex relationships. Curie includes the CPL policy language, allowing members to define complex specifications of data exchange requirements, negotiate an agreement, and execute agreements in a multi-party predictive model that respects the negotiated policy.
- We validate Curie through an example of a real healthcare application used to prescribe warfarin dosage. A privacy-preserving joint dose model among medical institutions is compiled with the use of various data exchange policies while protecting the privacy of members' healthcare records.
- We show Curie incurs low overhead and policies are effective at improving the dose accuracy of medical institutions.
We begin in the next section by defining the analysis task and outlining the security and attacker models.

2 PROBLEM SCOPE AND ATTACKER MODEL
Problem Scope. We introduce the Curie Policy Language (CPL) to express data exchange requirements of distributed members. Unlike the programming languages used for writing secure multi-party computation (MPC) [24, 33] and the frameworks designed for privacy-preserving machine learning (ML) [7, 14, 29, 31, 32], CPL is a policy language in Backus-Naur Form (BNF) notation to express the conflicting relationships of members on data sharing. Members can express data exchange requirements using conditionals, selections, and secure pairwise data-dependent statistics. Curie then enforces the policy agreements in a shared predictive model through an MPC protocol that ensures members comply with the policies as negotiated. We integrate Curie into 24 medical institutions.
Without deployment of Curie, institutions compute the warfarin dosage of a patient using a model computed on their local patient records. Curie allows institutions to construct various consortia wherein each member defines a data exchange policy for other members via CPL. This enables institutions to acquire the patient records based on regulations as well as the records that they need to improve the accuracy of their dose predictions. Curie implements a privacy-preserving dose model through homomorphic encryption (HE) to enforce the policy agreements of the members. We note that a centralized party in HE cannot provide a privacy-preserving model on negotiated data [39]. However, Curie implements a novel protocol that allows institutions to perform local computations by aggregating the intermediate results of the dose model. Additionally, Curie implements an optional differential privacy (DP) mechanism that allows institutions to compute a differentially-private secure dose model. DP guarantees that no information leaks about the targeted individual (i.e., patient) with high confidence from the released dose model.

Threat Model. We consider a semi-honest adversary model. That is, members in a consortium run the protocol exactly as specified, yet they try to learn the dataset inputs of the other members as much as possible from their views of the protocol. Additionally, we consider a non-adaptive adversary wherein members cannot modify inputs of their dataset once the protocol on shared data is initiated.

3 ORGANIZATIONAL DATA EXCHANGE
Depicted in Figure 2, Curie includes two independent parts: policy management and multiparty secure computation.

Figure 2: Curie data exchange process in a collaborative learning setting. The dashed boxes show data remains confidential.

Policy Management.
We dene a consortium that is a group made up of two or more members–individuals, companies or governments ( a ). Members of a consortium aim to compute a predictive model m over their condential data in a secure manner. For instance, data may be curated from medical history of patients or nancial reports of companies with the objective of building an ML model. Moreover, each member wants to enforce a set of local constraints toward other consortium members to control their requirements on how and with whom they share their condential data. These constraints dene a member’s interest, trust, regulations and data demands, and also impacts the accuracy of a model m . Thus, there is a need for connecting data needs of members to the privacy-preserving models. In Curie, each member of a consortium denes a local policy ( b ). The local policy of a member dictates the requirements of data exchange as follows: (1) The member wishes to specify with whom to share and acquire data (partnership requirement). (2) The member wishes to dene what data to share and acquire (sharing and acquisition requirement). In this, the member wishes to rene its sharing and acquisition requirements to express the following: (1) The member wishes to dictate a set of conditions to restrict data sharing and select which data to be acquired (conditional selective share and acquisition); and 2 Session 3: Data Security and Privacy CODASPY ’19, March 25–27, 2019, Richardson, TX, USA 122 (2) The member wishes to dictate conditionals based on the other member’s data (data-dependent conditionals). The policy of members need not be-nor are likely to be-symmetric. Local policy is dened with requirements for sharing and acquisi- tion that is tailored to each partner member in the consortium–thus allowing each pairwise sharing to be unique. Here, the local poli- cies are used to negotiate pairwise sharing within the consortium. 
To illustrate how members negotiate an agreement, consider the consortium of three members in Figure 3.

Figure 3: An example consortium of three members.

Each member initiates pairwise policy negotiations with other members to reconcile contradictions between acquisition and share policies (c). A member starts the negotiation by sending a request message including the acquisition policy defined for a member. When a member receives the acquisition policy, it reconciles the received acquisition policy with its share policy specified for that member. Three negotiation outcomes are possible: the acquisition policy is entirely satisfied, partially satisfied with the intersection of acquisition and share policies, or is an empty set. A member completes its negotiations after all of its acquisition policies for interested parties are negotiated.

Computations on Negotiated Data. Once members negotiate their policies (d), Curie provides a multiparty data exchange device using secure multi-party computation techniques enhanced with (optional) differential privacy guarantees. This device ensures data and individual privacy. The guarantee provided by Curie is that all computations among members will respect their policies. To ensure data privacy, Curie includes cryptographic primitives such as homomorphic encryption (HE) and garbled circuits from the secure multi-party computation literature that allow members to perform computations on negotiated data with no disclosed data from any single member. At the end of the secure computation, all of the parties obtain a final predictive model based on their policy negotiations. To ensure the privacy of the individuals in the dataset, on which the final model is computed, Curie integrates differential privacy (DP). DP protects against an attacker who tries to extract a particular individual's data in the dataset from the final computed model at the end of the secure computation protocol.
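For intuition on the DP guarantee, the Laplace mechanism below adds calibrated noise to a released scalar. This is a minimal illustrative sketch, with function names and the sensitivity choice ours rather than Curie's actual mechanism:

```python
import random

def laplace_noise(scale: float) -> float:
    # A Laplace(0, scale) sample, drawn as the difference of two
    # exponential samples with rate 1/scale.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_release(value: float, sensitivity: float, epsilon: float) -> float:
    # Epsilon-differentially-private release of a scalar:
    # add Laplace noise with scale sensitivity/epsilon.
    return value + laplace_noise(sensitivity / epsilon)
```

Releasing a model statistic through a mechanism like `dp_release` bounds how much any single patient's record can shift the distribution of the published output.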
4 CURIE POLICY DESCRIPTION LANGUAGE
We now illustrate the format and semantics of the Curie Policy Language (CPL). A BNF description of CPL is presented in Appendix A. Turning to the example consortium in Figure 3 established with three members, each member defines its requirements for other members on a dataset having the columns of age, race, genotype, and weight (see Table 1). The criteria defined by members are used throughout to construct their local policies.

Table 1: An example of members' data exchange requirements.

Consortium member M1:
- Toward M2: desires to acquire complete data of users who are older than 25; shares its complete data.
- Toward M3: desires to acquire data of Asian users such that the Jaccard similarity of its age column and M3's age column is greater than 0.3; shares its complete data.

Consortium member M2:
- Toward M1: desires to acquire complete data; limits its share to EU and NATO citizen users if M1 is both a NATO and EU member and located in North America, otherwise shares only White users.
- Toward M3: desires to acquire complete data if M3 is a NATO member; shares its complete data.

Consortium member M3:
- Toward M1: desires to acquire complete data of users having genotype 'A/A'; shares complete data if the intersection size of its and M1's genotype column is less than 10, otherwise shares data of users that weigh more than 100 pounds.
- Toward M2: desires to acquire complete data; shares complete data if M2 is an EU member and its data size is greater than 1K.

Share and Acquisition Clauses. Curie policies are collections of clauses. The collection of clauses for partners defines the local policy of a member. The clauses allow each member to dictate a member-specific policy for each other member. Clauses have the following structure:

    clause tag : members : conditionals :: selections;

Clause tags are reference names for policy entries. Share and acquire are two reserved tags. These clauses are comprised of three parts. The first part, members, defines a list of members with whom to share and acquire.
This can be a single member or a comma-separated list of members. An empty member entry matches all members. The second part, conditionals, is a list of conditions controlling when this clause will be executed. A condition is a Boolean function which expresses whether the share or acquire is allowed or not. For instance, a member may define a condition where the data size is greater than a specific value. Only if all conditions listed in conditionals are true is this clause executed. The last part, selections, states what to share or acquire. It can be a list of filters on a member's data. For instance, a member may define a filter on a column of a dataset to limit acquisition to a subset of the dataset. More complex selections can be assigned using member-defined sub-clauses. A sub-clause has the following structure:

    tag : conditionals :: selections;

where tag is the name of the sub-clause; conditionals is, as explained above, a list of conditions stating whether this clause will be executed; selections is a list of filters or a reference to a new sub-clause. Complex data selection can be addressed with nested sub-clauses. CPL allows members to define multiple clauses. For instance, a member may share a distinct subset of data under different conditions. CPL evaluates multiple clauses in a top-down order. When the conditionals of a clause evaluate to false, it moves to the next clause until a clause is matched or it reaches the end of the policy file.

Conditionals and Selections. We present the use of conditionals and selections through policies with examples. Their format and semantics are detailed. Consider an example of two members, M1 and M2, within a consortium.
They dene their local policies as: @M1acquire : M2: :: s1; share : M2: :: ; @M2acquire : M1: :: ; share : M1: c1, c2:: fine-select ; fine-select : c3:: s2; fine-select : :: s3; 3 Session 3: Data Security and Privacy CODASPY ’19, March 25–27, 2019, Richardson, TX, USA 123 where c 1 ,c 2 and c 3 are conditionals, s 1 ,s 2 and s 3 are selections and fine-select is a tag dened by M2. The acquire clause of M 1 states that data is requested from M 2 after it applies s 1 selection (e.g., age > 25) to its data. In contrast, its share clause allows complete share of its data if M 2 requests. On the other hand, the acquisition clause of M 1 dictates requesting com- plete data from M 2 . However, M 2 allows data sharing if the acquisi- tion clause issued by M 1 holds c 1 c 2 conditions (e.g., is both NATO and EU member). Then, M 2 delegates selection to member-dened fine-select sub-clauses. fine-select states that if the request satises the c 3 condition (located in North America) then the request is met with the data that is selected by the s 2 selection (e.g., limits share of its data to NATO and EU member country citizens). Otherwise, it shares data that is specied by selection s3(White users). CPL supports selections through lters. A lter contains zero or more operations over data inputs describing the share and acquisi- tion criteria to be enforced. Operations are dened as keywords or symbols such as < , > , = , in , lik e , and so on. Selections and lters are dened in CPL as follows: selections::= <filters> | <tag> <filters> ::= <filter> [‘,’ <filters>] <filter> ::= <var> <operation> <value> | ‘’ Selections are executed when conditionals evaluated to be true. Conditionals can be consortium and dataset-specic. For instance, a member may require other members to be in a particular country or to be in an alliance such as NATO and to have their dataset size greater than a particular value. 
Such conditionals do not require any data exchange between members to be evaluated. However, members may want to incorporate a relation between their data and other members' data into their policies, as detailed next.

Data-dependent Conditionals. A member's decision on whether to share or to acquire data can depend on another member's data. Simply put, one example of a data-dependent conditional among two members could be whether the intersection size of two sets (e.g., a specific column of a dataset) is not too high. Considering such knowledge, a member can make a conditional decision about the share or acquisition of that data. For instance, consider a list of private IP addresses used for blacklisting domains. If a member knows that the intersection size is close to zero, then the member may dictate an acquire clause to request complete features from that member based on IP addresses [18].

CPL defines an evaluate keyword for data-dependent conditionals through functions on data. Data-dependent conditionals take the following form:

    conditionals ::= <var> '=' <value> [',' <conditionals>]
                  | 'evaluate' '(' <data_ref> ',' <alg_arg> ',' <thshold_arg> ')' [',' <conditionals>]
                  | ''

A member that uses the data-dependent conditionals defines a reference data (data_ref) required for such a computation, an algorithm (alg_arg), and a threshold (thshold_arg) that is compared with the output of the computation. CPL includes four algorithms for data-dependent conditionals (see Table 2). To be brief, intersection size measures the size of the overlap between two sets; the Jaccard index is a statistical measure of similarity between sets; Pearson correlation is a statistical measure of how much two sets are linearly dependent; and cosine similarity is a measure of similarity between two vectors. Each algorithm is based on a different assumption about the underlying reference data.
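For intuition, the four statistics compute as follows when run in the clear on plain lists. In Curie they are evaluated under the private protocols of Table 2, so neither member reveals its inputs; this sketch is illustrative only:

```python
import math

def intersection_size(a, b):
    # |Di ∩ Dj|
    return len(set(a) & set(b))

def jaccard(a, b):
    # |Di ∩ Dj| / |Di ∪ Dj|
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def pearson(x, y):
    # COV(Di, Dj) / (sigma_Di * sigma_Dj)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cosine(x, y):
    # (Di . Dj) / (|Di| |Dj|)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)
```

A policy conditional such as evaluate(local data, 'Jaccard', 0.3) then amounts to comparing the (privately computed) statistic against the stated threshold.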
However, central to all of them is to privately (without leaking any sensitive data) measure a relation between two members' data to offer an effective data exchange. We note that these algorithms are found to be effective in capturing input relations in datasets [18, 19].

Table 2: CPL data-dependent conditional algorithms. Two members of a consortium use the conditionals to compute the pairwise statistics. The members then use the output of the algorithm to determine whether to acquire or share data from another party. (Di and Dj are the inputs of a dataset, and σ is the standard deviation.)

    Pairwise alg.       | Output                    | Private protocol         | Proof
    Intersection size   | |Di ∩ Dj|                 | Intersection cardinality | [11]
    Jaccard index       | |Di ∩ Dj| / |Di ∪ Dj|     | Jaccard similarity       | [6]
    Pearson correlation | COV(Di, Dj) / (σDi σDj)   | Garbled circuits         | [25]
    Cosine similarity   | (Di · Dj) / (‖Di‖ ‖Dj‖)   | Garbled circuits         | [25]

Data-dependent conditionals are implemented through private protocols (as defined in Table 2). These protocols are implemented with the cryptographic tools of garbled circuits and private functions. Protocols preserve the confidentiality of data. That is, each member gets the output indicated in Table 2 without revealing their sensitive data in plain text. After the private protocol terminates, the output of the algorithm is compared with a threshold value set by the requester. If the output is below the threshold value, the conditional is evaluated to true. Turning to the above example, M3 joins the consortium. M1 and M2 extend their local policies for M3:

@M1 acquire : M3 : evaluate(local data, 'Jaccard', 0.3) :: race=Asian;
    share : M3 : :: ;
@M2 acquire : M3 : M3 in $NATO :: ;
share : M3: :: ;
@M3 acquire : M1 : :: Genotype = 'A/A' ;
    share : M1 : evaluate(local data, 'intersection size', 10) :: ;
share : M1: :: weight>150 ;
acquire : M2: :: ;
    share : M2 : M2 in $EU, size(data)>1K :: ;

The acquire clause of M1 defines a data-dependent conditional for M3. It defines a Jaccard measure on its local data through the evaluate keyword and sets its threshold value equal to 0.3. M3 agrees to share its local data with M1 if the intersection size of its local data is less than 10. Otherwise, it consults the next share clause defined for M1, which states that data of individuals weighing more than 150 pounds will be shared. All other share and acquire clauses are trivial. Members agree to share and acquire complete data based on data size (data size > 1K), alliance membership (e.g., NATO or EU member), and inputs (e.g., genotype).

Putting the pieces together, CPL allows members to independently define a data exchange policy with share and acquire clauses. The policies are dictated through conditionals and selections. This allows members to dictate policies in complex and asymmetric relationships. As defined in Section 3, CPL allows members to dictate partnership, share, acquisition, and data-dependent conditionals.

Policy Negotiation and Conflicts. Data exchange between members is governed by matching share and acquire clauses in each member's respective policies. Both share and acquire clauses state conditions and selections on the data exchanged. Consider two example local policies with a share clause @m2 (share : m1 : c1 :: s1) and a matching acquire clause @m1 (acquire : m2 : c2 :: s2). Curie's negotiation algorithm respects both the autonomy of the data owner and the needs of the requester. It conservatively negotiates share and acquire clauses such that it will return the intersection of the respective data sets in the resulting policy assignment. The resolved policy in this example is share : m1 : c1, c2 :: s1, s2, which states that the data exchange from m2 to m1 is subject to both the c1 and c2 conditionals, and the resulting sharing has the s1 and s2 selections on m2's data. This authoritative negotiation makes sure no member's data is shared beyond its explicit intent, regardless of how the other members' policies are defined. This is because a negotiation fulfilling the criteria for each clause is based on the union of the logical expressions defined in the two policies. Each member runs the negotiation algorithm for the members found in their member list. After all members terminate their negotiations, the negotiated policy is enforced in computations.

Table 3: Consortia constructed among members. Acquisition and share policies of members for each consortium are studied in Section 6.

    Policy ID | Consortium Name | Policy Definition                                                                          | Acquisition Policy | Share Policy
    P.1       | Single Source   | Each member uses its local patient dataset to learn the warfarin dose model.               | ✗ | ✗
    P.2       | Nation-wide     | Members in the same country establish a consortium based on state and country laws.        | ✓ | ✓
    P.3       | Regional        | Members in the same continent establish a consortium.                                      | ✓ | ✓
    P.4       | NATO-EU         | NATO and EU members establish a consortium independently based on their mutual agreements. | ✓ | ✓
    P.5       | Global          | Members exchange their complete data to build the warfarin dose model.                     | ✓ | ✓

5 DEPLOYMENT OF CURIE
To validate Curie in a real application, we integrated Curie into 24 medical institutions. Each institution wants to compute a warfarin dose model on the distributed dataset without disclosing the patient health-care records. Without deployment of Curie, institutions compute the warfarin dosage of a patient using a model computed on their local patient data. Curie first enables institutions to negotiate their data exchange requirements through CPL. In this, Curie allows members to construct various consortia wherein each member defines a data exchange policy for other members.
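The conservative negotiation rule of Section 4 can be sketched by encoding a clause as a pair of label sets (conditionals, selections). This toy encoding is ours: conjoining both members' constraints yields the intersection of the data either side would permit:

```python
def negotiate(share_clause, acquire_clause):
    # Conservative resolution of a matched (share, acquire) pair:
    # both members' conditionals must all hold (union of the labels)
    # and both selection filters apply, so the exchanged data is the
    # intersection of what owner and requester each permit.
    share_conds, share_sels = share_clause
    acq_conds, acq_sels = acquire_clause
    return (share_conds | acq_conds, share_sels | acq_sels)

# share: m1 : c1 :: s1  matched with  acquire: m2 : c2 :: s2
resolved = negotiate(({'c1'}, {'s1'}), ({'c2'}, {'s2'}))
```

The union of constraint labels is what makes the result conservative: adding a conditional or a selection can only shrink, never grow, the data that flows.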
The next step is to compute a privacy-preserving dose model such that each party does not learn any information about the patient records of other medical institutions and respects the policy negotiated. Curie implements a secure dose protocol through homomorphic encryption (HE) to enforce the policy agreements of the members. We next present the deployment of Curie to institutions (Section 5.1) and the integration of policy agreements in the warfarin dose model (Section 5.2).

5.1 Deployment Setup
Warfarin, known by the brand name Coumadin, is a widely prescribed (over 20 million times each year in the United States) anticoagulant medication. It is mainly used to treat (or prevent) blood clots (thrombosis) in veins or arteries. Taking high-dose warfarin causes thin blood, which may result in intracranial and extracranial bleeding. Taking low doses causes thick blood, which may result in embolism and stroke. Current clinical practices suggest a fixed initial dose of 5 or 10 mg/day. Patients regularly have a blood test to check how long it takes for blood to clot (international normalized ratio (INR)). Based on the INR, subsequent doses are adjusted to maintain the patient's INR at the desired level. Therefore, it is important to predict the proper warfarin dose for patients.

Consortium Members. 24 medical institutions from nine countries and four continents individually collected the largest patient data for predicting personalized warfarin dose (see Appendix D for details of members involved in the study).
Members collect 68 inputs from patients' genotypic, demographic, and background information, yet a long study concluded that eight inputs are sufficient for proper prescriptions [26].

Figure 4: Secure dose algorithm protocol: a member (Pi) starts the protocol; the procedures and message flow among members are highlighted in boldface. At the final phase, Pi is able to compute the dose model coefficients from the negotiated data.

Warfarin Dose Prediction Model. To determine the proper personalized warfarin dosage, a long line of work concluded with an algorithm based on an ordinary linear regression model [26]. The model is a function f : X → Y that aims at predicting targets of warfarin dose y ∈ Y given a set of patient inputs x ∈ X. We represent the patient dataset of each member as Di = {(x_i, y_i)}, i = 1..n, and a loss function ℓ : Y × Y → [0, ∞). The loss function penalizes deviations between true doses and predictions. Learning is then searching for a dose model f minimizing the average loss:

    L(D, f) = (1/n) Σ_{i=1..n} ℓ(f(x_i), y_i).   (1)

The dose model reduces to minimizing the average loss L(D, f) with respect to the parameters of the model f. The model is linear, i.e., f(x) = αx + β, and the loss function is the squared loss ℓ(f(x), y) = (f(x) − y)². The dose model gives as good or better results than other, more complex numerical methods and outperforms the fixed-dose approach² [26].
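Equation 1 with the squared loss reduces, for a single scalar input, to the few lines below. This is a simplification for illustration; the actual model uses eight patient inputs:

```python
def average_loss(data, alpha, beta):
    # Average squared loss of the linear model f(x) = alpha*x + beta
    # over (x, y) pairs: Equation 1 with the squared loss.
    return sum((alpha * x + beta - y) ** 2 for x, y in data) / len(data)
```

A perfect fit drives the average loss to zero, and learning amounts to searching for the (alpha, beta) minimizing it.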
We re-implemented the algorithm in Python by direct translation from the authors' implementation and found that the accuracy of our implementation has no statistically significant difference.

Consortia and Member Policies. We define consortia among medical institutions that state partnerships for data exchange. Table 3 summarizes the consortia. The consortia are defined based on statutes and regulations between members; regional and national partnerships are studied based on their countries [3, 17, 23, 34]. For example, the NATO allied medical support doctrine allows strategic relationships that are otherwise not obtainable by non-NATO members. Each member in a consortium exchanges data with other members based on its CPL policy. Various acquisition and share policies of CPL are studied via conditionals and selections in Section 6. We note that policy construction is a subjective enterprise. Depending on the nature and constraints of a given environment, any number of policies are appropriate. Such is the promise of policy-defined behavior; alternate interpretations leading to other application requirements can be addressed through CPL.

² The model has been released online at http://www.warfarindosing.org to help doctors and other clinicians predict the ideal dose of warfarin.

Session 3: Data Security and Privacy, CODASPY '19, March 25–27, 2019, Richardson, TX, USA

5.2 Privacy-preserving Dose Prediction Model

The computation of the local dose model of a medical institution is straightforward: a member calculates the dose model through Equation 2 using the patient data collected locally. To implement a privacy-preserving dose model among consortia members of medical institutions, we state the dose prediction formula of Equation 1 in matrix form and minimize it via maximum likelihood estimation:

$\beta = (X^\top X)^{-1} X^\top Y,$    (2)

where X is the input matrix, Y is the dose matrix, and β contains the coefficients of the dose model.
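As a minimal sketch of the closed-form solution in Equation 2 (using NumPy, with synthetic data standing in for real patient records; the variable names are illustrative, not from the paper's implementation):

```python
import numpy as np

# Hypothetical patient matrix X (n samples x m inputs) and dose vector Y.
rng = np.random.default_rng(0)
n, m = 200, 8                      # e.g., the eight warfarin inputs
X = rng.normal(size=(n, m))
beta_true = rng.normal(size=m)
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Equation 2: beta = (X^T X)^{-1} X^T Y (ordinary least squares).
# Solving the normal equations is preferred over explicitly inverting X^T X.
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# Sanity check against NumPy's built-in least-squares solver.
assert np.allclose(beta, np.linalg.lstsq(X, Y, rcond=None)[0])
```

Note that the same XᵀX and XᵀY products are exactly the local statistics each member later shares, which is what makes the distributed construction below possible.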
Curie allows members to collaboratively learn a dose model without disclosing their patient records and guarantees that data sharing complies with the policy as negotiated. As shown in Equation 3, each member translates its negotiated data into neutral input matrices [41]. In particular, the patient samples to be exchanged by each member are computed as input matrices X_1, ..., X_n and dose matrices Y_1, ..., Y_n. The transformation defines each member's local statistics $O_i = X_i^\top X_i$ and $V_i = X_i^\top Y_i$. The local statistics are the output of the negotiation of each member in a consortium. The aggregation of the local statistics corresponds to a negotiated dataset, which is the exact amount that a member negotiates to obtain from other members in a consortium. Curie constructs the dose algorithm of the negotiated dataset as a concatenation of members' local statistics as follows:

$X^\top X = [X_1 | \ldots | X_n]^\top [X_1 | \ldots | X_n] = \sum_{i=1}^{n} X_i^\top X_i = \sum_{i=1}^{n} O_i = O$

$X^\top Y = [X_1 | \ldots | X_n]^\top [Y_1 | \ldots | Y_n] = \sum_{i=1}^{n} X_i^\top Y_i = \sum_{i=1}^{n} V_i = V$    (3)

In Equation 3, a member computes the model coefficients using the sum of the other members' local statistics. The local statistics are m×m (and m×1) constant matrices, where m is the number of inputs (independent of the dataset size). Using this observation, a party computes the coefficients of the negotiated dataset:

$\eta_{\text{(negotiated)}} = (X^\top X)^{-1} X^\top Y = O^{-1} V$    (4)

In Equation 4, while the accuracy objective of the dose model is guaranteed using the coefficients obtained from the sum of local statistics, the exchange of clear statistics among parties may leak information about members' data. A member can infer knowledge about the distribution of each input of other members from the matrices O_i and V_i [14]. Furthermore, an adversary may sniff data traffic to control and modify exchanged messages. To solve these problems, we use homomorphic encryption (HE), which allows computation on ciphertexts [2].
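The equivalence stated in Equations 3 and 4, that summing members' local statistics yields the same coefficients as fitting on the pooled dataset, can be checked with a small NumPy sketch (the member datasets here are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 8
# Three hypothetical members, each holding a private dataset (X_i, Y_i).
members = [(rng.normal(size=(n_i, m)), rng.normal(size=n_i))
           for n_i in (100, 150, 80)]

# Each member shares only its local statistics O_i = X_i^T X_i, V_i = X_i^T Y_i.
O = sum(Xi.T @ Xi for Xi, Yi in members)   # m x m
V = sum(Xi.T @ Yi for Xi, Yi in members)   # length-m vector

# Equation 4: coefficients of the negotiated dataset, eta = O^{-1} V.
eta = np.linalg.solve(O, V)

# Identical to fitting on the concatenated (pooled) dataset directly.
X_all = np.vstack([Xi for Xi, _ in members])
Y_all = np.concatenate([Yi for _, Yi in members])
assert np.allclose(eta, np.linalg.lstsq(X_all, Y_all, rcond=None)[0])
```

Because only the m×m and m×1 statistics cross member boundaries, the communication cost is independent of the number of patient records each member holds.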
HE allows members to perform the joint computation of the function without additional communication beyond the data exchange itself. We note that HE by itself cannot preserve the confidentiality of data from multiple parties in centralized settings [40]. However, Curie implements a distributed privacy-preserving multi-party dose model, as shown in Figure 4. To illustrate, we consider an example session of n members authorized for data exchange in a consortium. In this example, a ring topology is used for secure group communication (i.e., P_i talks to P_{i+1}, and similarly P_n talks back to P_1). P_1 initially generates a pair of encryption keys using the homomorphic cryptosystem and broadcasts the public key to the members in its member list. P_1 then generates random matrices V_1, O_1 and encrypts them as E(O_1)_{K_1} and E(V_1)_{K_1} using its public key K_1. It starts the session by sending them to the next member in the ring. When the next member receives the encrypted message, it adds its local V_i and O_i matrices, the output of its policy reconciliation for P_1, through homomorphic addition and passes the result to the next member. The remaining members take similar steps. The secure computation executes one round per member, in which the computation for that member visits all other members. This allows Curie to enforce HE on the data shared for a particular member in each round, and it does not suffer the insecurities associated with centralized HE constructions [40]. At the final stage of the protocol, P_1 receives the summed statistics of O_i and V_i from P_n. P_1 decrypts the sum using its private key, subtracts the initial random values of V_1, O_1, and adds the true values used for the computation of its local dose model coefficients. The final O and V yield the coefficients of a dose model that respects P_1's policy negotiations. Other consortium members similarly start the protocol and compute their coefficients.
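The mask-then-unmask step of the ring protocol can be illustrated with a simplified sketch. For readability, plain matrix addition stands in for homomorphic addition on ciphertexts (a real deployment performs these additions under P_1's HE key, as Curie does with HElib), and the members' local statistics are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
# Hypothetical negotiated local statistics (O_i, V_i) of four ring members;
# parties[0] plays the role of P_1, the protocol initiator.
parties = [(rng.normal(size=(m, m)), rng.normal(size=m)) for _ in range(4)]

# P_1 masks the running sums with random matrices before starting the ring,
# so no hop ever sees another member's true statistics in the clear.
rand_O, rand_V = rng.normal(size=(m, m)), rng.normal(size=m)
acc_O, acc_V = rand_O.copy(), rand_V.copy()

# Each remaining member adds its statistics (homomorphic addition in Curie).
for O_j, V_j in parties[1:]:
    acc_O += O_j
    acc_V += V_j

# Back at P_1: decrypt (implicit here), remove the mask, add P_1's true values.
O = acc_O - rand_O + parties[0][0]
V = acc_V - rand_V + parties[0][1]

# The unmasked sums equal the aggregate of all members' true statistics.
assert np.allclose(O, sum(Oj for Oj, _ in parties))
assert np.allclose(V, sum(Vj for _, Vj in parties))
```

One such round is run per member, with that member's key and mask, which is how the protocol avoids the single-decryptor weakness of centralized HE aggregation.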
We present the security analysis of the dose protocol in Appendix C and show its differentially-private extension in Appendix B.

6 EVALUATION

This section details the operation of Curie through policies. We show how flexible data exchange policies are implemented and operated. We focus on the following questions: (1) What are the performance trade-offs in configuring CPL? (2) Can members reliably use Curie to integrate various policies? (3) Do members improve the accuracy of dose predictions with the use of CPL? The answers to the first two questions are addressed in Section 6.1, and the last question is answered in Section 6.2. As detailed throughout, Curie allows 50 members to compute the privacy-preserving model using 5K data samples with 40 inputs in less than a minute. We also show how an algorithm with flexible data exchange policies can improve, often substantially, the accuracy of the warfarin dose model.

Experimental Setup. The experiments were performed on a cluster of machines with 32 GB of maximum memory and a 16-core Intel Xeon CPU at 1.90 GHz, where we use one core to obtain a lower-bound estimate. Each member is simulated on a server that stores its data. The secure computation protocols of Curie are implemented using the open-source HElib library [4]. We set the security parameter of HElib to 128 bits. The multiplication level is optimized per member to increase the number of allowed homomorphic operations without decryption failure and to reduce the computation time. We validate the accuracy of the dose model in the various consortia defined in Table 3, with members defining different data exchange policies. The dataset used in our experiments contains 5700 patient records from 21 members.
Figure 5: CPL negotiation cost - costs associated with a varying number of members in a consortium. Each member defines an asymmetric share and acquisition policy for other members. The number of members in the warfarin consortia is marked with red circles.

Figure 6: CPL selections and data-dependent conditional costs - costs associated with varying members and algorithms. All consortia members agree on a policy including a different data-dependent conditional and selections over one input having 200 samples.

Figure 7: CPL performance on the privacy-preserving and differentially private protocol - all members define an asymmetric share and acquisition policy through selections and conditionals. The agreements of CPL policies between consortia members are studied with different numbers of consortia members, data samples, and input sizes. (Std. dev. of ten runs is ±3.6 and ±0.3 sec. with and without homomorphic key generation.)

Dose model accuracy of each member is validated with Mean Absolute Percentage Error (MAPE).
MAPE measures how far, in percentage terms, predicted dosages are from the true dosage. Lower values indicate better quality of treatment.

6.1 Performance Evaluation

We present the costs associated with various Curie mechanisms. We illustrate the cost of CPL in policy negotiations, in the use of data-dependent conditionals, and in the dose algorithm.

6.1.1 CPL Benchmarks. Our first set of experiments characterizes the policy construction and negotiation costs. Various consortia and policies are instrumented to analyze the overhead in the number of messages and the time required to compute the CPL selections and data-dependent conditionals. All costs not specific to the policies (e.g., network latency) are excluded from the measurements. The benchmark results are summarized in Figures 5 and 6 and discussed below.

Figure 5 shows the number of messages required for policy construction for different consortia sizes. The number of members in the warfarin study is also labeled. For instance, the NATO consortium has 13 members: ten members from the U.S. and three from the UK. The experiments illustrate upper-bound results wherein each member defines a different share and acquisition policy for each other member (i.e., asymmetric relations). Here, each member sends an acquisition policy request to the consortium members. When a member receives the acquisition request, it reconciles the request with its share policy, and the negotiation output message is returned. Varying the number of selections and conditionals dictated by the members does not require any additional messages. For instance, the acquisition request of a member includes arguments when conditionals are defined (e.g., reference data and a threshold value for data-dependent conditionals such as pairwise Jaccard distance), and the result is returned with the negotiation output message. However, the use of selections and data-dependent conditionals brings additional processing cost, as detailed next.
Figure 6 shows the costs associated with the use of CPL selections and data-dependent conditionals. All members dictate data-dependent conditionals and selections on a single input. The members' input size for the data-dependent conditional computations is set to 200 real values, the average number of inputs found in members' datasets. Since selections and conditionals reconcile contradictions between acquisition and share policies, they do not require any additional computation overhead and yield a processing time of milliseconds. However, the time associated with the data-dependent conditionals depends on the protocol of the associated secure pairwise algorithm. In our experiments, cosine similarity and intersection size exhibited shorter computation times than Pearson correlation and Jaccard distance. Overall, we found that 25 members compute the metrics in less than 18 seconds. Note that the results serve as an upper bound in which all members define a set of selections and a data-dependent conditional on one input.

Figure 8: The implication of policies on model accuracy - errors are validated in various consortia through data exchange policies. Panels: (a) P.1 single source, no consortium (members treat their own patients); (b) cross-border, no consortium (members treat other members' patients); (c) P.2 nation-wide (U.S.) consortium; (d) P.3 regional consortium; (e) P.4 NATO-EU consortium; (f) P.5 global consortium. Figure 8(c-f): the local acquisition policies of members comply with the sharing policy within a consortium (i.e., members acquire the complete data of the consortia members; std. devs. of errors are within 5%, if not illustrated).

6.1.2 Dose Model Benchmarks. Our second series of experiments characterizes the impact of CPL on the average time of computing the privacy-preserving dose model with varying numbers of members and dataset sizes. Though the warfarin study includes eight inputs, evaluations are repeated with input sizes of 8, 16, 24, 32, and 40 across various dataset sample sizes for completeness. The input and sample size together represent the total dataset shared for a member as a result of the policy agreements. Our experiments show that 80% of the computation overhead is attributed to HE key generation. The cost of differential privacy takes microseconds, as members can calculate the (optional) differentially private algorithm model at the end of the secure dose protocol. Computations are instrumented to classify the overheads incurred by key generation, encryption, decryption, and evaluation. We next present the costs with and without key generation to study the impact of the number of members and the data size. Figure 7 (a-b) presents the computation cost with a varying number of members. Each member's dataset includes 5000 data samples acquired as a result of the policy negotiations. Figure 7 (a) presents the total computation time excluding HE key generation. There is a linear increase in time with the growing number of members.
This is the fundamental cost of the encryption and evaluation operations, dominated by matrix encryption and addition. To profile the key generation cost, in Figure 7 (b), we conducted similar experiments. The cost increases with each input size because of the key generation overhead. The increase is quadratic because the number of slots (plaintext elements) is set to the square of the input size so as not to lose any data during input conversion. It is important to note that this cost is independent of the number of members because a member generates the key only once per consortium computation. We note that the time overhead of key generation is not a limiting factor, as members may generate keys before a consortium is established. In Figure 7 (c-d), we show the costs associated with different numbers of data samples. The number of members in a consortium is set to 20. Similar to the previous experiments, key generation dominates the computation costs. Our experiments also showed no relationship between the cost and the number of samples. That is, even though the number of data samples increases, the overhead is amortized over the operations on the local statistics (which are square matrices of the input size in the warfarin dataset); thus, the time of computing the dose algorithm converges to the number of dataset inputs. This explains the similar trends observed in the plots.

6.2 Effectiveness of Policies

We validate the performance of the privacy-preserving dose model quantitatively and qualitatively. For the warfarin study, this translates to the following questions: How do policies impact the accuracy of members' warfarin dose predictions? (Section 6.2.1), and Do policies help to prevent the adverse impacts of dose errors on patient health? (Section 6.2.2).

6.2.1 Implications of CPL on Model Accuracy. In our first set of experiments, we validate how well a member prescribes warfarin doses for its local patients and for patients of the consortium members without using CPL.
These results are used as a baseline for comparison across varying consortia and data exchange policies throughout. Figure 8 (a) sought to identify the local algorithm errors (P.1). The errors significantly differ between countries and between members of the same country (depicted as M1 and M2 in the U.S.). The poor results are due to homogeneous data: all the inputs in these countries have similar traits. For instance, similar age and ethnicity in a dataset produce overfitted results for its local patients. These findings are validated with the use of local algorithms for the treatment of other countries' patients. As illustrated in Figure 8 (b), the dose errors are significantly higher for particular countries' patients. The results indicate that improvement in the dose predictions for local patients and members' patients lies in the creation of data exchange policies that increase patient diversity.

The next experiments measure the impact of CPL in nation-wide (P.2), regional (P.3), NATO-EU (P.4), and global (P.5) consortia. Each member creates a local acquisition policy to acquire the complete data of the consortia members (i.e., the acquisition policy of a consortium member complies with the share policy of the requested member). We make three major observations. First, varying partnerships yield different dose accuracies. For instance, members of the nation-wide consortium get better dose accuracy than their local results. This result is validated through nationwide consortia and a single member (M1) in the United States (see Figure 8 (c)). Second, supporting the previous findings, all regional (excluding Asia) and NATO-EU policies decrease the error both for the treatment of their own patients and for other countries' patients (see Figure 8 (d-e)). However, the Asia consortium results in unexpected dose errors for the treatment of other regions' patients. This is because the nation-wide, regional, and NATO-EU policies include patient populations with different characteristics; thus the data obtained through policy negotiations better generalize to the dosages. In contrast, the Asia collaboration lacks large enough White and Black groups. Third, the global consortium results in higher dose errors when evaluated for particular countries such as Brazil and Taiwan (see Figure 8 (f)).

Table 4: An exploration of CPL policies in the global consortium (illustrated in plain language). Each member defines an asymmetric local policy based on its data diversity. The agreement of share and acquisition policies is depicted as a policy clause in a single row. The agreement result of each member for other members is not presented for brevity.

| Member | Agreement of policy negotiations |
|---|---|
| U.S. | {(Race="Asian") (EVALUATE(age)) (height<160) (weight<65) (CYP2C9 IN (*2/*2, *2/*3)) (Amiodarone="Y") (Enzyme="Y")} |
| Brasil | {(Race="Asian") (height<165) (CYP2C9 IN (*2/*2, *2/*3)) (EVALUATE(Amiodarone)) (Enzyme="Y")} |
| UK | {(Race≠"White") (age BETWEEN 20-29 AND >80) (height<165) (60<weight<100) (EVALUATE(CYP2C9)) (Amiodarone="Y") (Enzyme="Y")} |
| Israel | {(Race≠"White") (height<160) (weight<60) (CYP2C9=*3/*3) (Amiodarone="Y") (Enzyme Inducer="Y")} |
| Taiwan | {(Race=All) (age BETWEEN 20-29) (height>170) (weight>65) (CYP2C9 IN (*1/*2, *2/*2, *2/*3, *3/*3)) (VKORC1="G/G") (Amiodarone="Y") (Enzyme="Y")} |
| S. Korea | {(Race=All) (age BETWEEN 20-29) (height>165) (weight>60) (CYP2C9 IN (*1/*2, *2/*2, *2/*3, *3/*3)) (VKORC1="G/G") (Amiodarone="Y") (Enzyme="Y")} |
To conclude, while CPL is effective in reducing the dose error of a member, the results highlight the need for systematic use of CPL through selections and conditionals to obtain better results.

In the next experiments, each member dictates a different acquisition policy based on its racial groups. Members aim at an ideal patient population uniformity. To do so, each member defines a local acquisition policy and negotiates it with other members. Each member sets its share policy to the conditionals of being in the same consortium and having a data size greater than 200; thus, the policy of each member is asymmetric. Table 4 shows a simplified notation of the policy agreements in the global consortium. For instance, a member having a small number of White patients defines selections to acquire solely that group, and a member having enough patients for all genotypes sets data-dependent conditionals to obtain patient inputs that are not similar to its own data samples (e.g., it acquires different genotypes). Figure 9 presents a subset of the results on dose errors per patient race. The errors for the other races are similar for each member. The results without CPL conditionals and selections are plotted as a dashed line for comparison. We find that members can improve the dose accuracy with the use of policies. We note that the use of different data-dependent conditionals defined in evaluate does not result in a statistically significant accuracy gain.

6.2.2 Implications of CPL on Patient Health. We examine the impact of the dose errors found in the previous section to better quantify the effectiveness of policies on patient health. To identify the adverse effects of warfarin, we use a clinical study to evaluate the clinical relevance of prediction errors [9] and a medical guide to identify the consequences of over- and under-prescriptions [16]. We define errors that are inside and outside of the warfarin safety window, and the under- or over-prescriptions.
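The error taxonomy used in this evaluation (safety window, under-prescription, over-prescription) together with MAPE can be sketched as follows; the dose values below are hypothetical, and the 20% weekly safety window follows the clinical studies the paper cites [26, 27]:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error between true and predicted doses."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))

def dose_risk(weekly_true, weekly_pred, window=0.20):
    """Fractions of weekly doses that are under-, safely, or over-prescribed.

    A prediction is in the safety window if it falls within `window`
    (20% by default) of the clinically-deduced weekly dose.
    """
    weekly_true = np.asarray(weekly_true, float)
    weekly_pred = np.asarray(weekly_pred, float)
    low, high = (1 - window) * weekly_true, (1 + window) * weekly_true
    under = np.mean(weekly_pred < low)
    over = np.mean(weekly_pred > high)
    return under, 1.0 - under - over, over

# Hypothetical weekly doses (mg) for four patients.
true = np.array([35.0, 28.0, 42.0, 30.0])
pred = np.array([36.0, 20.0, 55.0, 31.0])
u, sw, o = dose_risk(true, pred)
assert (u, sw, o) == (0.25, 0.5, 0.25)
```

These are the two quantities reported below: MAPE for model accuracy, and the under/safety-window/over split for health-related risk.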
We consider weekly errors for each patient because using weekly values eliminates the errors posed by the initial (daily) dose. The weekly dose is in the safety window if an estimated dose falls within 20% of its corresponding clinically-deduced value [26, 27].

Figure 9: Dose accuracy of members using the CPL policies defined in Table 4. Members construct a model per race after they reconcile the policies. The dashed line is the average error found without the use of conditionals and selections in policies.

Table 5: Impact of policies on health-related risks. Results are from the global consortium patients using the policy agreement of a member located in the U.S. The member uses the policy defined in Table 4. (U: under-prescription, SW: safety window, O: over-prescription)

| Consortium | U | SW | O | Selections | Conditionals |
|---|---|---|---|---|---|
| Single Source | 37.7% | 43.4% | 18.8% | ✗ | ✗ |
| Nation-wide | 18.9% | 52.3% | 28.8% | ✓ | ✓ |
| NATO | 19.3% | 51.5% | 29.2% | ✓ | ✓ |
| Regional | 19% | 51.3% | 29.7% | ✓ | ✓ |
| Global | 21.2% | 46.8% | 32% | ✓ | ✓ |

Deviations falling outside of the safety window are under- or over-prescriptions and cause health-related risks. Table 5 presents the percentages of patients who fall in the safety window and of over- and under-prescriptions under the varying policies of a member. We find that the use of CPL increases the number of patients in the safety window. For instance, a member has 43.4% of its patients in the safety window using its local data (single-source model), and the member increases the percentage of patients in the safety window with varying consortia and policies; for instance, it is 52.3% in the nation-wide consortium. We conclude that CPL can be useful in preventing errors that introduce health-related risks.

7 LIMITATIONS AND DISCUSSION

One requirement for correctly interpreting CPL policies is a shared schema for solving compatibility issues among members.
For instance, members may interpret the data columns (e.g., column names and types) differently or may not have information about consortium members (e.g., the membership status of an alliance). CPL implements a shared schema describing column names, their types, and explanations of data fields, as well as consortium-specific information. Members can negotiate the schema similarly to the policy negotiations and revise the schema based on the schema of a negotiation initiator.

CPL provides a set of data-dependent statistical functions (e.g., cosine similarity) to compute pairwise statistics among members' local data. However, there might be a need for other functions that help members decide their data exchange policies. For example, data exchange among finance companies may require calculating the similarity between data distributions. Future work will investigate the integration of different data-dependent statistics into CPL.

Lastly, we did not focus on the reasons for the policy impacts on the prediction success of the dose algorithm and its adverse outcomes on patient health over time. While our evaluation results showed that members can express both complex relations and constraints on data exchange through CPL policies, members must establish true partnerships to improve the prediction model accuracy. While this explanation matches both our intuition and the experimental results, a further domain-specific formal analysis is needed. We plan to pursue this in future work.

8 RELATED WORK

Policy has been used in several contexts as a vehicle for representing the configuration of secure groups [30], network management [35], threat mitigation [18], access control [13], and data retrieval systems [15]. These approaches define a schema for their target problem and do not consider the challenges in secure data exchange.
In contrast, Curie defines a formal policy language to dictate the data exchange requirements of members and enforces the agreement in collaborative ML settings. On the other hand, secure computation on sensitive proprietary data has recently attracted attention. Federated learning [20, 37], anonymization [14], multi-site statistical models [10], secure multi-party computation [28], and secure and differentially-private multi-party computation [1] have started to shed light on this issue. Such techniques have been used for both the training and classification phases in deep learning [36], clustering [22], and decision trees [8]. To allow programmers to develop such applications, secure computation programming frameworks and languages have been designed for general purposes [7, 14, 24, 32, 33]. However, these approaches do not consider complex relationships among members and assume members share all their data or nothing. We view our efforts in this paper as complementary to much of this work. CPL can be integrated into these frameworks to establish partnerships and manage data exchange policies before a computation starts.

9 CONCLUSIONS

We presented Curie, which provides a novel policy language called CPL to define the specifications of data exchange requirements securely for use in collaborative learning settings. Members can assert who and what to exchange separately for data sharing and data acquisition policies. This allows members to efficiently dictate their policies in complex and asymmetric relationships through selections, conditionals, and pairwise data-dependent statistics. We validated Curie in an example real-world healthcare application through varying policies of consortia members. A secure multi-party and (optionally) differentially-private model is implemented to illustrate the policy/performance trade-offs.
Curie allowed 50 different members to efficiently compute a privacy-preserving model using 5K data samples with 40 inputs in less than a minute. We also showed how an algorithm with effective use of data exchange policies could improve the accuracy of the dose prediction model. Future work will investigate the use of Curie in other collaborative learning settings, exploring different statistics for data-dependent conditionals, and will explore its performance trade-offs by integrating it into other off-the-shelf secure computation frameworks.

ACKNOWLEDGMENT

Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). This work is also partially supported by the US National Science Foundation (NSF) under grant numbers NSF-CNS-1718116 and NSF-CAREER-CNS-1453647. The statements made herein are solely the responsibility of the authors. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

REFERENCES

[1] Abbas Acar et al. 2017. Achieving Secure and Differentially Private Computations in Multiparty Settings. In IEEE Privacy-Aware Computing (PAC).
[2] Abbas Acar, Hidayet Aksu, A. Selcuk Uluagac, and Mauro Conti. 2017. A Survey on Homomorphic Encryption Schemes: Theory and Implementation. CoRR abs/1704.03578. arXiv:1704.03578 http://arxiv.org/abs/1704.03578
[3] American Recovery and Reinvestment Act of 2009. 2017. https://en.wikipedia.org/wiki/American_Recovery_and_Reinvestment_Act_of_2009. [Online; accessed 01-June-2018].
[4] An Implementation of Homomorphic Encryption. 2017. https://github.com/shaih/HElib. [Online; accessed 01-January-2017].
[5] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. 2018. Large Scale Distributed Neural Network Training Through Online Distillation. arXiv preprint arXiv:1804.03235.
[6] Carlo Blundo et al. 2013. EsPRESSo: Efficient Privacy-preserving Evaluation of Sample Set Similarity. In Data Privacy Management Security.
[7] Dan Bogdanov et al. 2016. Rmind: a Tool for Cryptographically Secure Statistical Analysis. IEEE Transactions on Dependable and Secure Computing.
[8] Raphael Bost, Raluca Ada Popa, Stephen Tu, and Shafi Goldwasser. 2015. Machine Learning Classification over Encrypted Data. In NDSS.
[9] Z. Berkay Celik, David Lopez-Paz, and Patrick McDaniel. 2016. Patient-Driven Privacy Control through Generalized Distillation. IEEE Symposium on Privacy-Aware Computing.
[10] Fida K. Dankar. 2015. Privacy Preserving Linear Regression on Distributed Databases. Transactions on Data Privacy.
[11] Emiliano De Cristofaro, Paolo Gasti, and Gene Tsudik. 2012. Fast and Private Computation of Cardinality of Set Intersection and Union. In Cryptology and Network Security.
[12] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. 2012. Large Scale Distributed Deep Networks. In NIPS.
[13] Li Duan, Yang Zhang, Chen, et al. 2016. Automated Policy Combination for Secure Data Sharing in Cross-Organizational Collaborations. IEEE Access.
[14] Khaled El Emam et al. 2013. A Secure Distributed Logistic Regression Protocol for the Detection of Rare Adverse Drug Events. American Medical Informatics.
[15] Eslam Elnikety et al. 2016. Thoth: Comprehensive Policy Compliance in Data Retrieval Systems. In USENIX Security.
[16] U.S. Food and Drug Administration. 2017. Medication Guide, Coumadin (warfarin sodium). http://www.fda.gov. [Online; accessed 01-June-2018].
[17] NATO Standard Allied Joint Doctrine for Medical Support. 2017.
http://www. nato.int. [Online; accessed 01-June-2018]. [18] Julien Freudiger, Emiliano De Cristofaro, and Alejandro E Brito. 2015. Controlled Data Sharing for Collaborative Predictive Blacklisting. In DIMVA. [19] Roberto Garrido-Pelaz et al . 2016. Shall We Collaborate?: A model to Analyse the Benets of Information Sharing. In ACM Workshop on Information Sharing and Collaborative Security. [20] Robin C Geyer, Tassilo Klein, and Moin Nabi. 2017. Dierentially Private Fed- erated Learning: A Client Level Perspective. arXiv preprint arXiv:1712.07557 (2017). 10 Session 3: Data Security and Privacy CODASPY ’19, March 25–27, 2019, Richardson, TX, USA 130 [21] Oded Goldreich. 2009. Foundations of Cryptography: Basic Applications. Cam- bridge university press. [22] Thore Graepel, Kristin Lauter, and Michael Naehrig. 2012. ML Condential: Machine Learning on Encrypted Data. In Information Security and Cryptology. [23] Health Information Technology for Economic and Clinical Health Act. 2017. https://en.wikipedia.org. [Online; accessed 01-June-2018]. [24] Wilko Henecka et al . 2010. TASTY: Tool for Automating Secure Two-party Computations. In ACM CCS. [25] Yan Huang et al . 2011. Faster Secure Two-Party Computation Using Garbled Circuits. In USENIX Security Symposium. [26] International Warfarin Pharmacogenetics Consortium. 2009. Estimation of the Warfarin Dose with Clinical and Pharmacogenetic Data. The New England Journal of Medicine (2009). [27] Stephen E Kimmel et al . 2013. A pharmacogenetic versus a Clinical Algorithm for Warfarin Dosing. New England Journal of Medicine (2013). [28] Yehuda Lindell and Benny Pinkas. 2009. Secure Multiparty Computation for Privacypreserving Data Mining. Journal of Privacy and Condentiality (2009). [29] Chang Liu et al . 2015. Oblivm: A programming Framework for Secure Computa- tion. In Security and Privacy. [30] Patrick McDaniel and Atul Prakash. 2006. Methods and Limitations of Security Policy Reconciliation. ACM TISSEC (2006). 
[31] Payman Mohassel and Yupeng Zhang. 2017. SecureML: A system for scalable privacy-preserving machine learning. In Security and Privacy (SP). [32] Olga Ohrimenko et al . 2016. Oblivious Multi-Party Machine Learning on Trusted Processors. In USENIX Security Symposium. [33] Aseem Rastogi et al . 2014. Wysteria: A programming language for generic, mixed-mode multiparty computations. In IEEE Security and Privacy (SP). [34] European Commission Report. 2017. Overview of the National Laws on Electronic Health Records in the EU Member States. http://ec.europa.eu. [Online; accessed 01-June-2018]. [35] Ana C Riekstin et al . 2016. Orchestration of Energy eciency Capabilities in Networks. Journal of Network and Computer Applications (2016). [36] Reza Shokri et al. 2015. Privacy-preserving Deep Learning. In ACM CCS. [37] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. 2017. Federated Multi-Task Learning. In NIPS. [38] Daniel J Solove and Paul M Schwartz. 2015. Information Privacy Law. Aspen. [39] Marten Van Dijk and Ari Juels. 2010. On the Impossibility of Cryptography Alone for Privacy-preserving Cloud Computing. HotSec (2010). [40] Marten Van Dijk and Ari Juels. 2010. On the Impossibility of Cryptography Alone for Privacy-preserving Cloud Computing. In USENIX Hot Topics in Security. [41] Fang-Jing Wu, Yu-Fen Kao, et al . 2011. From Wireless Sensor Networks Towards Cyber Physical Systems. Pervasive and Mobile Computing (2011). [42] Xi Wu et al . 2015. Revisiting Dierentially Private Regression: Lessons from Learning Theory and their Consequences. arXiv:1512.06388 (2015). [43] Jun Zhang et al . 2012. Functional Mechanism: Regression Analysis under Dier- ential Privacy. VLDB (2012). A CURIE POLICY LANGUAGE This section presents the Backus Naur Form of Curie data exchange policy language. 
curie_policy ::= statements
statements ::= statement ';' [statements]
statement ::= share_clause | acquire_clause | attribute | sub_clause

; share clauses are defined as follows:
share_clause ::= 'share' ':' [members] ':' [conditionals] '::' selections

; acquisition clauses are defined as follows:
acquire_clause ::= 'acquire' ':' [members] ':' [conditionals] '::' selections

; attributes are defined as follows:
attribute ::= identifier ':=' '<value>' | identifier ':=' '<value_list>'

; user-defined sub-clauses are defined as follows:
sub_clause ::= tag ':' [conditionals] '::' selections

; conditionals, including data-dependent functions, are defined as follows:
conditionals ::= var '=' value [',' conditionals]
              | 'evaluate' '(' data_ref ',' alg_arg ',' threshold_arg ')' [',' conditionals]
              | ''
selections ::= filters | tag
filters ::= filter [',' filters]
filter ::= var operation value | ''
data_ref ::= '&' identifier
alg_arg ::= algorithms
algorithms ::= 'Intersection size' | 'Jaccard index' | 'Pearson correlation' | 'Cosine similarity'
threshold_arg ::= floating_point_number
operation ::= '=' | '<' | '>' | '!=' | 'in'
value_list ::= '{' value '}' [',' value_list]
members ::= member [',' members]
member ::= identifier | ''

; for completeness, trivial items are defined as follows:
identifier ::= word
var ::= '$' identifier
value ::= string
tag ::= word
string ::= '"' stringchars '"'
stringchars ::= stringletter [stringchars]
stringletter ::= 0x10 | 0x13 | 0x20 | ... | 0x7F
word ::= char [word]
char ::= letter | digit
letter ::= 'A' | 'B' | ... | 'Z' | 'a' | 'b' | ... | 'z' | 0x80 | 0x81 | ... | 0xFF
digit ::= '0' | '1' | ... | '9'
floating_point_number ::= decimal_number '.' [decimal_number]
decimal_number ::= digit [decimal_number]
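For concreteness, the following is an illustrative policy sketch that conforms to the grammar above; the member name (hospital-a), consortium tag (nato-members), variables, and the similarity threshold are hypothetical examples, not drawn from the paper's evaluation:

```text
share : nato-members : evaluate(&records, Jaccard index, 0.5) :: $gender = "male";
acquire : hospital-a : : $dose > "30";
records := {"warfarin-cohort"};
```

Informally, the first statement shares male patients' data with NATO members subject to a data-dependent Jaccard-similarity conditional on the referenced dataset, and the second acquires only high-dose records from hospital-a (its conditionals field is left empty).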
B DIFFERENTIALLY-PRIVATE DOSE ALGORITHM
We presented how members compute a privacy-preserving dose model on negotiated data through their policies. In this section, we consider individual privacy, which allows a member to guarantee no information leakage about a targeted individual (i.e., a patient) involved in the computation. Specifically, while members compute a secure dose model using the data obtained as a result of their policy negotiations, they also ensure that an adversary cannot infer whether any particular individual is included in the computations used to build the dose algorithm. In the warfarin study, this corresponds to a differentially-private secure dose algorithm on shared data.

To implement a differentially-private secure algorithm, we use the functional mechanism technique [42, 43]. The technique accepts a dataset D, an objective function f_D(η), and a privacy budget ϵ as input, and returns the ϵ-differentially-private coefficients η̄ of an algorithm. The intuition behind the functional mechanism is to perturb the objective function of the optimization problem. The perturbation includes both a sensitivity analysis and Laplacian noise insertion, as opposed to perturbing the results via differentially-private synthetic data generation.
To inter-operate the functional mechanism with the secure dose protocol, members convert each column from [min, max] to [-1, 1] before negotiation starts. This preprocessing ensures that sufficient noise is added to the objective function on the negotiated data. Then, members proceed with the protocol. At the final stage of the secure algorithm protocol, a member obtains the clear statistics O = XᵀX and V = XᵀY, and the input dimension d, which is the size of O or V. These statistics are exactly the quantities minimized in the objective of the functional mechanism [43]. Using these statistics, a member may (optionally) compute an ϵ-differentially-private secure algorithm.

Figure 1: Non-private secure algorithm (Non-DP) vs. differentially-private secure algorithm (DP) performance of a member in the U.S., measured against various policies depicted in Figure 8. (The plot shows Error (%) versus the privacy budget ϵ ∈ {0.25, 5, 10, 20, 50, 100} for the Global, NATO, and Single Source consortia, each in DP and Non-DP variants.)
Dierential Privacy Results.
To protect individual privacy in
secure dose algorithm, members may compute the dierentially-
private secure algorithm on their negotiated data. This section
presents the results of using the dierential-private secure algo-
rithm (DP) instead of using secure dose algorithm (Non-DP). To
establish a baseline performance, we constructed non-private se-
cure algorithms of a member. We then build the dierential-private
secure algorithm for dierent privacy budgets (
ϵ
= 0.25, 1, 5, 20, 50
and 100). Finally, we compare the results of two algorithms through
dierent policies of a member. Figure 1 shows the results of a mem-
ber in the U.S. that applies both algorithms to predict the dosage.
The algorithms are constructed for the single source, NATO, and
global consortia. In this, the member dictates acquisition policy
for complete data and other members complies with their share
policy. The average error over 100 distinct model for each budget
value is reported. The use of DP degrades the accuracy as the
ϵ
value increases. For instance, the accuracy improvement obtained
through NATO policy over single source degrades with the privacy
budget less than or equal to 20. We note that other consortia and
policies with use of selections and conditionals show similar eect
on the dose accuracy.
C ANALYSIS OF THE DOSE ALGORITHM

We present the security and privacy guarantees that the dose algorithm provides to all members through the exchange of encrypted integrated statistics (the O_i = XᵀX and V_i = XᵀY matrices). Since all data exchange among parties is encrypted through the use of HE, the security of the algorithm against any adversary outside the authorized parties rests on the underlying HE cryptosystem.
An adversary not involving the session initiator. Assume for now that the session initiator does not collude with other parties. Loosely speaking, since all computations are performed on encrypted data, none of the parties learns anything about the other parties' inputs. Consider a party P_{i+1} in Figure 4. The party P_{i+1} holds the public key K generated by the session initiator and the encryption of the local statistics of the previous parties, M_i = (E(O_i)_K, E(V_i)_K). Its input is (V_{i+1}, O_{i+1}) and its output is M_{i+1} = (E(O_i + O_{i+1})_K, E(V_i + V_{i+1})_K). A simulator S selects random values (V′_{i+1}, O′_{i+1}) for its own inputs and uses the public key generated by the session initiator. Then, the simulator S performs the homomorphic operations on M_i and outputs M′_{i+1} = (E(O_i + O′_{i+1})_K, E(V_i + V′_{i+1})_K). Here, we assume the underlying HE is semantically secure. Therefore, the output M′_{i+1} of the simulator is computationally indistinguishable from the output M_{i+1} of a real execution of the protocol for every input pair. Therefore, by the definition in [21], the protocol privately computes the function in the presence of one semi-honest corrupted party. The extension to multiple corrupted semi-honest adversaries is straightforward, as the only difference is that the view of a subset of parties contains many encrypted messages. Since the semantic security of the underlying HE holds for any pair of these encrypted messages, no information leaks about the corresponding plaintexts.
An adversary involving the session initiator. We now consider the case when the session initiator is corrupted. The corrupted parties, including the session initiator, can infer the input of an honest party if the predecessor (previous party) and successor (next party) of that honest party are both corrupted. We consider the possible cases for data leakage: (1) 2-party: the session initiator is corrupted and the other party is honest. In this case, the predecessor and successor of the honest party are both the corrupted session initiator; therefore, the input of the honest party is learned by the corrupted party. (2) 3-party: a corrupted session initiator is either the predecessor or the successor; thus, it can learn the input of one of the honest parties only if the other party is also corrupted. (3) n-party (n > 3): to learn an honest party's input, at least two corrupted parties must be placed immediately before and after the honest party.
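To make the chained aggregation analyzed above concrete, here is a toy additively homomorphic sketch of how a party folds its statistics into the running ciphertexts under the initiator's public key without decrypting. This is our illustration using textbook Paillier with deliberately small parameters, not the HE library used in the paper:

```python
import math
import random

def keygen(p, q):
    """Textbook Paillier key generation with g = n + 1; p, q must be distinct primes."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)              # invertible since gcd(lam, n) = 1 here
    return (n, n + 1), (n, lam, mu)   # (public key, secret key)

def encrypt(pk, m, rnd=random.Random(1)):
    n, g = pk
    r = rnd.randrange(1, n)           # toy code: gcd(r, n) check omitted
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(sk, c):
    n, lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

def he_add(pk, c1, c2):
    """Homomorphic addition: E(a) * E(b) mod n^2 decrypts to a + b."""
    n, _ = pk
    return (c1 * c2) % (n * n)

# The session initiator publishes pk; party i sends its encrypted statistics,
# and party i+1 adds its own entries without ever seeing O_i in the clear.
pk, sk = keygen(999983, 1000003)
O_i, O_next = [[4, 2], [2, 9]], [[1, 0], [0, 3]]
M_i = [[encrypt(pk, v) for v in row] for row in O_i]
M_next = [[he_add(pk, M_i[r][c], encrypt(pk, O_next[r][c]))
           for c in range(2)] for r in range(2)]
total = [[decrypt(sk, M_next[r][c]) for c in range(2)] for r in range(2)]
```

Only the session initiator, holding the secret key, can recover the aggregated matrix; intermediate parties see ciphertexts that are indistinguishable from fresh encryptions, which is the property the simulation argument relies on.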
While the individual raw data of members does not leak, the risk of inappropriate disclosures from local summary statistics exists in some extreme cases [14]. Consider the exchange of a plaintext matrix V_i = XᵀY between two parties; a party may use the extreme values found in V_i to identify particular patients. For instance, in the dose algorithm, taking enzyme inducers such as Rifadin and Dilantin could indicate high dose prescriptions. If the values in V_i are high, then a party may infer a patient that takes enzyme inducers and the presence of high-dosage warfarin intake. Similarly, the exchange of O_i = XᵀX may leak information about the number of observations and reveal the number of 0s or 1s in a column. For instance, for the former, the first entry of the matrix XᵀX gives the total number of patients; for the latter, (XᵀX)_{j,j} gives the number of 1s in column j. This type of information lets a party infer knowledge, particularly when binary inputs (e.g., use of a medicine) are used.
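The count-leakage observation is easy to reproduce. In this small sketch (our synthetic data, with an assumed all-ones intercept column as the first feature), the first diagonal entry of XᵀX is the number of patients, and the diagonal entry for a binary column is its number of 1s:

```python
import numpy as np

rng = np.random.default_rng(7)
n_patients = 200
intercept = np.ones((n_patients, 1))               # intercept (bias) column
on_inducer = rng.integers(0, 2, (n_patients, 1))   # binary input: takes an enzyme inducer?
X = np.hstack([intercept, on_inducer])

G = X.T @ X                       # the shared statistic O = X^T X
patients_total = int(G[0, 0])     # first entry: total number of patients
inducer_count = int(G[1, 1])      # (X^T X)_{j,j}: number of 1s in binary column j
```

Even without any raw rows, a recipient of G learns the cohort size and the prevalence of every binary attribute, which is exactly the disclosure risk discussed above.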
D CURIE DEPLOYMENT DETAILS

We use a dataset collected by the International Warfarin Pharmacogenetics Consortium (IWPC), to date the most comprehensive database of patient data, collected from 24 medical institutions in 9 countries [26]. The dataset does not include the names of the medical institutions, yet a separate ethnicity dataset is provided for identifying the genomic impacts of the algorithm. We use the race (reported by patients) and race categories (defined by the Office of Management and Budget) to predict the country of a patient.³ For instance, we consider that a medical institution with a high number of patients of Japanese race is located in Japan. We use subsets of patient records that have no missing inputs for accurate evaluation. We split the dataset into two cohorts: the training cohort is used to learn the algorithm, and the validation cohort is used to assign doses to new patients based on the consortia and data exchange policies.

³ The authors indicated via personal communication that they cannot provide the exact names of the institutions due to privacy concerns.