Asymmetric Private Set Intersection with Applications to Contact Tracing and Private Vertical Federated Machine Learning

Nick Angelou
Morfix / OpenMined
angelou.nick@gmail.com
Ayoub Benaissa
École Supérieure en Informatique, Sidi Bel Abbès / OpenMined
a.benaissa@esi-sba.dz
Bogdan Cebere
Bitdefender / OpenMined
bogdan.cebere@gmail.com
William Clark
OpenMined
will@willclark.tech
Adam James Hall
Edinburgh Napier University / OpenMined
adam@openmined.org
Michael A. Hoeh
apheris AI
m.hoeh@apheris.com
Daniel Liu
University of California Los Angeles / OpenMined
daniel.liu02@gmail.com
Pavlos Papadopoulos
Edinburgh Napier University / apheris AI
pavlos.papadopoulos@napier.ac.uk
Robin Roehm
apheris AI
r.roehm@apheris.com
Robert Sandmann
apheris AI
r.sandmann@apheris.com
Phillipp Schoppmann
Humboldt-Universität zu Berlin / OpenMined
schoppmann@informatik.hu-berlin.de
Tom Titcombe
Tessella / OpenMined
thomas.titcombe@tessella.com
Abstract
We present a multi-language, cross-platform, open-source library for asymmetric
private set intersection (PSI) and PSI-Cardinality (PSI-C). Our protocol combines
traditional DDH-based PSI and PSI-C protocols with compression based on Bloom
filters that helps reduce communication in the asymmetric setting. Currently, our
library supports C++, C, Go, WebAssembly, JavaScript, Python, and Rust, and
runs on both traditional hardware (x86) and browser targets. We further apply our
library to two use cases: (i) a privacy-preserving contact tracing protocol that is
compatible with existing approaches, but improves their privacy guarantees, and
(ii) privacy-preserving machine learning on vertically partitioned data.
1 Introduction
In recent years, preserving privacy has become a fundamental requirement for any organization. Huge
amounts of data are generated by social media, smart devices, or service providers (hospitals, banks,
etc.), and analyzing them in a privacy-preserving manner is not easy. Lately, privacy-preserving
techniques have emerged, including federated learning [29, 49], secure multi-party computation [19],
or homomorphic encryption [18, 32], which lead a much-needed privacy revolution.
NeurIPS 2020 Workshop on Privacy Preserving Machine Learning (PPML 2020).
arXiv:2011.09350v1 [cs.CR] 18 Nov 2020
Private Set Intersection (PSI) is a multi-party computation cryptographic protocol that allows two
parties, each holding a set, to compute the intersection [6, 7, 16, 22, 33] or the cardinality of the
intersection [8, 41] by comparing encrypted versions of these sets. Recently, several real-world
applications of PSI and its variants have emerged. Ion et al. [24] applied a PSI-Sum protocol to
advertising conversions aggregation, trying to solve the privacy problem between the ad supplier,
which knows the users that have seen a particular ad, and the company, which knows who made a
purchase and what they spent. Buddhavarapu et al. [4] improved these results with a new library and
implementation, Private-ID, with applications in randomized controlled trials or machine learning
training. Other applications include private contact discovery [9] or plagiarism detection [23]. Finally,
there are several research implementations of PSI protocols [6, 26, 28, 33, 35]. However, all of
these have in common that they focus on a single setting, and only provide bindings for a single
programming language. In contrast, our goal is to enable the application of PSI in further use cases
by providing an open-source library that is easy to use across language barriers and platforms.
1.1 Contributions
We present a versatile open-source library for asymmetric private set intersection (PSI) and PSI-
Cardinality (PSI-C). Our library combines the well-studied PSI protocol based on the decisional
Diffie-Hellman (DDH) assumption with a Bloom filter compression to reduce the communication
complexity. Our library supports bindings for C++, C, Go, JavaScript, Python, and Rust, as well
as multiple platforms including browser targets. To the best of our knowledge, ours is the first PSI
library to support such a wide set of languages. Our implementation is available under an open source
license at https://github.com/OpenMined/PSI.
We describe the details of our architecture in Section 3. Then, in Section 4, the practicality of our
library is exemplified with two use cases: privacy-preserving contact tracing (Section 4.1) based
on PSI-Cardinality, and secure matching of vertically partitioned data (Section 4.2) based on PSI.
Finally, in Section 5, we provide an experimental evaluation of our library.
2 Protocol Description
Protocol 1 and Figure 2 in the appendix describe our variant of the DDH-based PSI and PSI-C
protocols. The main difference to the presentation in Trieu et al. [41, Figure 4] is that our protocol
uses a Bloom filter to compress the server’s encrypted data set. As observed in previous work [28],
this can dramatically reduce communication in the asymmetric case. Thus, our protocol can be seen
as a hybrid of the works of Trieu et al. [41] and Kiss et al. [28]. We note that Bloom filters were
chosen here for simplicity, and can be replaced by smaller data structures such as cuckoo filters [14]
or Golomb-compressed sets [34]. We also implement a more communication-efficient variant of our
protocol using the latter. While we don’t provide a full evaluation for that variant, we show the size
differences between the data structures in Appendix B.
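To make the compression step concrete, the following is a minimal Bloom filter sketch in Python. It
is an illustration only and not the library's implementation: the class name BloomFilter, the
double-hashing scheme built on SHA-256, and the textbook sizing formulas are all assumptions
introduced here.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter; hypothetical helper, not the OpenMined PSI implementation."""

    def __init__(self, capacity: int, fpr: float):
        # Textbook sizing: m = -n ln(fpr) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
        self.num_bits = max(1, math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k indices from the two halves of one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(item))


bf = BloomFilter(capacity=1000, fpr=1e-6)
for i in range(1000):
    bf.add(f"server-{i}".encode())
```

Every inserted element is always reported as present; elements that were never inserted are
reported as present only with small probability. In the protocol, the inserted items would be the
server's encrypted elements rather than raw strings.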
2.1 Security
The security of our protocol follows directly from previous work. See for example De Cristofaro
et al. [8] for a security proof for the PSI-C variant without Bloom filters. Since our main difference
is computing a Bloom filter with public parameters, which can be trivially simulated, security of
Protocol 1 follows.
Note that after the setup phase, communication only depends on $n$, the client’s input size. Thus, our
protocol is especially useful when the server set is static and can be reused across multiple queries.
For the PSI case, this can be done in a straightforward way, since the server’s secret key $k$ is never
revealed. For PSI-C, we need to be a bit more careful, since repeated queries may make it easier for
the client to learn the elements in the intersection in addition to the intersection size. For our contact
tracing application (Section 4.1), we therefore suggest employing rate limiting and rotating the server
key by re-running the setup phase periodically.
Protocol 1: Our variant of the DDH-based PSI protocol. Adapted from Trieu et al. [41, Figure 4].
Parameters: A cyclic group $G$ of order $p$, a hash function $H$ that maps inputs to $G$, and a
boolean flag RevealIntersection.
Inputs: Server: a set $X = \{x_1, \ldots, x_N\}$, false-positive rate $p$.
Client: a set $Y = \{y_1, \ldots, y_n\}$.
Server Setup:
1. Server chooses a random key $k \in \mathbb{Z}_p$.
2. For each $i \in [N]$, the server computes $u_i = H(x_i)^k$.
3. Server inserts $\{u_i \mid i \in [N]\}$ into a Bloom filter BF such that $n$ queries give a false positive
with probability at most $p$, and sends BF to the client.
Protocol:
1. Client randomly samples $r \in \mathbb{Z}_p$, and for each $y_i \in Y$ sends $m_i = H(y_i)^r$ to the server.
2. For each $i \in [n]$, the server computes $m'_i = m_i^k$.
3. If RevealIntersection is true, the server sends $\{m'_i \mid i \in [n]\}$ to the client ordered by $i$,
otherwise ordered by $m'_i$.
4. For each $i \in [n]$, the client computes $v_i = (m'_i)^{1/r}$.
5. Client queries the Bloom filter for each $v_i$ and computes $S = \{i \in [n] \mid v_i \in \mathrm{BF}\}$. If
RevealIntersection is true, the client outputs $S$, otherwise $|S|$.
3 Architecture
The main focus of the library is on performance and versatility. To achieve that, the core of the library
is built using C++, and our other language bindings all use this C++ core. We use the elliptic curve
P-256 to instantiate group $G$ in Protocol 1, relying on the implementation of Ion et al. [24]. For
message serialization, we use Protocol Buffers [2], and our project is built using Bazel [1] to support
builds across languages and platforms.
Figure 1 in the appendix depicts our library’s components and their interdependencies. We use our C
bindings to interface with languages like Go and Rust that don’t have a native C++ interface. For
Python and JavaScript, we use the C++ core library directly.
4 Applications
4.1 Privacy-Preserving Contact Tracing
Smartphone-based contact tracing makes it possible to notify people who may have been exposed to
infections, thereby allowing them to self-isolate or seek treatment. However, contact tracing data poses
risks of discrimination based on health status, or location leakage [3]. Hence, preserving the privacy of
people’s data is key for widespread adoption, especially considering that stopping a pandemic such
as COVID-19 requires adoption by approximately 60% of the whole population [42].
Several approaches for contact tracing have been developed this year, with similar patterns: Data
of infected people is collected on a central server (e.g. by a national health authority). Meanwhile,
people’s smartphones exchange IDs with each other and record in a local list which other IDs they
have recently been in contact with (e.g. based on Bluetooth), regularly download a list of infected
people’s IDs from the server, and compute the intersection of the server and local ID lists to find out
whether they have been exposed. Notable protocols include TCN [39] and DP3T [11].
However, these protocols do not protect against linkage attacks, as essentially the server provides a
list of infected IDs to the clients. This gives rise to the possibility of re-identifying infected persons,
e.g. by recording IDs of people and using additional information (e.g., from cameras on the street).
Linkage attacks can be mitigated by using PSI-C for computing the intersection. Instead of the
server providing a list of infected IDs, only the intersection size is computed by following Protocol 1.
Hence no list of infected IDs is provided by the server; rather, both server and client encrypt their
sets and jointly compute the intersection size. Using PSI-C for contact tracing has also been proposed
in concurrent work [41]. To show the feasibility of our approach, and in particular its compatibility
with existing protocols, we implement a privacy-preserving contact tracing library on top of the TCN
protocol [5].
4.2 Privacy-Preserving Machine Learning on Vertically Partitioned Data
Vertical Federated Learning (VFL) [15, 30, 49] applies federated learning [25] to vertically distributed
data, i.e., datasets that share partial information about the same entity, differing in the features of
each dataset. Different hospitals, for example, may have differing data about the same patient, but
cannot simply merge this data across institutions due to privacy reasons. For such situations, many
approaches have been suggested in the past [12, 13, 17, 27, 37, 43, 48], including logistic
regression [21, 31].
One approach for such scenarios is Split Learning (SL) [20, 44, 45] using a Split Neural Network
(SplitNN), where the Neural Network (NN) is split among participants, and each model segment acts
as a self-contained NN. Each model segment trains and forwards its result to the next segment until
completion. The security of SplitNN and its information leakage have been questioned [25], but an
enhanced privacy-preserving variant of SplitNN has been proposed in which the information leakage
is reduced using distance correlation [38, 46, 47]. Still, additional privacy-preserving methods could
be incorporated in SL when dealing with very sensitive datasets.
In our work, we have realized a privacy-preserving implementation of SplitNN trained on vertically
distributed data, called PyVertical [40]. In our proof-of-concept (Appendix C), we first utilise PSI
for identifying matching entries in two vertically distributed datasets in a privacy-preserving manner,
and then train a SplitNN on this data, thereby ensuring the privacy of the raw data. PyVertical is
built upon the PySyft library [36], which provides security features and mechanisms for training
without sacrificing data privacy. Even though SplitNN does not provide formal security guarantees,
PSI allows us to hide those data points that are not part of the intersection. Hence PSI is beneficial
for just about any computation on vertically partitioned data (secure or not secure) if the parties want
to hide elements that are not in the intersection.
5 Evaluation and Conclusion
We benchmark our PSI library on an Amazon EC2 T3a.xlarge instance with 4 vCPUs at 2.5 GHz
and 16 GiB memory. Our results are presented in Table 1. It can be seen that there is little difference
between languages that bind directly to our C++ core. This is to be expected, as the running time is
dominated by elliptic curve operations, compared to which the overhead of crossing language barriers
(including protobuf serialization) is small. This changes when comparing WebAssembly and pure
JavaScript, which rely on cross-compilation of our library using Emscripten. Still, the overhead in
modern browsers is less than 10x compared to the native version, and on the client side stays under 3
minutes for all sizes we tested.
The closest libraries to ours in previous work are the implementations of Kiss et al. [28] and Kales
et al. [26]. Both focus on PSI for mobile applications, for example private contact discovery. We
compare our results with the ECC-NR-PSI protocol from Kales et al. [26] for $N = 1\mathrm{M}$, $n = 1\mathrm{k}$.
As noted there, this protocol already outperforms Kiss et al. [28]. As can be seen in Table 1, the
size of the setup message is smaller for Kales et al. [26]. This is to be expected due to the fact that
they use more efficient cuckoo filters [14], and our library has an additional overhead due to our
use of protocol buffers. However, in both the online phase and in terms of total communication,
our implementation improves on the results of Kales et al. [26]. In Appendix B we also present
experiments on a more communication-efficient variant of our protocol using Golomb-compressed
sets. We stress that the online times from [26] are not comparable to ours, since their experiments
were run on different hardware, in particular using a smartphone for the client.
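The setup sizes reported in Table 1 can be approximated with the textbook Bloom filter sizing
formula. The sketch below rests on two assumptions that the text does not state explicitly: the
overall false-positive probability p over n lookups is split evenly across lookups (a union bound,
giving a per-lookup rate of p/n), and the filter uses the optimal number of hash functions. The
function name bloom_setup_mib is hypothetical.

```python
import math


def bloom_setup_mib(num_server: int, num_client: int, fpr_total: float) -> float:
    """Estimated Bloom filter size in MiB for N server items and n client lookups."""
    per_lookup_fpr = fpr_total / num_client  # union bound over n lookups
    bits = -num_server * math.log(per_lookup_fpr) / math.log(2) ** 2
    return bits / 8 / 2**20


for n in (1_000, 10_000, 100_000):
    print(f"n = {n}: {bloom_setup_mib(1_000_000, n, 1e-9):.2f} MiB")
```

Under these assumptions the estimates come out at roughly 6.86, 7.43, and 8.00 MiB, which is close
to the setup message sizes in Table 1 and shows why the setup size grows only logarithmically in n.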
In conclusion, our results show that our library is highly competitive in terms of running time and
communication. At the same time, it is flexible enough to support multiple platforms and languages,
including browsers. Possible improvements can be made by reducing the setup communication, for
example using cuckoo filters [14, 26]. Finally, we see extensions like Private-ID [4] as promising
future work.
Table 1: Benchmarks in seconds for $n$ client elements and $N = 1\mathrm{M}$ server elements. The probability
of a false positive over $n$ lookups was set to $p = 10^{-9}$. An asterisk (*) indicates experiments that
were run on a smaller server set ($N = 100\mathrm{k}$) and extrapolated from there. Numbers for previous
work were taken from [26, Table 6]. Cells with dagger (†) were summed up in [26], so we only report
the sum.
Operation             Size      C++     C       Go      Python  WebAssembly  JS       [26]    Comm.       [26] (Comm.)
Server: Setup         n = 1k    184.1   181.5   183.9   188.9   1573.8*      9128*    241.54  6.85 MiB    4.19 MiB
                      n = 10k   184.2   181.7   184.0   188.4   1571.8*      9273*    -       7.42 MiB    -
                      n = 100k  184.2   181.8   184.3   189.1   1573*        9139.1*  -       7.99 MiB    -
Client: Request       n = 1k    0.18    0.17    0.18    0.185   1.57         9.1      2.92†   34.18 KiB   2.00 MiB
                      n = 10k   1.8     1.77    1.8     1.870   15.6         92       -       341.79 KiB  -
                      n = 100k  18.09   17.8    18.2    18.2    156.8        909.4    -       3.33 MiB    -
Server: Response      n = 1k    0.11    0.11    0.17    0.115   1.2          6.96     †       34.17 KiB   4.07 MiB
                      n = 10k   1.13    1.13    3.02    1.154   12           70.8     -       341.79 KiB  -
                      n = 100k  11.3    11.3    29.5    11.5    120.9        701      -       3.33 MiB    -
Client: Intersection  n = 1k    0.11    0.11    0.9     0.120   1.2          6.97     †       -           -
                      n = 10k   1.17    1.16    4.6     1.193   12.1         71.39    -       -           -
                      n = 100k  11.8    11.6    41.6    11.9    122          710      -       -           -
Acknowledgements
Realizing both the PSI library and the two implementations for contact tracing and vertically
partitioned federated learning has only been possible with the help of the many people who contributed
to them. We extend our deepest gratitude to everybody who has been and continues to be involved in
this effort.
References
[1] Bazel – a fast, scalable, multi-language and extensible build system. URL https://bazel.build.
[2] Protocol buffers – Google’s data interchange format. URL
https://github.com/protocolbuffers/protobuf.
[3] William J Buchanan, Muhammad Ali Imran, Masood Ur-Rehman, Lei Zhang, Qammer H
Abbasi, Christos Chrysoulas, David Haynes, Nikolaos Pitropakis, and Pavlos Papadopoulos.
Review and critical analysis of privacy-preserving infection tracking and contact tracing. arXiv
preprint arXiv:2009.05126, 2020.
[4] Prasad Buddhavarapu, Andrew Knox, Payman Mohassel, Shubho Sengupta, Erik Taubeneck,
and Vlad Vlaskin. Private matching for compute. IACR Cryptol. ePrint Arch., 2020:599, 2020.
[5] Bogdan Cebere. TCN protocol based on Private Set Intersection Cardinality, 2020. URL
https://github.com/bcebere/TCN-PSI.
[6] Melissa Chase and Peihan Miao. Private set intersection in the internet setting from lightweight
oblivious prf. In Annual International Cryptology Conference, pages 34–63. Springer, 2020.
[7] Emiliano De Cristofaro and Gene Tsudik. Practical private set intersection protocols with linear
complexity. In International Conference on Financial Cryptography and Data Security, pages
143–159. Springer, 2010.
[8] Emiliano De Cristofaro, Paolo Gasti, and Gene Tsudik. Fast and private computation of
cardinality of set intersection and union. In International Conference on Cryptology and
Network Security, pages 218–231. Springer, 2012.
[9] Daniel Demmler, Peter Rindal, Mike Rosulek, and Ni Trieu. Pir-psi: Scaling private contact
discovery. IACR Cryptol. ePrint Arch., 2018.
[10] Li Deng. The mnist database of handwritten digit images for machine learning research [best of
the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[11] DP3T. DP3T - Decentralized Privacy-Preserving Proximity Tracing, 2020. URL
https://github.com/DP-3T/documents.
[12] Wenliang Du and Mikhail J Atallah. Privacy-preserving cooperative statistical analysis. In
Seventeenth Annual Computer Security Applications Conference, pages 102–110. IEEE, 2001.
[13] Wenliang Du, Yunghsiang S Han, and Shigang Chen. Privacy-preserving multivariate statistical
analysis: Linear regression and classification. In Proceedings of the 2004 SIAM international
conference on data mining, pages 222–233. SIAM, 2004.
[14] Bin Fan, David G. Andersen, Michael Kaminsky, and Michael Mitzenmacher. Cuckoo filter:
Practically better than bloom. In CoNEXT, pages 75–88. ACM, 2014.
[15] Siwei Feng and Han Yu. Multi-participant multi-class vertical federated learning. arXiv preprint
arXiv:2001.11154, 2020.
[16] Michael J Freedman, Kobbi Nissim, and Benny Pinkas. Efficient private matching and set
intersection. In International conference on the theory and applications of cryptographic
techniques, pages 1–19. Springer, 2004.
[17] Adrià Gascón, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee
Zahur, and David Evans. Privacy-preserving distributed linear regression on high-dimensional
data. Proceedings on Privacy Enhancing Technologies, 2017(4):345–364, 2017.
[18] Craig Gentry. Fully homomorphic encryption using ideal lattices. In Proceedings of the
forty-first annual ACM symposium on Theory of computing, pages 169–178, 2009.
[19] Oded Goldreich. Secure multi-party computation. Manuscript. Preliminary version, 78, 1998.
[20] Otkrist Gupta and Ramesh Raskar. Distributed learning of deep neural network over multiple
agents. Journal of Network and Computer Applications, 116:1–8, 2018.
[21] Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Richard Nock, Giorgio Patrini, Guillaume
Smith, and Brian Thorne. Private federated learning on vertically partitioned data via entity
resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677, 2017.
[22] Yan Huang, David Evans, and Jonathan Katz. Private set intersection: Are garbled circuits
better than custom protocols? In NDSS, 2012.
[23] Cornelius Ihle, Moritz Schubotz, Norman Meuschke, and Bela Gipp. A first step towards content
protecting plagiarism detection. JCDL ’20: Proceedings of the ACM/IEEE Joint Conference on
Digital Libraries in 2020, 2020.
[24] Mihaela Ion, Ben Kreuter, Ahmet Erhan Nergiz, Sarvar Patel, Mariana Raykova, Shobhit
Saxena, Karn Seth, David Shanahan, and Moti Yung. On deploying secure computing: Private
intersection-sum-with-cardinality. In EuroS&P. IEEE, 2020.
[25] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun
Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings,
Rafael G. L. D’Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett,
Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaïd Harchaoui, Chaoyang
He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri
Joshi, Mikhail Khodak, Jakub Konecný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi
Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer
Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn
Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr,
Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu,
and Sen Zhao. Advances and open problems in federated learning. CoRR, abs/1912.04977,
2019.
[26] Daniel Kales, Christian Rechberger, Thomas Schneider, Matthias Senker, and Christian Weinert.
Mobile private contact discovery at scale. In USENIX Security Symposium, pages 1447–1464.
USENIX Association, 2019.
[27] Alan F Karr, Xiaodong Lin, Ashish P Sanil, and Jerome P Reiter. Privacy-preserving analysis
of vertically partitioned data using secure matrix products. Journal of Official Statistics, 25(1):
125, 2009.
[28] Ágnes Kiss, Jian Liu, Thomas Schneider, N Asokan, and Benny Pinkas. Private set intersection
for unequal set sizes with mobile applications. Proceedings on Privacy Enhancing Technologies,
2017(4):177–197, 2017. URL https://encrypto.de/papers/KLSAP17.pdf.
[29] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh,
and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv
preprint arXiv:1610.05492, 2016.
[30] Yang Liu, Xiong Zhang, and Libin Wang. Asymmetrically vertical federated learning. arXiv
preprint arXiv:2004.07427, 2020.
[31] Richard Nock, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Giorgio Patrini, Guillaume
Smith, and Brian Thorne. Entity resolution and federated learning get a federated resolution.
arXiv preprint arXiv:1803.04035, 2018.
[32] L. T. Phong, Y. Aono, T. Hayashi, L. Wang, and S. Moriai. Privacy-preserving deep learning via
additively homomorphic encryption. IEEE Transactions on Information Forensics and Security,
13(5):1333–1345, 2018.
[33] Benny Pinkas, Thomas Schneider, and Michael Zohner. Scalable private set intersection based
on ot extension. ACM Transactions on Privacy and Security (TOPS), 21(2):1–35, 2018.
[34] Felix Putze, Peter Sanders, and Johannes Singler. Cache-, hash-, and space-efficient bloom
filters. ACM J. Exp. Algorithmics, 14, 2009.
[35] Peter Rindal. libpsi. URL https://github.com/osu-crypto/libPSI.
[36] Theo Ryffel, Andrew Trask, Morten Dahl, Bobby Wagner, Jason Mancuso, Daniel Rueckert,
and Jonathan Passerat-Palmbach. A generic framework for privacy preserving deep learning.
arXiv preprint arXiv:1811.04017, 2018.
[37] Ashish P Sanil, Alan F Karr, Xiaodong Lin, and Jerome P Reiter. Privacy preserving regression
modelling via distributed computation. In Proceedings of the tenth ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 677–682, 2004.
[38] Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence
by correlation of distances. Ann. Statist., 35(6):2769–2794, 12 2007. doi:
10.1214/009053607000000505. URL https://doi.org/10.1214/009053607000000505.
[39] TCNCoalition. TCN Protocol, 2020. URL https://github.com/TCNCoalition/TCN.
[40] Tom Titcombe, Pavlos Papadopoulos, Adam Hall, and Robert Sandmann. PyVertical, 2020.
URL https://github.com/OpenMined/PyVertical.
[41] Ni Trieu, Kareem Shehata, Prateek Saxena, Reza Shokri, and Dawn Song. Epione: Lightweight
contact tracing with strong privacy. IEEE Data Eng. Bull., 43(2):95–107, 2020.
[42] Oxford University. Digital contact tracing can slow or even stop coronavirus transmission
and ease us out of lockdown, 2020. URL
https://www.research.ox.ac.uk/Article/2020-04-16-digital-contact-tracing-can-slow-or-even-stop-coronavirus-transmission-and-ease-us-out-of-lockdown.
[43] Jaideep Vaidya and Chris Clifton. Privacy preserving association rule mining in vertically
partitioned data. In Proceedings of the eighth ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 639–644, 2002.
[44] Praneeth Vepakomma, Otkrist Gupta, Tristan Swedish, and Ramesh Raskar. Split learning
for health: Distributed deep learning without sharing raw patient data. arXiv preprint
arXiv:1812.00564, 2018.
Figure 1: PSI library high-level architecture and language dependencies
[45] Praneeth Vepakomma, Tristan Swedish, Ramesh Raskar, Otkrist Gupta, and Abhimanyu Dubey.
No peek: A survey of private distributed deep learning, 2018.
[46] Praneeth Vepakomma, Chetan Tonde, Ahmed Elgammal, et al. Supervised dimensionality
reduction via distance correlation maximization. Electronic Journal of Statistics, 12(1):960–984,
2018.
[47] Praneeth Vepakomma, Otkrist Gupta, Abhimanyu Dubey, and Ramesh Raskar. Reducing
leakage in distributed deep learning for sensitive health data. arXiv preprint arXiv:1812.00564,
2019.
[48] Li Wan, Wee Keong Ng, Shuguo Han, and Vincent CS Lee. Privacy-preservation for gradient
descent methods. In Proceedings of the 13th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 775–783, 2007.
[49] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept
and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19,
2019.
A Protocol Flow and Library Architecture
The high-level flow of the PSI library is shown in Figure 2. Its high-level architecture and the
associated language dependencies are shown in Figure 1.
B Communication Size
For many use cases of PSI, the server must send a large set of setup data to the client. Therefore, in
addition to Bloom filters, we also implement Golomb coding [34] for better compression of the setup
set. We will refer to this data structure as Golomb-compressed sets (GCS). We compare the sizes of
Golomb-compressed sets and Bloom filters that need to be sent to the client in Table 2. Internally,
the GCS implementation uses a one-pass encoding scheme on the server side, and an on-the-fly
decoding scheme for bulk intersection on the client side, so it is similar in speed to Bloom filters for
bulk insertion and intersection. However, Bloom filters are faster at random accesses and insertions
of a small number of elements. Therefore, it is important to weigh the costs of communication against
the costs of random access lookups when deciding which underlying data structure to use.
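The idea behind a Golomb-compressed set can be sketched in a few lines of Python. This is a
simplified illustration under stated assumptions, not the library's implementation: it uses the
Rice variant of Golomb coding (divisor restricted to a power of two), represents the bit stream as
a string of "0" and "1" characters for readability, and all function names are hypothetical.

```python
import hashlib


def hash_to_range(item: bytes, range_size: int) -> int:
    return int.from_bytes(hashlib.sha256(item).digest(), "big") % range_size


def gcs_encode(items, rice_bits: int) -> str:
    """Encode items as a Golomb-Rice coded set with false-positive rate near 2**-rice_bits."""
    range_size = len(items) << rice_bits
    values = sorted({hash_to_range(x, range_size) for x in items})
    out, prev = [], 0
    for v in values:
        # One pass over the sorted hashes: store each gap as quotient and remainder.
        quotient, remainder = divmod(v - prev, 1 << rice_bits)
        out.append("1" * quotient + "0")                 # unary-coded quotient
        out.append(format(remainder, f"0{rice_bits}b"))  # fixed-width remainder
        prev = v
    return "".join(out)


def gcs_decode(bitstring: str, rice_bits: int):
    """Yield the sorted hash values one by one (on-the-fly decoding)."""
    pos, prev = 0, 0
    while pos < len(bitstring):
        quotient = 0
        while bitstring[pos] == "1":
            quotient += 1
            pos += 1
        pos += 1  # skip the terminating "0"
        remainder = int(bitstring[pos:pos + rice_bits], 2)
        pos += rice_bits
        prev += (quotient << rice_bits) + remainder
        yield prev


items = [f"element-{i}".encode() for i in range(1000)]
encoded = gcs_encode(items, rice_bits=20)
decoded = set(gcs_decode(encoded, rice_bits=20))
bits_per_element = len(encoded) / len(items)
```

Because the values can only be recovered by walking the stream from the start, a single random
lookup costs a full decode, whereas a bulk intersection can stream through the sorted values once.
This mirrors the trade-off against Bloom filters described above.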
Figure 2: PSI library high-level flow.
Table 2: Communication sizes for Bloom filters, Golomb-compressed sets, and naively storing 64-bit
integers with varying false-positive rates. We insert $10^4$ elements and calculate the size of each data
structure.
FPR          Naive   Bloom filter   GCS
$10^{-6}$    80 KB   36 KB          27 KB
$10^{-7}$    80 KB   42 KB          31 KB
$10^{-8}$    80 KB   48 KB          35 KB
$10^{-9}$    80 KB   54 KB          39 KB
$10^{-10}$   80 KB   60 KB          43 KB
$10^{-11}$   80 KB   66 KB          48 KB
$10^{-12}$   80 KB   72 KB          52 KB
C PyVertical proof-of-concept process
For our proof-of-concept of PyVertical we used the MNIST dataset of handwritten images and their
labels [10]. The overall process is shown in Figure 3. MNIST is a horizontally distributed dataset. As
shown in Figure 3 (a), we have added a new ID field to this data which assigns a unique ID to each
data point. Furthermore, as shown in Figure 3 (b), we split the full dataset into two datasets: one with
the handwritten images and their IDs, and one with the labels and their IDs. In the next step, we
shuffle the datasets and randomly remove data points from both.
In the actual experiment, we utilise PSI to identify matching IDs in the two datasets, as shown in
Figure 3 (c), link the data points with shared IDs, and order both datasets accordingly. Data
points that are not elements of the intersection are purged in this example. Finally, we create a
SplitNN, as shown in Figure 3 (d), where one part of the network is trained on the images dataset
and the other part of the network is trained on the labels dataset. This allows training the model in a
privacy-preserving manner by leveraging PySyft’s pointers functionality [36], which keeps the raw
training data hidden throughout the training process.
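The data preparation and linkage steps described above can be sketched as follows. This is a
simplified illustration, not PyVertical's code: a plain set intersection stands in for the PSI
protocol, synthetic strings stand in for MNIST images, and the function name align_by_ids is
hypothetical.

```python
import random


def align_by_ids(images: dict, labels: dict):
    """Keep only records whose ID appears in both datasets, in a shared order."""
    # A plain set intersection stands in for the PSI protocol here (assumption).
    shared_ids = sorted(images.keys() & labels.keys())
    return [images[i] for i in shared_ids], [labels[i] for i in shared_ids]


rng = random.Random(0)
# Full dataset with an added unique ID per data point: ID -> (image, label).
full = {f"id-{i}": (f"image-{i}", i % 10) for i in range(100)}

# Split vertically, then randomly drop records from each party's dataset.
images = {k: v[0] for k, v in full.items() if rng.random() > 0.2}
labels = {k: v[1] for k, v in full.items() if rng.random() > 0.2}

aligned_images, aligned_labels = align_by_ids(images, labels)
```

After alignment, the record at each position of both lists belongs to the same ID, which is the
property the SplitNN training step relies on.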
(a) Full dataset (b) Split images and labels datasets
(c) PSI linkage and ordering (d) SplitNN training
Figure 3: PyVertical proof-of-concept implementation
10
... (3) Preliminaries: Fig. 4 shows the basic PSI protocol adapted from Angelou et al. (2020). The protocol combines Diffie-Hellman (DDH), based PSI, and PSI-Cardinality; and uses Bloom filter compression to minimize communication time. ...
... "uj" corresponds to the item mj (Trieu et al., 2020). On the other hand, SMDPA then generates private key r, hashed its datasets, and then calculates Dj = (mi) r for each hashed value and sends the result to AGS, which uses its private key to compute Dj', (Angelou et al., 2020;Trieu et al., 2020). Finally, this protocol is extended to reveal only the intersection size based on the Paillier cryptosystem. ...
Article
Full-text available
The aim of this article is to identify a range of changes and challenges that present-day technologies often present to contemporary societies, particularly in the context of smart city logistics, especially during crises. For example, the long-term consequences of the COVID-19 pandemic, such as life losses, economic damages, and privacy and security violations, demonstrate the extent to which the existing designs and deployments of technological means are inadequate. The article proposes a privacy-preserving, decentralized, secure protocol to safeguard individual boundaries and supply governments and public health organizations with cost-effective information, particularly regarding vaccination. The contribution of this article is threefold: (i) conducting a systematic review of most of the privacy-preserving apps and their protocols created during pandemics, and we found that most apps pose security and privacy violations. (ii) Proposing an agent-based, decentralized private set intersection (PSI) protocol for securely sharing individual digital personal and health passport information. The proposed scheme is called secure mobile digital passport agent (SMDPA). (iii) Providing a simulation measurement of the proposed protocol to assess performance. The performance result proves that SMDPA is a practical solution and better than the proposed active data bundles using secure multi-party computation (ADB-SMC), as the average CPU load for SMDPA is approximately 775 milliseconds (ms) compared to about 900 ms for ADB-SMC.
Conference Paper
Mobile messengers like WhatsApp perform contact discovery by uploading the user's entire address book to the service provider. This allows the service provider to determine which of the user's contacts are registered to the messaging service. However, such a procedure poses significant privacy risks and legal challenges. As we find, even messengers with privacy in mind currently do not deploy proper mechanisms to perform contact discovery privately. The most promising approaches addressing this problem revolve around private set intersection (PSI) protocols. Unfortunately, even in a weak security model where clients are assumed to follow the protocol honestly, previous protocols and implementations turned out to be far from practical when used at scale. This is due to their high computation and/or communication complexity as well as lacking optimization for mobile devices. In our work, we remove most obstacles for large-scale global deployment by significantly improving two promising protocols by Kiss et al. (PoPETS'17) while also allowing for malicious clients. Concretely, we present novel precomputation techniques for correlated oblivious transfers (reducing the online communication by factor 2x), Cuckoo filter compression (with a compression ratio of 70%), as well as 4.3x smaller Cuckoo filter updates. In a protocol performing oblivious PRF evaluations via garbled circuits, we replace AES as the evaluated PRF with a variant of LowMC (Albrecht et al., EUROCRYPT'15) for which we determine optimal parameters, thereby reducing the communication by factor 8.2x. Furthermore, we implement both protocols with security against malicious clients in C/C++ and utilize the ARM Cryptography Extensions available in most recent smartphones. Compared to previous smartphone implementations, this yields a performance improvement of factor 1000x for circuit evaluations. 
The online phase of our fastest protocol takes only 2.92s measured on a real WiFi connection (6.53s on LTE) to check 1024 client contacts against a large-scale database with 2^28 entries. As a proof-of-concept, we integrate our protocols in the client application of the open-source messenger Signal.
Article
An important initialization step in many social-networking applications is contact discovery, which allows a user of the service to identify which of its existing social contacts also use the service. Naïve approaches to contact discovery reveal a user’s entire set of social/professional contacts to the service, presenting a significant tension between functionality and privacy. In this work, we present a system for private contact discovery, in which the client learns only the intersection of its own contact list and a server’s user database, and the server learns only the (approximate) size of the client’s list. The protocol is specifically tailored to the case of a small client set and large user database. Our protocol has provable security guarantees and combines new ideas with state-of-the-art techniques from private information retrieval and private set intersection. We report on a highly optimized prototype implementation of our system, which is practical on real-world set sizes. For example, contact discovery between a client with 1024 contacts and a server with 67 million user entries takes 1.36 sec (when using server multi-threading) and uses only 4.28 MiB of communication.
Article
For distributed machine learning with health data, we demonstrate how minimizing distance correlation between raw data and intermediary representations (smashed data) reduces leakage of sensitive raw data patterns during client communications while maintaining model accuracy. Leakage (measured using KL divergence between input and intermediate representation) is the risk associated with the invertibility of intermediary representations, and it can prevent resource-poor health organizations from using distributed deep learning services. We demonstrate that our method reduces leakage in terms of distance correlation between raw data and communication payloads from an order of 0.95 to 0.19 and from 0.92 to 0.33 during training with image datasets while maintaining a similar classification accuracy.
Article
Today’s artificial intelligence still faces two major challenges. One is that, in most industries, data exists in the form of isolated islands. The other is the strengthening of data privacy and security. We propose a possible solution to these challenges: secure federated learning. Beyond the federated-learning framework first proposed by Google in 2016, we introduce a comprehensive secure federated-learning framework, which includes horizontal federated learning, vertical federated learning, and federated transfer learning. We provide definitions, architectures, and applications for the federated-learning framework, and provide a comprehensive survey of existing works on this subject. In addition, we propose building data networks among organizations based on federated mechanisms as an effective solution to allowing knowledge to be shared without compromising user privacy.
Article
In domains such as health care and finance, shortage of labeled data and computational resources is a critical issue while developing machine learning algorithms. To address the issue of labeled data scarcity in training and deployment of neural network-based systems, we propose a new technique to train deep neural networks over several data sources. Our method allows for deep neural networks to be trained using data from multiple entities in a distributed fashion. We evaluate our algorithm on existing datasets and show that it obtains performance which is similar to a regular neural network trained on a single machine. We further extend it to incorporate semi-supervised learning when training with few labeled samples, and analyze any security concerns that may arise. Our algorithm paves the way for distributed training of deep neural networks in data sensitive applications when raw data may not be shared directly.
Article
We present a privacy-preserving deep learning system in which many learning participants perform neural network-based deep learning over a combined dataset of all, without revealing the participants’ local data to a central server. To that end, we revisit the previous work by Shokri and Shmatikov (ACM CCS 2015) and show that, with their method, local data information may be leaked to an honest-but-curious server. We then fix that problem by building an enhanced system with the following properties: (1) no information is leaked to the server; and (2) accuracy is kept intact, compared to that of the ordinary deep learning system also over the combined dataset. Our system bridges deep learning and cryptography: we utilize asynchronous stochastic gradient descent as applied to neural networks, in combination with additively homomorphic encryption. We show that our usage of encryption adds tolerable overhead to the ordinary deep learning system.
Article
Private set intersection (PSI) allows two parties to compute the intersection of their sets without revealing any information about items that are not in the intersection. It is one of the best studied applications of secure computation and many PSI protocols have been proposed. However, the variety of existing PSI protocols makes it difficult to identify the solution that performs best in a respective scenario, especially since they were not compared in the same setting. In addition, existing PSI protocols are several orders of magnitude slower than an insecure naïve hashing solution, which is used in practice. In this article, we review the progress made on PSI protocols and give an overview of existing protocols in various security models. We then focus on PSI protocols that are secure against semi-honest adversaries and take advantage of the most recent efficiency improvements in Oblivious Transfer (OT) extension, propose significant optimizations to previous PSI protocols, and suggest a new PSI protocol whose runtime is superior to that of existing protocols. We compare the performance of the protocols, both theoretically and experimentally, by implementing all protocols on the same platform, give recommendations on which protocol to use in a particular setting, and evaluate the progress on PSI protocols by comparing them to the currently employed insecure naïve hashing protocol. We demonstrate the feasibility of our new PSI protocol by processing two sets with a billion elements each.
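The "insecure naïve hashing solution" mentioned in the abstract above can be sketched in a few lines to show where the weakness usually lies. The function names (`digest`, `naive_hash_psi`, `brute_force`) and the made-up phone-number domain are assumptions chosen for illustration. The sketch assumes the common variant in which one party sends unsalted hashes of its items: when the input domain is small, the receiver can hash every candidate and recover items outside the intersection.

```python
import hashlib


def digest(item: str) -> str:
    """Unsalted hash of an item, as exchanged in naive hashing."""
    return hashlib.sha256(item.encode("utf-8")).hexdigest()


def naive_hash_psi(client_items, server_hashes):
    """Client-side intersection: compare own hashes to the server's hashes."""
    return [x for x in client_items if digest(x) in server_hashes]


def brute_force(server_hashes, candidates):
    """Recover server items by hashing every candidate in a small domain."""
    return {c for c in candidates if digest(c) in server_hashes}


if __name__ == "__main__":
    # Hypothetical low-entropy domain: 100 possible phone numbers.
    domain = [f"+155500000{i:02d}" for i in range(100)]
    server_hashes = {digest(n) for n in ["+15550000007", "+15550000042"]}

    print(naive_hash_psi(["+15550000042", "+15550000099"], server_hashes))
    # The client learns the server's full set, not just the intersection.
    print(sorted(brute_force(server_hashes, domain)))
```

Blinding each hash with a secret exponent, as DDH-based protocols do, is one way to remove this dictionary attack, at the cost of public-key operations per item.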