Available via license: CC BY 4.0
Asymmetric Private Set Intersection with
Applications to Contact Tracing and Private Vertical
Federated Machine Learning
Nick Angelou
Morfix / OpenMined
angelou.nick@gmail.com
Ayoub Benaissa
École Supérieure en Informatique, Sidi Bel Abbès / OpenMined
a.benaissa@esi-sba.dz
Bogdan Cebere
Bitdefender / OpenMined
bogdan.cebere@gmail.com
William Clark
OpenMined
will@willclark.tech
Adam James Hall
Edinburgh Napier University / OpenMined
adam@openmined.org
Michael A. Hoeh
apheris AI
m.hoeh@apheris.com
Daniel Liu
University of California Los Angeles / OpenMined
daniel.liu02@gmail.com
Pavlos Papadopoulos
Edinburgh Napier University / apheris AI
pavlos.papadopoulos@napier.ac.uk
Robin Roehm
apheris AI
r.roehm@apheris.com
Robert Sandmann
apheris AI
r.sandmann@apheris.com
Phillipp Schoppmann
Humboldt-Universität zu Berlin / OpenMined
schoppmann@informatik.hu-berlin.de
Tom Titcombe
Tessella / OpenMined
thomas.titcombe@tessella.com
Abstract
We present a multi-language, cross-platform, open-source library for asymmetric
private set intersection (PSI) and PSI-Cardinality (PSI-C). Our protocol combines
traditional DDH-based PSI and PSI-C protocols with compression based on Bloom
filters that helps reduce communication in the asymmetric setting. Currently, our
library supports C++, C, Go, WebAssembly, JavaScript, Python, and Rust, and
runs on both traditional hardware (x86) and browser targets. We further apply our
library to two use cases: (i) a privacy-preserving contact tracing protocol that is
compatible with existing approaches, but improves their privacy guarantees, and
(ii) privacy-preserving machine learning on vertically partitioned data.
NeurIPS 2020 Workshop on Privacy Preserving Machine Learning (PPML 2020).
arXiv:2011.09350v1 [cs.CR] 18 Nov 2020
1 Introduction
In recent years, preserving privacy has become a fundamental requirement for any organization. Huge
amounts of data are generated by social media, smart devices, and service providers (hospitals, banks,
etc.), and analyzing them in a privacy-preserving manner is not easy. Lately, privacy-preserving
techniques have emerged, including federated learning [29, 49], secure multi-party computation [19],
and homomorphic encryption [18, 32], which are leading a much-needed privacy revolution.
Private Set Intersection (PSI) is a multi-party computation cryptographic protocol that allows two
parties, each holding sets, to compute the intersection [6, 7, 16, 22, 33] or the cardinality of the
intersection [8, 41] by comparing encrypted versions of these sets. Recently, several real-world
applications of PSI and its variants have emerged. Ion et al. [24] applied a PSI-Sum protocol to
advertising conversions aggregation, trying to solve the privacy problem between the ad supplier,
which knows the users that have seen a particular ad, and the company, which knows who made a
purchase and what they spent. Buddhavarapu et al. [4] improved these results with a new library and
implementation, Private-ID, with applications in randomized controlled trials or machine learning
training. Other applications include private contact discovery [9] or plagiarism detection [23]. Finally,
there are several research implementations of PSI protocols [6, 26, 28, 33, 35]. However, all of
these have in common that they focus on a single setting, and only provide bindings for a single
programming language. In contrast, our goal is to enable the application of PSI in further use cases
by providing an open-source library that is easy to use across language barriers and platforms.
1.1 Contributions
We present a versatile open-source library for asymmetric private set intersection (PSI) and PSI-
Cardinality (PSI-C). Our library combines the well-studied PSI protocol based on the decisional
Diffie-Hellman (DDH) assumption with a Bloom filter compression to reduce the communication
complexity. Our library supports bindings for C++, C, Go, JavaScript, Python, and Rust, as well
as multiple platforms including browser targets. To the best of our knowledge, ours is the first PSI
library to support such a wide set of languages. Our implementation is available under an open source
license at https://github.com/OpenMined/PSI.
We describe the details of our architecture in Section 3. Then, in Section 4, the practicality of our
library is exemplified with two use cases: privacy-preserving contact tracing (Section 4.1) based
on PSI-Cardinality, and secure matching of vertically partitioned data (Section 4.2) based on PSI.
Finally, in Section 5, we provide an experimental evaluation of our library.
2 Protocol Description
Protocol 1 and Figure 2 in the appendix describe our variant of the DDH-based PSI and PSI-C
protocols. The main difference from the presentation in Trieu et al. [41, Figure 4] is that our protocol
uses a Bloom filter to compress the server’s encrypted data set. As observed in previous work [28],
this can dramatically reduce communication in the asymmetric case. Thus, our protocol can be seen
as a hybrid of the works of Trieu et al. [41] and Kiss et al. [28]. We note that Bloom filters were
chosen here for simplicity, and can be replaced by smaller data structures such as cuckoo filters [14]
or Golomb-compressed sets [34]. We also implement a more communication-efficient variant of our
protocol using the latter. While we don’t provide a full evaluation for that variant, we show the size
differences between the data structures in Appendix B.
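To make the compression step concrete, the following is a minimal Bloom filter sketch in Python. It
is not the library's implementation: the class name `BloomFilter`, the SHA-256 double-hashing scheme,
and the textbook sizing formulas ($m = -N \ln p / (\ln 2)^2$ bits and $(m/N) \ln 2$ hash functions)
are assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; not the data structure used by the library."""

    def __init__(self, capacity: int, fpr: float):
        # Textbook sizing for `capacity` items at a per-lookup false-positive rate `fpr`.
        self.num_bits = max(1, math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter(capacity=1000, fpr=1e-6)
for i in range(1000):
    bf.add(f"element-{i}".encode())

# A Bloom filter never yields false negatives: every inserted item is found.
all_found = all(f"element-{i}".encode() in bf for i in range(1000))
```

In the protocol, the inserted items would be the serialized group elements rather than strings, and
a lookup that returns true is only correct up to the chosen false-positive rate.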
2.1 Security
The security of our protocol follows directly from previous work. See for example De Cristofaro
et al. [8] for a security proof for the PSI-C variant without Bloom filters. Since our main difference
is computing a Bloom filter with public parameters, which can be trivially simulated, security of
Protocol 1 follows.
Note that after the setup phase, communication only depends on $n$, the client’s input size. Thus, our
protocol is especially useful when the server set is static and can be reused across multiple queries.
For the PSI case, this can be done in a straightforward way, since the server’s secret key $k$ is never
revealed. For PSI-C, we need to be a bit more careful, since repeated queries may make it easier for
the client to learn the elements in the intersection in addition to the intersection size. For our contact
tracing application (Section 4.1), we therefore suggest employing rate limiting and rotating the server
key by re-running the setup phase repeatedly.
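The concern about repeated PSI-C queries can be illustrated with a small sketch. It models the
protocol as an ideal functionality, a hypothetical `psi_cardinality` function that returns only the
intersection size, and shows that a client allowed unlimited queries can recover the intersection
itself by querying one element at a time. All names here are illustrative and not part of the library.

```python
def psi_cardinality(server_set: set, client_set: set) -> int:
    """Ideal PSI-C functionality: the client learns only the intersection size."""
    return len(server_set & client_set)


server_set = {"id-17", "id-42", "id-99"}
client_set = {"id-03", "id-42", "id-77", "id-99"}

# A single honest query reveals only a count.
size = psi_cardinality(server_set, client_set)

# Without rate limiting, one singleton query per element reveals the elements themselves.
recovered = {y for y in client_set if psi_cardinality(server_set, {y}) == 1}
```

Here `size` is 2, while `recovered` contains exactly the two shared IDs, which is the leakage that
rate limiting is meant to bound.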
Protocol 1: Our variant of the DDH-based PSI protocol. Adapted from Trieu et al. [41, Figure 4].
Parameters: A cyclic group $G$ of order $p$, a hash function $H$ that maps inputs to $G$, and a
boolean flag RevealIntersection.
Inputs: Server: a set $X = \{x_1, \ldots, x_N\}$, false-positive rate $p$.
Client: a set $Y = \{y_1, \ldots, y_n\}$.
Server Setup:
1. Server chooses a random key $k \leftarrow \mathbb{Z}_p$.
2. For each $i \in [N]$, the server computes $u_i = H(x_i)^k$.
3. Server inserts $\{u_i \mid i \in [N]\}$ into a Bloom filter BF such that $n$ queries give a false positive
with probability at most $p$, and sends BF to the client.
Protocol:
1. Client randomly samples $r \leftarrow \mathbb{Z}_p$, and for each $y_i \in Y$ sends $m_i = H(y_i)^r$ to the server.
2. For each $i \in [n]$, the server computes $m'_i = m_i^k$.
3. If RevealIntersection is true, the server sends $\{m'_i \mid i \in [n]\}$ to the client ordered by $i$,
otherwise ordered by $m'_i$.
4. For each $i \in [n]$, the client computes $v_i = (m'_i)^{1/r}$.
5. Client queries the Bloom filter for each $v_i$ and computes $S = \{i \in [n] \mid v_i \in \mathrm{BF}\}$. If
RevealIntersection is true, the client outputs $S$, otherwise $|S|$.
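The message flow of Protocol 1 can be sketched in a few lines of Python (3.8 or later). This is a toy
model under loudly stated assumptions, not the library's code: it replaces the elliptic-curve group
with the multiplicative group modulo the Mersenne prime $2^{127} - 1$, which is not secure and only
mimics the algebra, it samples exponents that are invertible modulo the group order, and it uses a
plain Python set in place of the Bloom filter BF. All function names are hypothetical.

```python
import hashlib
import math
import secrets

# Toy parameters (assumption): integers modulo the Mersenne prime 2^127 - 1.
# This group is NOT secure; it only reproduces the exponent arithmetic.
P = 2**127 - 1
ORDER = P - 1  # order of the multiplicative group modulo the prime P


def hash_to_group(item: str) -> int:
    """H: map an input to a nonzero element modulo P (illustrative only)."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def random_exponent() -> int:
    """Sample an exponent that is invertible modulo the group order."""
    while True:
        e = secrets.randbelow(ORDER - 2) + 2
        if math.gcd(e, ORDER) == 1:
            return e


def server_setup(server_items, k):
    # Setup steps 2-3: u_i = H(x_i)^k; a set stands in for the Bloom filter.
    return {pow(hash_to_group(x), k, P) for x in server_items}


def client_request(client_items, r):
    # Protocol step 1: m_i = H(y_i)^r
    return [pow(hash_to_group(y), r, P) for y in client_items]


def server_response(request, k, reveal_intersection):
    # Protocol steps 2-3: m'_i = m_i^k, sorted unless the intersection is revealed.
    response = [pow(m, k, P) for m in request]
    return response if reveal_intersection else sorted(response)


def client_intersect(response, r, bf, reveal_intersection):
    # Protocol steps 4-5: v_i = (m'_i)^(1/r), then look up each v_i in BF.
    r_inv = pow(r, -1, ORDER)
    hits = [i for i, m in enumerate(response) if pow(m, r_inv, P) in bf]
    return hits if reveal_intersection else len(hits)


def run(server_items, client_items, reveal_intersection):
    k, r = random_exponent(), random_exponent()
    bf = server_setup(server_items, k)
    request = client_request(client_items, r)
    response = server_response(request, k, reveal_intersection)
    return client_intersect(response, r, bf, reveal_intersection)


server_items = ["alice", "bob", "carol", "dave"]
client_items = ["zoe", "bob", "yan", "dave"]
indices = run(server_items, client_items, reveal_intersection=True)
cardinality = run(server_items, client_items, reveal_intersection=False)
```

With RevealIntersection set, the client recovers the positions of "bob" and "dave" in its own input;
otherwise the sorted response hides which of its elements matched and only the count remains.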
3 Architecture
The main focus of the library is on performance and versatility. To achieve that, the core of the library
is built using C++, and our other language bindings all use this C++ core. We use the elliptic curve
P-256 to instantiate group $G$ in Protocol 1, relying on the implementation of Ion et al. [24]. For
message serialization, we use Protocol Buffers [2], and our project is built using Bazel [1] to support
builds across languages and platforms.
Figure 1 in the appendix depicts our library’s components and their interdependencies. We use our C
bindings to interface with languages like Go and Rust that don’t have a native C++ interface. For
Python and JavaScript, we use the C++ core library directly.
4 Applications
4.1 Privacy-Preserving Contact Tracing
Smartphone-based contact tracing makes it possible to notify people who may have been exposed to
infections, thereby allowing them to self-isolate or seek treatment. However, contact tracing data poses
risks of discrimination based on health status, or location leakage [3]. Hence, preserving the privacy of
people’s data is key for widespread adoption, especially considering that stopping a pandemic such
as COVID-19 requires adoption by approximately 60% of the whole population [42].
Several approaches for contact tracing have been developed this year, with similar patterns: Data
of infected people is collected on a central server (e.g. by a national health authority). Meanwhile,
people’s smartphones exchange IDs with each other and record in a local list which other IDs they
have been in contact with lately (e.g. based on Bluetooth). They regularly download a list of infected
people’s IDs from the server, and compute the intersection of the server and local ID lists to find out
whether they have been exposed. Notable protocols include TCN [39] and DP3T [11].
However, these protocols do not protect against linkage attacks, as essentially the server provides a
list of infected IDs to the clients. This gives rise to the possibility of re-identifying infected persons,
e.g. by recording IDs of people and using additional information (e.g., from cameras on the street).
Linkage attacks can be mitigated by using PSI-C for computing the intersection. So instead of the
server providing a list of infected IDs, only the intersection size is computed by following Protocol 1.
Hence no list of infected IDs is provided by the server; rather, both server and client encrypt their
sets and jointly compute the intersection. Using PSI-C for contact tracing has also been proposed
in concurrent work [41]. To show the feasibility of our approach, and in particular its compatibility
with existing protocols, we implement a privacy-preserving contact tracing library on top of the TCN
protocol [5].
4.2 Privacy-Preserving Machine Learning on Vertically Partitioned Data
Vertical Federated Learning (VFL) [15, 30, 49] applies federated learning [25] to vertically distributed
data, i.e., datasets that share partial information about the same entity, differing in the features of
each dataset. Different hospitals, for example, may have differing data about the same patient, but
cannot simply merge this data across institutions due to privacy reasons. For such situations, several
approaches have been suggested in the past [12, 13, 17, 27, 37, 43, 48], including logistic
regression [21, 31].
One approach for such scenarios is Split Learning (SL) [20, 44, 45] using a Split Neural Network
(SplitNN), where the Neural Network (NN) is split among participants, and each model segment acts
as a self-contained NN. Each model segment trains and forwards its result to the next segment until
completion. The security of SplitNN and its information leakage is being questioned [25], but an
enhanced privacy-preserving variant of SplitNN has been proposed in which the information leakage
is reduced using distance correlation [38, 46, 47]. Still, additional privacy-preserving methods could
be incorporated in SL when dealing with very sensitive datasets.
In our work, we have realized a privacy-preserving implementation of SplitNN trained on vertically
distributed data, called PyVertical [40]. In our proof-of-concept (Appendix C), we first utilise PSI
for identifying matching entries in two vertically distributed datasets in a privacy preserving manner,
and then train a SplitNN on this data, thereby ensuring the privacy of the raw data. PyVertical is
built upon the PySyft library [36], which provides security features and mechanisms for training
without sacrificing data privacy. Even though SplitNN does not provide formal security guarantees,
PSI allows us to hide those data points that are not part of the intersection. Hence PSI is beneficial
for just about any computation on vertically partitioned data (secure or not secure) if the parties want
to hide elements that are not in the intersection.
5 Evaluation and Conclusion
We benchmark our PSI library on an Amazon EC2 T3a.xlarge instance with 4 vCPUs at 2.5 GHz
and 16 GiB memory. Our results are presented in Table 1. It can be seen that there is little difference
between languages that bind directly to our C++ core. This is to be expected, as the running time is
dominated by elliptic curve operations, compared to which the overhead of crossing language barriers
(including protobuf serialization) is small. This changes when comparing WebAssembly and pure
JavaScript, which rely on cross-compilation of our library using Emscripten. Still, the overhead in
modern browsers is less than 10x compared to the native version, and on the client side stays under 3
minutes for all sizes we tested.
The closest libraries to ours in previous work are the implementations of Kiss et al. [28] and Kales
et al. [26]. Both focus on PSI for mobile applications, for example private contact discovery. We
compare our results with the ECC-NR-PSI protocol from Kales et al. [26] for $N = 1$M, $n = 1$k.
As noted there, this protocol already outperforms Kiss et al. [28]. As can be seen in Table 1, the
size of the setup message is smaller for Kales et al. [26]. This is to be expected due to the fact that
they use more efficient cuckoo filters [14], and our library has an additional overhead due to our
use of protocol buffers. However, in both the online phase and in terms of total communication,
our implementation improves on the results of Kales et al. [26]. In Appendix B we also present
experiments on a more communication-efficient variant of our protocol using Golomb-compressed
sets. We stress that the online times from [26] are not comparable to ours, since their experiments
were run on different hardware, in particular using a smartphone for the client.
In conclusion, our results show that our library is highly competitive in terms of running time and
communication. At the same time, it is flexible enough to support multiple platforms and languages,
including browsers. Possible improvements can be made by reducing the setup communication, for
example using cuckoo filters [14, 26]. Finally, we see extensions like Private-ID [4] as promising
future work.
Table 1: Benchmarks in seconds for $n$ client elements and $N = 1$M server elements. The probability
of a false positive over $n$ lookups was set to $p = 10^{-9}$. An asterisk (∗) indicates experiments that
were run on a smaller server set ($N = 100$k) and extrapolated from there. Numbers for previous
work were taken from [26, Table 6]. Cells with a dagger (†) were summed up in [26], so we only report
the sum.
Operation             Size      C++     C       Go      Python  WebAssembly  JS       [26]    Comm.       [26] (Comm.)
Server: Setup         n = 1k    184.1   181.5   183.9   188.9   1573.8∗      9128∗    241.54  6.85 MiB    4.19 MiB
                      n = 10k   184.2   181.7   184.0   188.4   1571.8∗      9273∗    -       7.42 MiB    -
                      n = 100k  184.2   181.8   184.3   189.1   1573∗        9139.1∗  -       7.99 MiB    -
Client: Request       n = 1k    0.18    0.17    0.18    0.185   1.57         9.1      2.92†   34.18 KiB   2.00 MiB
                      n = 10k   1.8     1.77    1.8     1.870   15.6         92       -       341.79 KiB  -
                      n = 100k  18.09   17.8    18.2    18.2    156.8        909.4    -       3.33 MiB    -
Server: Response      n = 1k    0.11    0.11    0.17    0.115   1.2          6.96     †       34.17 KiB   4.07 MiB
                      n = 10k   1.13    1.13    3.02    1.154   12           70.8     -       341.79 KiB  -
                      n = 100k  11.3    11.3    29.5    11.5    120.9        701      -       3.33 MiB    -
Client: Intersection  n = 1k    0.11    0.11    0.9     0.120   1.2          6.97     †       -           -
                      n = 10k   1.17    1.16    4.6     1.193   12.1         71.39    -       -           -
                      n = 100k  11.8    11.6    41.6    11.9    122          710      -       -           -
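The setup sizes in the "Comm." column of Table 1 are consistent with a Bloom filter sized by the
textbook formula when the per-lookup false-positive rate is taken to be $p/n$, so that a union bound
over the $n$ client lookups gives a total rate of at most $p$. This reading is an assumption inferred
from the numbers, not a documented property of the library, and the sketch below only checks the
arithmetic.

```python
import math

N = 1_000_000  # server set size from Table 1
p = 1e-9       # false-positive probability over all n lookups
# "Comm." column of the "Server: Setup" rows in Table 1, in MiB.
measured_mib = {1_000: 6.85, 10_000: 7.42, 100_000: 7.99}


def bloom_size_mib(num_items: int, per_lookup_fpr: float) -> float:
    """Size in MiB of a Bloom filter with the textbook optimal bit count."""
    bits = -num_items * math.log(per_lookup_fpr) / math.log(2) ** 2
    return bits / 8 / 2**20


# Assumption: each of the n lookups gets a false-positive budget of p / n.
estimates = {n: bloom_size_mib(N, p / n) for n in measured_mib}
for n, estimate in estimates.items():
    print(f"n = {n:>7}: estimated {estimate:.2f} MiB, measured {measured_mib[n]:.2f} MiB")
```

Under these assumptions the estimates agree with the reported values to within about 0.01 MiB, and
each tenfold increase in $n$ adds a fixed amount to the setup message.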
Acknowledgements
Realizing both the PSI library and the two implementations for contact tracing and vertically
partitioned federated learning has only been possible with the help of many people who contributed to
this. We extend our deepest gratitude to everybody who has been and continues to be involved in this
effort.
References
[1] Bazel – a fast, scalable, multi-language and extensible build system. URL https://bazel.build.
[2] Protocol buffers – Google’s data interchange format. URL
https://github.com/protocolbuffers/protobuf.
[3] William J Buchanan, Muhammad Ali Imran, Masood Ur-Rehman, Lei Zhang, Qammer H
Abbasi, Christos Chrysoulas, David Haynes, Nikolaos Pitropakis, and Pavlos Papadopoulos.
Review and critical analysis of privacy-preserving infection tracking and contact tracing. arXiv
preprint arXiv:2009.05126, 2020.
[4] Prasad Buddhavarapu, Andrew Knox, Payman Mohassel, Shubho Sengupta, Erik Taubeneck,
and Vlad Vlaskin. Private matching for compute. IACR Cryptol. ePrint Arch., 2020:599, 2020.
[5] Bogdan Cebere. TCN protocol based on Private Set Intersection Cardinality, 2020. URL
https://github.com/bcebere/TCN-PSI.
[6] Melissa Chase and Peihan Miao. Private set intersection in the internet setting from lightweight
oblivious PRF. In Annual International Cryptology Conference, pages 34–63. Springer, 2020.
[7] Emiliano De Cristofaro and Gene Tsudik. Practical private set intersection protocols with linear
complexity. In International Conference on Financial Cryptography and Data Security, pages
143–159. Springer, 2010.
[8] Emiliano De Cristofaro, Paolo Gasti, and Gene Tsudik. Fast and private computation of
cardinality of set intersection and union. In International Conference on Cryptology and
Network Security, pages 218–231. Springer, 2012.
[9] Daniel Demmler, Peter Rindal, Mike Rosulek, and Ni Trieu. PIR-PSI: Scaling private contact
discovery. IACR Cryptol. ePrint Arch., 2018.
[10] Li Deng. The MNIST database of handwritten digit images for machine learning research [best of
the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[11] DP3T. DP3T - Decentralized Privacy-Preserving Proximity Tracing, 2020. URL
https://github.com/DP-3T/documents.
[12] Wenliang Du and Mikhail J Atallah. Privacy-preserving cooperative statistical analysis. In
Seventeenth Annual Computer Security Applications Conference, pages 102–110. IEEE, 2001.
[13] Wenliang Du, Yunghsiang S Han, and Shigang Chen. Privacy-preserving multivariate statistical
analysis: Linear regression and classification. In Proceedings of the 2004 SIAM international
conference on data mining, pages 222–233. SIAM, 2004.
[14] Bin Fan, David G. Andersen, Michael Kaminsky, and Michael Mitzenmacher. Cuckoo filter:
Practically better than Bloom. In CoNEXT, pages 75–88. ACM, 2014.
[15] Siwei Feng and Han Yu. Multi-participant multi-class vertical federated learning. arXiv preprint
arXiv:2001.11154, 2020.
[16] Michael J Freedman, Kobbi Nissim, and Benny Pinkas. Efficient private matching and set
intersection. In International conference on the theory and applications of cryptographic
techniques, pages 1–19. Springer, 2004.
[17] Adrià Gascón, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee
Zahur, and David Evans. Privacy-preserving distributed linear regression on high-dimensional
data. Proceedings on Privacy Enhancing Technologies, 2017(4):345–364, 2017.
[18] Craig Gentry. Fully homomorphic encryption using ideal lattices. In Proceedings of the
forty-first annual ACM symposium on Theory of computing, pages 169–178, 2009.
[19] Oded Goldreich. Secure multi-party computation. Manuscript. Preliminary version, 78, 1998.
[20] Otkrist Gupta and Ramesh Raskar. Distributed learning of deep neural network over multiple
agents. Journal of Network and Computer Applications, 116:1–8, 2018.
[21] Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Richard Nock, Giorgio Patrini, Guillaume
Smith, and Brian Thorne. Private federated learning on vertically partitioned data via entity
resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677, 2017.
[22] Yan Huang, David Evans, and Jonathan Katz. Private set intersection: Are garbled circuits
better than custom protocols? In NDSS, 2012.
[23] Cornelius Ihle, Moritz Schubotz, Norman Meuschke, and Bela Gipp. A first step towards content
protecting plagiarism detection. JCDL ’20: Proceedings of the ACM/IEEE Joint Conference on
Digital Libraries in 2020, 2020.
[24] Mihaela Ion, Ben Kreuter, Ahmet Erhan Nergiz, Sarvar Patel, Mariana Raykova, Shobhit
Saxena, Karn Seth, David Shanahan, and Moti Yung. On deploying secure computing: Private
intersection-sum-with-cardinality. In EuroS&P. IEEE, 2020.
[25] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun
Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings,
Rafael G. L. D’Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett,
Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaïd Harchaoui, Chaoyang
He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri
Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi
Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer
Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn
Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr,
Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu,
and Sen Zhao. Advances and open problems in federated learning. CoRR, abs/1912.04977,
2019.
[26] Daniel Kales, Christian Rechberger, Thomas Schneider, Matthias Senker, and Christian Weinert.
Mobile private contact discovery at scale. In USENIX Security Symposium, pages 1447–1464.
USENIX Association, 2019.
[27] Alan F Karr, Xiaodong Lin, Ashish P Sanil, and Jerome P Reiter. Privacy-preserving analysis
of vertically partitioned data using secure matrix products. Journal of Official Statistics, 25(1):
125, 2009.
[28] Ágnes Kiss, Jian Liu, Thomas Schneider, N Asokan, and Benny Pinkas. Private set intersection
for unequal set sizes with mobile applications. Proceedings on Privacy Enhancing Technologies,
2017(4):177–197, 2017. URL https://encrypto.de/papers/KLSAP17.pdf.
[29] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh,
and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv
preprint arXiv:1610.05492, 2016.
[30] Yang Liu, Xiong Zhang, and Libin Wang. Asymmetrically vertical federated learning. arXiv
preprint arXiv:2004.07427, 2020.
[31] Richard Nock, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Giorgio Patrini, Guillaume
Smith, and Brian Thorne. Entity resolution and federated learning get a federated resolution.
arXiv preprint arXiv:1803.04035, 2018.
[32] L. T. Phong, Y. Aono, T. Hayashi, L. Wang, and S. Moriai. Privacy-preserving deep learning via
additively homomorphic encryption. IEEE Transactions on Information Forensics and Security,
13(5):1333–1345, 2018.
[33] Benny Pinkas, Thomas Schneider, and Michael Zohner. Scalable private set intersection based
on OT extension. ACM Transactions on Privacy and Security (TOPS), 21(2):1–35, 2018.
[34] Felix Putze, Peter Sanders, and Johannes Singler. Cache-, hash-, and space-efficient Bloom
filters. ACM J. Exp. Algorithmics, 14, 2009.
[35] Peter Rindal. libpsi. URL https://github.com/osu-crypto/libPSI.
[36] Theo Ryffel, Andrew Trask, Morten Dahl, Bobby Wagner, Jason Mancuso, Daniel Rueckert,
and Jonathan Passerat-Palmbach. A generic framework for privacy preserving deep learning.
arXiv preprint arXiv:1811.04017, 2018.
[37] Ashish P Sanil, Alan F Karr, Xiaodong Lin, and Jerome P Reiter. Privacy preserving regression
modelling via distributed computation. In Proceedings of the tenth ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 677–682, 2004.
[38] Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence
by correlation of distances. Ann. Statist., 35(6):2769–2794, 12 2007. doi:
10.1214/009053607000000505. URL https://doi.org/10.1214/009053607000000505.
[39] TCNCoalition. TCN Protocol, 2020. URL https://github.com/TCNCoalition/TCN.
[40] Tom Titcombe, Pavlos Papadopoulos, Adam Hall, and Robert Sandmann. PyVertical, 2020.
URL https://github.com/OpenMined/PyVertical.
[41] Ni Trieu, Kareem Shehata, Prateek Saxena, Reza Shokri, and Dawn Song. Epione: Lightweight
contact tracing with strong privacy. IEEE Data Eng. Bull., 43(2):95–107, 2020.
[42] Oxford University. Digital contact tracing can slow or even stop coronavirus transmission
and ease us out of lockdown, 2020. URL
https://www.research.ox.ac.uk/Article/2020-04-16-digital-contact-tracing-can-slow-or-even-stop-coronavirus-transmission-and-ease-us-out-of-lockdown.
[43] Jaideep Vaidya and Chris Clifton. Privacy preserving association rule mining in vertically
partitioned data. In Proceedings of the eighth ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 639–644, 2002.
[44] Praneeth Vepakomma, Otkrist Gupta, Tristan Swedish, and Ramesh Raskar. Split learning
for health: Distributed deep learning without sharing raw patient data. arXiv preprint
arXiv:1812.00564, 2018.
[45] Praneeth Vepakomma, Tristan Swedish, Ramesh Raskar, Otkrist Gupta, and Abhimanyu Dubey.
No peek: A survey of private distributed deep learning, 2018.
[46] Praneeth Vepakomma, Chetan Tonde, Ahmed Elgammal, et al. Supervised dimensionality
reduction via distance correlation maximization. Electronic Journal of Statistics, 12(1):960–984,
2018.
[47] Praneeth Vepakomma, Otkrist Gupta, Abhimanyu Dubey, and Ramesh Raskar. Reducing
leakage in distributed deep learning for sensitive health data. arXiv preprint arXiv:1812.00564,
2019.
[48] Li Wan, Wee Keong Ng, Shuguo Han, and Vincent CS Lee. Privacy-preservation for gradient
descent methods. In Proceedings of the 13th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 775–783, 2007.
[49] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept
and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19,
2019.
A Protocol Flow and Library Architecture
The high-level flow of the PSI library is shown in Figure 2. Its high-level architecture and the
associated language dependencies are shown in Figure 1.
Figure 1: PSI library high-level architecture and language dependencies
B Communication Size
For many use cases of PSI, the server must send a large set of setup data to the client. Therefore, in
addition to Bloom filters, we also implement Golomb coding [34] for better compression of the setup
set. We will refer to this data structure as Golomb-compressed sets (GCS). We compare the sizes of
Golomb-compressed sets and Bloom filters that need to be sent to the client in Table 2. Internally,
the GCS implementation uses a one-pass encoding scheme on the server side, and an on-the-fly
decoding scheme for bulk intersection on the client side, so it is similar in speed to Bloom filters for
bulk insertion and intersection. However, Bloom filters are faster at random accesses and insertions
of a small number of elements. Therefore, it is important to weigh the costs of communication against
the costs of random access lookups when deciding which underlying data structure to use.
Figure 2: PSI library high-level flow.
Table 2: Communication sizes for Bloom filters, Golomb-compressed sets, and naively storing 64-bit
integers with varying false-positive rates. We insert $10^4$ elements and calculate the size of each data
structure.
FPR         Naive   Bloom filter  GCS
$10^{-6}$   80 KB   36 KB         27 KB
$10^{-7}$   80 KB   42 KB         31 KB
$10^{-8}$   80 KB   48 KB         35 KB
$10^{-9}$   80 KB   54 KB         39 KB
$10^{-10}$  80 KB   60 KB         43 KB
$10^{-11}$  80 KB   66 KB         48 KB
$10^{-12}$  80 KB   72 KB         52 KB
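The Naive and Bloom filter columns of Table 2 can be reproduced with a short calculation. The sketch
assumes that 1 KB means 1000 bytes and that the Bloom filter uses the textbook optimal size of
$\log_2(1/\mathrm{FPR}) / \ln 2$ bits per element. It also prints the information-theoretic minimum of
$\log_2(1/\mathrm{FPR})$ bits per element for comparison; that bound is not the library's GCS
encoding, and the GCS values are simply copied from the table.

```python
import math

NUM_ITEMS = 10**4
# Rows of Table 2: exponent e of the false-positive rate 10^-e -> (Bloom KB, GCS KB).
table2 = {6: (36, 27), 7: (42, 31), 8: (48, 35), 9: (54, 39),
          10: (60, 43), 11: (66, 48), 12: (72, 52)}

naive_kb = NUM_ITEMS * 64 / 8 / 1000  # 64-bit integers stored as-is

rows = []
for e, (bloom_table_kb, gcs_table_kb) in table2.items():
    log2_inv_fpr = e * math.log2(10)
    # Textbook optimal Bloom filter: log2(1/FPR) / ln(2) bits per element.
    bloom_kb = NUM_ITEMS * log2_inv_fpr / math.log(2) / 8 / 1000
    # Information-theoretic minimum for an approximate membership structure.
    lower_bound_kb = NUM_ITEMS * log2_inv_fpr / 8 / 1000
    rows.append((e, bloom_kb, lower_bound_kb, bloom_table_kb, gcs_table_kb))
    print(f"FPR 1e-{e}: Bloom {bloom_kb:.1f} KB, lower bound {lower_bound_kb:.1f} KB")
```

Under these assumptions the computed Bloom filter sizes round to the tabulated values, and the
reported GCS sizes fall between the lower bound and the Bloom filter size in every row.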
C PyVertical proof-of-concept process
For our proof-of-concept of PyVertical we used the MNIST dataset of handwritten images and their
labels [10]. The overall process is shown in Figure 3. MNIST is a horizontally distributed dataset. As
shown in Figure 3(a), we have added a new ID field to this data which assigns a unique ID to each
data point. Furthermore, as shown in Figure 3(b), we split the full dataset into two datasets: one with
the handwritten images and their IDs, and one with the labels and their IDs. In the next step, we
shuffle the datasets and randomly remove data points from both.
In the actual experiment, we utilise PSI to identify matching IDs in the two datasets, as shown in
Figure 3(c), link the data points with shared IDs, and order both datasets accordingly. Data
points that are not elements of the intersection are purged in this example. Finally, we create a
SplitNN, as shown in Figure 3(d), where one part of the network is trained on the images dataset
and the other part of the network is trained on the labels dataset. This allows the model to be trained
in a privacy-preserving manner by leveraging PySyft’s pointers functionality [36], which keeps the raw
training data hidden throughout the training process.
Figure 3: PyVertical proof-of-concept implementation. Panels: (a) full dataset; (b) split images and
labels datasets; (c) PSI linkage and ordering; (d) SplitNN training.
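The data preparation steps (a) to (c) can be sketched as follows. This is a simplified illustration
and not PyVertical's code: short strings stand in for MNIST images, the dataset has 20 rows, and a
plain Python set intersection stands in for the PSI protocol, so no privacy is provided here. All
variable names are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# (a) Full dataset with an added ID field; strings stand in for MNIST images.
full = [{"id": f"id-{i:03d}", "image": f"img-{i}", "label": i % 10} for i in range(20)]

# (b) Vertical split: one party holds the images, the other holds the labels.
images = [{"id": row["id"], "image": row["image"]} for row in full]
labels = [{"id": row["id"], "label": row["label"]} for row in full]

# Shuffle both datasets and randomly drop five data points from each.
for party in (images, labels):
    random.shuffle(party)
    del party[-5:]

# (c) Plain set intersection stands in for PSI; a real run would use the protocol.
shared_ids = sorted({row["id"] for row in images} & {row["id"] for row in labels})

# Purge non-matching rows and order both datasets by the shared IDs.
images_by_id = {row["id"]: row for row in images}
labels_by_id = {row["id"]: row for row in labels}
aligned_images = [images_by_id[i] for i in shared_ids]
aligned_labels = [labels_by_id[i] for i in shared_ids]
```

After alignment, row $j$ of `aligned_images` and row $j$ of `aligned_labels` describe the same data
point, which is the precondition for training the two SplitNN segments in step (d).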