SHARE: system design and case studies for
statistical health information release
James Gardner,1 Li Xiong,2,4 Yonghui Xiao,2 Jingjing Gao,3 Andrew R Post,3,4
Xiaoqian Jiang,5 Lucila Ohno-Machado5
1Digital Reasoning Systems Inc,
Franklin, Tennessee, USA
2Department of Mathematics
and Computer Science, Emory
University, Atlanta, Georgia, USA
3Center for Comprehensive
Informatics, Emory University,
Atlanta, Georgia, USA
4Department of Biomedical
Informatics, Emory University,
Atlanta, Georgia, USA
5Division of Biomedical
Informatics, University of
California San Diego,
San Diego, California, USA
Correspondence to Dr Li Xiong,
Department of Mathematics and
Computer Science, Emory University,
400 Dowman Dr, Atlanta,
GA 30322, USA
Received 14 April 2012
Accepted 11 September 2012
Published Online First
11 October 2012
Objectives We present SHARE, a new system for
statistical health information release with differential
privacy. We present two case studies that evaluate the
software on real medical datasets and demonstrate the
feasibility and utility of applying the differential privacy
framework on biomedical data.
Materials and Methods SHARE releases statistical
information in electronic health records with differential
privacy, a strong privacy framework for statistical data
release. It includes a number of state-of-the-art methods
for releasing multidimensional histograms and longitudinal
patterns. We performed a variety of experiments on two
real datasets, the Surveillance, Epidemiology, and End Results (SEER)
breast cancer dataset and the Emory electronic medical record (EeMR)
dataset, to
demonstrate the feasibility and utility of SHARE.
Results Experimental results indicate that SHARE can
deal with heterogeneous data present in medical data,
and that the released statistics are useful. The Kullback–Leibler
divergence between the released multidimensional
histograms and the original data distribution is below 0.5
and 0.01 for seven-dimensional and three-dimensional
data cubes generated from the SEER dataset, respectively.
The relative error for longitudinal pattern queries on the
EeMR dataset varies between 0 and 0.3. While the
results are promising, they also suggest that challenges
remain in applying statistical data release using the
differential privacy framework for higher dimensional data.
Conclusions SHARE is one of the first systems to provide a mechanism for custodians to release differentially private aggregate statistics for a variety of use cases in the medical domain. This proof-of-concept system is intended to be applied to large-scale medical datasets.
Recent studies and advisory reports to the government1–3 have pointed out that information sharing with appropriate privacy protection is one of the most critical challenges of our time, which has the potential to help revolutionize healthcare. In particular, the Institute of Medicine's committee on health research and the privacy of health information concludes3 that the current Health Insurance Portability and Accountability Act (1996) (HIPAA) privacy rule (http://www.hhs.gov/ocr/privacy/) does not protect privacy well and calls for an entirely new approach to protecting privacy in health research.
We present and describe a new software framework, SHARE, for releasing statistical health information with differential privacy, a strong privacy framework for statistical data release. Through studies with real medical datasets, we gain insight into the feasibility and utility of applying differentially private statistical data release to medical data.
BACKGROUND AND SIGNIFICANCE
The problem of preserving patient privacy in disseminated biomedical datasets has attracted increasing attention from both the biomedical informatics and computer science communities.3–7 The goal is to share a 'sanitized' version of the individual records (microdata) that simultaneously provides utility for data users and privacy protection for the individuals represented in the records. In the biomedical domain, many text de-identification tools are focused on extracting identifiers from different types of medical documents and use simple identifier removal or replacements according to the HIPAA safe harbor method for de-identification.7–10 Several studies and reviews have evaluated the re-identification risks of linking de-identified data by the HIPAA safe harbor method with external data such as voter registration lists.11–14 Many studies have proposed or applied formal anonymization methods on medical data.15–22 While still the dominant approach in practice, the main limitation of microdata release with de-identification is that it often relies on assumptions of certain background or external knowledge (eg, availability of voter registration lists) and only protects against specific attacks (eg, linking or re-identification attacks).
A complementary research problem to microdata (ie, original data) release is to release only privacy-preserving statistical macrodata (ie, derived statistics), which could also be used to construct synthetic data. Differential privacy23–25 has emerged as one of the strongest unconditional privacy guarantees for statistical data release. It makes few assumptions on the background or external knowledge of an attacker, and thus provides a strong provable privacy guarantee. A statistical aggregation or computation satisfies ε-differential privacy, ie, is ε-differentially private, if the outcomes are formally 'indistinguishable' ('indistinguishable' is formally and quantitatively defined in Dwork;23 the outcome probability differs by no more than a multiplicative factor eε) when run with and without any particular record in the dataset, where ε is a privacy parameter that limits the maximum amount of influence a record can have on the outcome. A common mechanism to achieve ε-differential privacy is the Laplace mechanism, which adds calibrated noise to a statistical measure, as determined by a given privacy parameter ε and the sensitivity of the statistical measure to the inclusion and exclusion of any record in the dataset. A more stringent (smaller) privacy parameter requires more noise.
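As an illustrative sketch (not the SHARE implementation itself), the Laplace mechanism for a simple count query can be written as follows; the function name and the toy patient records are hypothetical:

```python
import numpy as np

def private_count(records, predicate, epsilon, rng=None):
    """Release an epsilon-differentially private count via the Laplace mechanism.

    A count query has sensitivity 1: adding or removing one record changes
    the true count by at most 1, so Laplace noise with scale 1/epsilon
    suffices to satisfy epsilon-differential privacy.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    # Calibrated noise: scale = sensitivity / epsilon = 1 / epsilon.
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical toy dataset: count patients over 60 years of age.
patients = [{"age": 72}, {"age": 45}, {"age": 63}, {"age": 80}]
noisy = private_count(patients, lambda p: p["age"] > 60, epsilon=0.1)
```

A smaller ε yields noise with a larger scale 1/ε, so the released count is less accurate but each individual record has less influence on the output.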
J Am Med Inform Assoc 2013;20:109–116. doi:10.1136/amiajnl-2012-001032 109
Research and applications
5. Malin B. An evaluation of the current state of genomic data privacy protection technology and a roadmap for the future. J Am Med Inform Assoc.
6. Ohno-Machado L, Bafna V, Boxwala A, et al. iDASH: integrating data for analysis, anonymization, and sharing. J Am Med Inform Assoc 2011;19:196–201. PMID: 22081224; PMCID: PMC3277627.
7. Meystre SM, Friedlin FJ, South BR, et al. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med Res Methodol 2010;10:70.
8. Sweeney L. Replacing personally-identifying information in medical records, the Scrub system. Proc AMIA Annu Fall Symp 1996:333–7.
9. Szarvas G, Farkas R, Busa-Fekete R. State-of-the-art anonymization of medical records using an iterative machine learning framework. J Am Med Inform Assoc.
10. Uzuner O, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc 2007;14:550–63.
11. El Emam K, Jabbouri S, Sams S, et al. Evaluating common de-identification heuristics for personal health information. J Med Internet Res 2006;8:e28.
12. El Emam K, Brown A, AbdelMalik P. Evaluating predictors of geographic area population size cut-offs to manage re-identification risk. J Am Med Inform Assoc.
13. Benitez K, Malin B. Evaluating re-identification risks with respect to the HIPAA privacy rule. J Am Med Inform Assoc 2010;17:169–77.
14. El Emam K, Jonker E, Arbuckle L, et al. A systematic review of re-identification attacks on health data. PLoS One 2011;6:e28071.
15. Ohrn A, Ohno-Machado L. Using Boolean reasoning to anonymize databases. Artif Intell Med 1999;15:235–54. PMID: 10206109.
16. Sweeney L. k-anonymity: a model for protecting privacy. Int J Uncertainty Fuzziness Knowledge-Based Syst 2002;10:557–70.
17. Ohno-Machado L, Silveira PS, Vinterbo S. Protecting patient privacy by quantifiable control of disclosures in disseminated databases. Int J Med Inform 2004;73:599–606.
18. El Emam K, Dankar FK. Protecting privacy using k-anonymity. J Am Med Inform Assoc.
19. Mohammed N, Fung BCM, Hung PCK, et al. Anonymizing healthcare data: a case study on the blood transfusion service. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). New York, NY, USA: ACM Press, 2009:1285.
20. Malin B, Benitez K, Masys D. Never too old for anonymity: a statistical standard for demographic data sharing via the HIPAA privacy rule. J Am Med Inform Assoc.
21. El Emam K, Paton D, Dankar F, et al. De-identifying a public use microdata file from the Canadian national discharge abstract database. BMC Med Inform Decis Mak.
22. El Emam K, Arbuckle L, Koru G, et al. De-identification methods for open health data: the case of the Heritage Health Prize claims dataset. J Med Internet Res 2012;14:e33.
23. Dwork C. Differential privacy. In: International Colloquium on Automata, Languages and Programming (ICALP), 2006.
24. Dwork C. Differential privacy: a survey of results. In: Agrawal M, Du DZ, Duan Z, et al., eds. TAMC, volume 4978 of Lecture Notes in Computer Science. Xi'an, China: Springer, 2008.
25. Dwork C. A firm foundation for private data analysis. Commun ACM 2011;54.
26. McSherry FD. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Binnig C, Dageville B, eds. Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD '09). New York, NY, USA: ACM, 2009:19–30. doi:10.1145/1559845.1559850.
27. McSherry F. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun ACM 2010;53:89–97.
28. Vinterbo S, Sarwate A, Boxwala A. Protecting count queries in study design. J Am Med Inform Assoc. Published Online First: 17 Apr 2012. doi:10.1136/amiajnl-2011–
29. Dwork C, Naor M, Reingold O, et al. On the complexity of differentially private data release: efficient algorithms and hardness results. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC '09). New York, NY, USA: ACM Press, 2009:381.
30. Dankar F, El Emam K. The application of differential privacy to health data. In: EDBT/ICDT 2012 Joint Conference. Berlin, Germany, 2012:1–9.
31. Gardner J, Xiong L. HIDE: an integrated system for health information DE-identification. In: 21st IEEE International Symposium on Computer-Based Medical Systems. IEEE, 2008:254–9.
32. Gardner J, Xiong L, Li K, et al. HIDE: heterogeneous information DE-identification. In: Proceedings of the 12th International Conference on Extending Database Technology (EDBT '09). New York, NY, USA: ACM.
33. Gardner J, Xiong L. An integrated framework for de-identifying unstructured medical data. Data Knowl Eng 2009;68:1441–51.
34. Gardner JJ, Xiong L, Wang F, et al. An evaluation of feature sets and sampling techniques for de-identification of medical records. In: The 1st ACM International Health Informatics Symposium (IHI), 2010:1–8.
35. Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann, 2001.
36. Okazaki N. CRFsuite: a fast implementation of conditional random fields (CRFs). 2007. http://www.chokkan.org/software/crfsuite/ (accessed June 2008).
37. Jurczyk P, Lu JJ, Xiong L, et al. FRIL: a tool for comparative record linkage. AMIA Annu Symp Proc 2008:440–4.
38. Xiao Y, Xiong L, Yuan C. Differentially private data release through multidimensional partitioning. Secure Data Manag 2011;6358:150–68.
39. Xiao Y, Gardner J, Xiong L. DPCube: releasing differentially private data cubes for health information. In: 28th IEEE International Conference on Data Engineering (ICDE), 2012.
40. Poosala V, Haas PJ, Ioannidis YE, et al. Improved histograms for selectivity estimation of range predicates. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD '96). ACM, 1996:294–305.
41. Hay M, Rastogi V, Miklau G, et al. Boosting the accuracy of differentially private histograms through consistency. In: Proceedings of the International Conference on Very Large Data Bases, 2009;3:15.
42. Chen R, Mohammed N, Fung BCM, et al. Publishing set-valued data via differential privacy. Proceedings of the VLDB Endowment 2011;4:1087–98.
43. Chen R, Fung BCM, Desai BC, et al. Differentially private transit data publication: a case study on the Montreal transportation system. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD). Beijing, China: ACM Press, August 2012.
44. Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) Research Data (1973–2009), National Cancer Institute, DCCPS, Surveillance Research Program, Surveillance Systems Branch, released April 2011.
45. Hay M, Rastogi V, Miklau G, et al. Boosting the accuracy of differentially private histograms through consistency. The 36th International Conference on Very Large Data Bases, 13–17 September 2010, Singapore. In: Proceedings of the VLDB Endowment, Vol 3, No 1.
46. Cormode G, Procopiuc M, Shen E, et al. Differentially private spatial decompositions. In: IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1–5 April 2012. IEEE Computer Society.
47. Xiao Y, Xiong L, Fan L, et al. DPCube: differentially private histogram release through multidimensional partitioning. Eprint arXiv:1202.5358.
48. Wu Y, Jiang X, Kim J, et al. Grid Binary LOgistic REgression (GLORE): building shared models without sharing data. J Am Med Inform Assoc 2012;19:758–64.