De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset

Electronic Health Information Laboratory, CHEO Research Institute Inc., 401 Smyth Road, Ottawa, ON, Canada.
Journal of Medical Internet Research (Impact Factor: 4.67). 02/2012; 14(1):e33. DOI: 10.2196/jmir.2001
Source: PubMed

ABSTRACT
BACKGROUND: There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data, and there is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, from claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy and will receive a US $3 million cash prize. The HHP began on April 4, 2011, and ends on April 3, 2013.
OBJECTIVE: To de-identify the claims data used in the HHP competition and ensure that they meet the requirements of the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
METHODS: We defined a threshold risk, consistent with the HIPAA Privacy Rule Safe Harbor standard, for disclosing the competition dataset. We identified three plausible re-identification attacks that could be executed against these data and evaluated the re-identification probability for each. Where the probability was deemed too high, a de-identification algorithm was applied to reduce the risk to an acceptable level. We then performed an empirical evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to our assumptions. The main metric was the probability that a record in the HHP dataset could be re-identified given an attempted attack.
RESULTS: An evaluation of the de-identified dataset estimated the probability of re-identifying an individual at 0.0084, below the 0.05 threshold specified for the competition. The risk was robust to violations of our initial assumptions.
CONCLUSIONS: It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released to a global user community in support of an analytics competition. This provides both an example of, and a methodology for, achieving open data principles with longitudinal health data.
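The risk metric described above, the probability that a record can be re-identified, is commonly estimated from the sizes of equivalence classes: groups of records sharing the same quasi-identifier values. The sketch below illustrates that general estimator only; the field names are hypothetical and this is not the authors' actual de-identification algorithm.

```python
from collections import Counter

def reid_probability(records, quasi_identifiers):
    """Average re-identification risk: the mean of 1/|equivalence class|,
    where classes group records with identical quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    # each record's individual risk is 1 / (size of its equivalence class)
    return sum(1.0 / class_sizes[k] for k in keys) / len(keys)

# toy example with hypothetical quasi-identifiers
records = [
    {"age_group": "30-39", "sex": "F", "days_in_hospital": 2},
    {"age_group": "30-39", "sex": "F", "days_in_hospital": 0},
    {"age_group": "30-39", "sex": "F", "days_in_hospital": 5},
    {"age_group": "40-49", "sex": "M", "days_in_hospital": 1},
]
risk = reid_probability(records, ["age_group", "sex"])
print(risk)  # 0.5: three records share one class, one record is unique
```

Generalization (binning ages into groups, suppressing rare combinations) lowers this average by enlarging equivalence classes, which is the usual lever for pushing risk under a threshold such as 0.05.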

    ABSTRACT: Differential privacy has gained a lot of attention in recent years as a general model for the protection of personal information when used and disclosed for secondary purposes. It has also been proposed as an appropriate model for health data. In this paper we review the current literature on differential privacy and highlight important general limitations to the model and the proposed mechanisms. We then examine some practical challenges to the application of differential privacy to health data. The review concludes by identifying areas that researchers and practitioners in this area need to address to increase the adoption of differential privacy for health data.
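The canonical mechanism for achieving differential privacy is to add Laplace noise scaled to the query's sensitivity divided by the privacy parameter epsilon. A minimal sketch of that general mechanism (not the specific mechanisms surveyed in the paper):

```python
import random

def laplace_mechanism(true_value, epsilon, sensitivity=1.0, rng=random):
    """Return true_value perturbed with Laplace(scale = sensitivity/epsilon)
    noise, satisfying epsilon-differential privacy for a query whose answer
    changes by at most `sensitivity` when one person's data changes."""
    scale = sensitivity / epsilon
    # the difference of two iid Exp(1) draws is Laplace(0, 1)
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_value + noise

# counting queries have sensitivity 1; smaller epsilon means more noise
noisy_count = laplace_mechanism(1200, epsilon=0.1)
print(noisy_count)
```

The limitations discussed in the review largely concern choosing epsilon and the utility cost of this noise, which grows as epsilon shrinks.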
    ABSTRACT: Information about the QT interval from surface electrocardiograms (ECGs) is essential for surveillance of the proarrhythmia potential of marketed drugs. However, ECG records obtained in daily practice cannot be easily used for this purpose without labor-intensive manual effort. This study was aimed at constructing an open-access QT database, the Electrocardiogram Vigilance with Electronic Data Warehouse (ECG-ViEW). This longitudinal observational database contains 710,369 measurements of QT and associated clinical data from 371,401 patients. The de-identified database is freely available at
    Clinical Pharmacology & Therapeutics 07/2012; 92(3):393-6. DOI:10.1038/clpt.2012.93 · 7.39 Impact Factor
    ABSTRACT: OBJECTIVES: We present SHARE, a new system for statistical health information release with differential privacy. We present two case studies that evaluate the software on real medical datasets and demonstrate the feasibility and utility of applying the differential privacy framework on biomedical data. MATERIALS AND METHODS: SHARE releases statistical information in electronic health records with differential privacy, a strong privacy framework for statistical data release. It includes a number of state-of-the-art methods for releasing multidimensional histograms and longitudinal patterns. We performed a variety of experiments on two real datasets, the Surveillance, Epidemiology, and End Results (SEER) breast cancer dataset and the Emory electronic medical record (EeMR) dataset, to demonstrate the feasibility and utility of SHARE. RESULTS: Experimental results indicate that SHARE can handle the heterogeneous data present in medical records, and that the released statistics are useful. The Kullback-Leibler divergence between the released multidimensional histograms and the original data distribution is below 0.5 and 0.01 for seven-dimensional and three-dimensional data cubes generated from the SEER dataset, respectively. The relative error for longitudinal pattern queries on the EeMR dataset varies between 0 and 0.3. While the results are promising, they also suggest that challenges remain in applying statistical data release using the differential privacy framework for higher dimensional data. CONCLUSIONS: SHARE is one of the first systems to provide a mechanism for custodians to release differentially private aggregate statistics for a variety of use cases in the medical domain. This proof-of-concept system is intended to be applied to large-scale medical data warehouses.
    Journal of the American Medical Informatics Association 10/2012; 20(1). DOI:10.1136/amiajnl-2012-001032 · 3.93 Impact Factor
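The utility metric reported above, the Kullback-Leibler divergence between a released histogram and the original distribution, can be computed directly from bin counts. A minimal sketch (my own illustration, not the SHARE implementation), assuming both histograms cover the same bins:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two histograms over the same bins.
    Counts are normalized to probability distributions; eps guards
    against log(0) when a released bin is empty."""
    sp, sq = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / sp, max(qi / sq, eps)
        if pi > 0:
            total += pi * math.log(pi / qi)
    return total

# identical histograms have zero divergence
print(kl_divergence([10, 20, 30], [10, 20, 30]))  # 0.0
```

Lower values mean the noisy released histogram is closer to the original distribution, which is how thresholds like the 0.5 and 0.01 figures above should be read.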