De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset

Electronic Health Information Laboratory, CHEO Research Institute Inc., 401 Smyth Road, Ottawa, ON, Canada.
Journal of Medical Internet Research (Impact Factor: 3.43). 02/2012; 14(1):e33. DOI: 10.2196/jmir.2001
Source: PubMed


There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013.
Our objective was to de-identify the claims data used in the HHP competition and ensure that it meets the requirements in the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. Three plausible re-identification attacks that can be executed on these data were identified. For each attack the re-identification probability was evaluated. If it was deemed too high then a new de-identification algorithm was applied to reduce the risk to an acceptable level. We performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack.
An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was .0084, below the .05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions.
It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and methodology for, achieving open data principles for longitudinal health data.
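The threshold comparison described above can be illustrated with a minimal sketch. It assumes one common way of estimating risk: each record's re-identification probability is taken as 1 divided by the number of records sharing its quasi-identifier values (its equivalence class), and the dataset-level figure is the average over records. The abstract does not state that the authors computed risk this way. The function names, field names, and toy records below are hypothetical, and only the .05 threshold comes from the text.

```python
from collections import Counter

# Threshold stated in the abstract for the competition dataset.
RISK_THRESHOLD = 0.05


def average_reidentification_risk(records, quasi_identifiers):
    """Average per-record risk, assuming risk = 1 / equivalence-class size.

    This is an illustrative estimator, not necessarily the one used
    for the HHP dataset.
    """
    if not records:
        raise ValueError("records must not be empty")
    # Group records by their combination of quasi-identifier values.
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    class_sizes = Counter(keys)
    # A record in a class of size k is assumed to be matched with probability 1/k.
    return sum(1.0 / class_sizes[key] for key in keys) / len(keys)


def meets_threshold(risk, threshold=RISK_THRESHOLD):
    """Return True when the estimated risk is at or below the threshold."""
    return risk <= threshold


# Made-up toy records; the field names are hypothetical.
toy_records = [
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "80+", "sex": "M"},
]

risk = average_reidentification_risk(toy_records, ["age_group", "sex"])
print(f"estimated risk: {risk:.4f}")
print("acceptable" if meets_threshold(risk) else "needs further de-identification")
```

In this toy case the single record in its own class drives the average well above the threshold, which is the situation in which further generalization or suppression would be applied before release.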

    • "In Section 5 we will briefly comment on the effects of data imputation on the estimation results. 6 Another possible explanation is that longer delays are the result of expensive treatments which require an unusually long time to pay (El Emam et al., 2012). Such treatments might indicate a severe illness of the patient. "
    ABSTRACT: For a large heterogeneous group of patients, we analyse probabilities of hospital admission and distributional properties of lengths of hospital stay conditional on individual determinants. Bayesian structured additive regression models for zero-inflated and overdispersed count data are employed. In addition, the framework is extended towards hurdle specifications, providing an alternative approach to cover particularly large frequencies of zero quotes in count data. As a specific merit, the model class considered embeds linear and nonlinear effects of covariates on all distribution parameters. Linear effects indicate that the quantity and severity of prior illness are positively correlated with the risk of hospital admission, while medical prevention (in the form of general practice visits) and rehabilitation reduce the expected length of future hospital stays. Flexible nonlinear response patterns are diagnosed for age and an indicator of a patient's socioeconomic status. We find that social deprivation exhibits a positive impact on the risk of admission and a negative effect on the expected length of future hospital stays of admitted patients. Copyright © 2015 John Wiley & Sons, Ltd.
    Journal of Applied Econometrics 03/2015; DOI:10.1002/jae.2454 · 1.76 Impact Factor
  • Source
    ABSTRACT: Differential privacy has gained a lot of attention in recent years as a general model for the protection of personal information when used and disclosed for secondary purposes. It has also been proposed as an appropriate model for health data. In this paper we review the current literature on differential privacy and highlight important general limitations to the model and the proposed mechanisms. We then examine some practical challenges to the application of differential privacy to health data. The review concludes by identifying areas that researchers and practitioners in this area need to address to increase the adoption of differential privacy for health data.
  • Source
    ABSTRACT: Information about the QT interval from surface electrocardiograms (ECGs) is essential for surveillance of the proarrhythmia potential of marketed drugs. However, ECG records obtained in daily practice cannot be easily used for this purpose without labor-intensive manual effort. This study was aimed at constructing an open-access QT database, the Electrocardiogram Vigilance with Electronic Data Warehouse (ECG-ViEW). This longitudinal observational database contains 710,369 measurements of QT and associated clinical data from 371,401 patients. The de-identified database is freely available at
    Clinical Pharmacology & Therapeutics 07/2012; 92(3):393-6. DOI:10.1038/clpt.2012.93 · 7.90 Impact Factor