De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset

Electronic Health Information Laboratory, CHEO Research Institute Inc., 401 Smyth Road, Ottawa, ON, Canada.
Journal of Medical Internet Research (Impact Factor: 3.43). 02/2012; 14(1):e33. DOI: 10.2196/jmir.2001
Source: PubMed


There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013.
Our objective was to de-identify the claims data used in the HHP competition and to ensure that the dataset meets the requirements of the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. We identified three plausible re-identification attacks that could be executed on these data and evaluated the re-identification probability for each attack. If the probability was deemed too high, a new de-identification algorithm was applied to reduce the risk to an acceptable level. We then evaluated the re-identification risk empirically, using simulated attacks and matching experiments, to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack.
An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was .0084, below the .05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions.
It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and methodology for, achieving open data principles for longitudinal health data.
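The evaluate-then-de-identify loop described in the methods can be illustrated with a minimal sketch. The abstract does not give the actual attack models or algorithms, so everything here is an assumption for illustration: risk is modeled as the average of 1/k over records, where k is the size of a record's equivalence class on its quasi-identifiers, and the only de-identification step is banding a hypothetical `age` field. Only the 0.05 threshold comes from the text above.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Average probability of re-identifying a record.

    Assumed model (not from the paper): a record's risk is 1 / k, where k is
    the number of records sharing its quasi-identifier values.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return sum(1.0 / class_sizes[k] for k in keys) / len(keys)


def generalize_age(record, width):
    """Replace an exact age with a band of the given width, e.g. 23 -> '20-24'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


def deidentify(records, quasi_identifiers, threshold=0.05, widths=(5, 10, 20)):
    """Coarsen age step by step until the measured risk meets the threshold."""
    for width in widths:
        candidate = [generalize_age(r, width) for r in records]
        risk = reidentification_risk(candidate, quasi_identifiers)
        if risk <= threshold:
            return width, risk, candidate
    raise ValueError("no tested generalization met the risk threshold")


# Hypothetical toy data: 400 records, ages 20-59, two sex values.
records = [{"age": 20 + (i % 40), "sex": "MF"[i % 2]} for i in range(400)]
raw_risk = reidentification_risk(records, ["age", "sex"])
width, risk, released = deidentify(records, ["age", "sex"])
print(f"raw risk {raw_risk:.3f}; age band width {width} gives risk {risk:.3f}")
```

On this toy data the raw risk exceeds the threshold, and the first age band tested brings it below 0.05. A real evaluation would model each attack separately and use longitudinal claims fields rather than two demographic attributes.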

    • "In Section 5 we will briefly comment on the effects of data imputation on the estimation results. 6 Another possible explanation is that longer delays are the result of expensive treatments which require an unusually long time to pay (El Emam et al., 2012). Such treatments might indicate a severe illness of the patient. "
    ABSTRACT: For a large heterogeneous group of patients, we analyse probabilities of hospital admission and distributional properties of lengths of hospital stay conditional on individual determinants. Bayesian structured additive regression models for zero-inflated and overdispersed count data are employed. In addition, the framework is extended towards hurdle specifications, providing an alternative approach to cover particularly large frequencies of zero quotes in count data. As a specific merit, the model class considered embeds linear and nonlinear effects of covariates on all distribution parameters. Linear effects indicate that the quantity and severity of prior illness are positively correlated with the risk of hospital admission, while medical prevention (in the form of general practice visits) and rehabilitation reduce the expected length of future hospital stays. Flexible nonlinear response patterns are diagnosed for age and an indicator of a patient's socioeconomic status. We find that social deprivation exhibits a positive impact on the risk of admission and a negative effect on the expected length of future hospital stays of admitted patients. Copyright © 2015 John Wiley & Sons, Ltd.
    Article · Mar 2015 · Journal of Applied Econometrics
    • "Consequently, Open Data is considered an important issue in several scientific communities for some time now [2] and very recently it is in debate in the biomedical area [3], [4]. A big asset of open data in research is that it allows to build on the work of others more efficiently and helps to speed the progress of science [5]; because to build on previous discoveries, there must be trust in the validity of prior research – but most important, it facilitates trust between researchers and with the public, due to the fact that it allows secondary analyses that expand the usefulness of datasets and the resulting knowledge gained [6]. Postmarket surveillance for adverse events is an essential component of every national and regional health system for assuring drug safety [7]. The Food and Drug Administration (FDA) in USA is responsible not only for approving drugs for marketing but also for monitoring their safety after they reach the market."
    ABSTRACT: In this paper, we present a study to discover hidden patterns in the publicly released reports of the Food and Drug Administration (FDA)’s Adverse Event Reporting System (AERS) for the drug alendronate (Fosamax). Alendronate is a widely used medication for the treatment of osteoporosis. Osteoporosis is recognised as an important public health problem because of the significant morbidity, mortality and costs of treatment. We consider the importance of alendronate for medical research and explore the relationship between patient demographic information, adverse event outcomes and the drug’s adverse events. We analyze the FDA’s AERS reports, which cover the period from the third quarter of 2005 through the second quarter of 2012, and create a dataset for association analysis. Both the Apriori and Predictive Apriori algorithms are used to generate rules, and the results are interpreted and evaluated. According to the results, some interesting rules and associations are obtained from the dataset. We believe that our results can be useful for medical researchers and for decision making at pharmaceutical companies.
    Chapter · Jan 2013
    ABSTRACT: Differential privacy has gained a lot of attention in recent years as a general model for the protection of personal information when used and disclosed for secondary purposes. It has also been proposed as an appropriate model for health data. In this paper we review the current literature on differential privacy and highlight important general limitations to the model and the proposed mechanisms. We then examine some practical challenges to the application of differential privacy to health data. The review concludes by identifying areas that researchers and practitioners in this area need to address to increase the adoption of differential privacy for health data.
    Article · Jan 2012