De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset

Electronic Health Information Laboratory, CHEO Research Institute Inc., 401 Smyth Road, Ottawa, ON, Canada.
Journal of Medical Internet Research (Impact Factor: 4.67). 02/2012; 14(1):e33. DOI: 10.2196/jmir.2001
Source: PubMed

ABSTRACT There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013.
Our objective was to de-identify the claims data used in the HHP competition and to ensure that the released dataset meets the requirements of the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. Three plausible re-identification attacks that can be executed on these data were identified. For each attack the re-identification probability was evaluated. If it was deemed too high then a new de-identification algorithm was applied to reduce the risk to an acceptable level. We performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack.
An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was 0.0084, below the 0.05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions.
It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and methodology for, achieving open data principles for longitudinal health data.
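
The abstract reports a re-identification probability and compares it with a 0.05 threshold, but it does not say how that probability was estimated. As a rough illustration only, the sketch below assumes a common equivalence-class approach: records that share the same quasi-identifier values form a class, and each record's risk is taken as 1 divided by its class size. The function names, the quasi-identifier fields, and the choice of average versus maximum risk are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Hashable, List, Sequence

# Threshold probability stated in the abstract.
THRESHOLD = 0.05

Record = Sequence[Hashable]


def record_risks(records: Sequence[Record]) -> List[float]:
    """Assumed risk model: each record's risk is 1 / (size of its equivalence class).

    An equivalence class is the set of records with identical
    quasi-identifier values.
    """
    class_sizes = Counter(tuple(r) for r in records)
    return [1.0 / class_sizes[tuple(r)] for r in records]


def average_risk(records: Sequence[Record]) -> float:
    """Mean per-record risk (equals number of classes / number of records)."""
    risks = record_risks(records)
    return sum(risks) / len(risks)


def max_risk(records: Sequence[Record]) -> float:
    """Risk of the most exposed record, i.e. 1 / (smallest class size)."""
    return max(record_risks(records))


def meets_threshold(risk: float, threshold: float = THRESHOLD) -> bool:
    """True when the estimated risk does not exceed the threshold."""
    return risk <= threshold


# Hypothetical quasi-identifiers: (age band, sex).
sample = [("40-49", "F")] * 4 + [("50-59", "M")] * 2
print(average_risk(sample), max_risk(sample))
print(meets_threshold(average_risk(sample)))
```

In this toy sample the classes are far too small to pass the threshold; generalizing or suppressing quasi-identifier values until the classes grow is the kind of step a de-identification algorithm would then take.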

  • Source
    ABSTRACT: This paper presents the architecture of a health data search engine, along with preliminary findings that demonstrate the feasibility of the approach taken. The work is motivated by the need to incorporate information about similar patients into clinical decision making, and by the need to develop a tool that can search for similar patients in health data repositories. Central to the design of the search engine is the use of clustering analysis within health data repositories to ensure that responses to queries consist of data summaries that do not violate the confidentiality of patient records. Recent results concerning the feasibility of this search engine approach are reviewed. These results speak to the relative ease of creating clinically meaningful summaries of patient types, and to the accuracy of predictions made using the summarized data. The paper concludes with a brief discussion of further work required to implement a health data search engine and to demonstrate its effectiveness.
  • Source
    ABSTRACT: De-identification of personal health information is essential if records are to be used without written informed consent from patients. Previous de-identification methods used natural language processing to remove identifiers from clinical narrative text, but they focused only on narrative text written in English. In this study, we propose a regular expression-based de-identification method for bilingual clinical records written in Korean and English. To develop and validate regular expression rules, we obtained training and validation datasets composed of 6,039 clinical notes of 20 types and 5,000 notes of 33 types, respectively. Fifteen regular expression rules were constructed using the development dataset and those rules achieved 99.87% precision and 96.25% recall for the validation dataset. Our de-identification method successfully removed the identifiers in diverse types of bilingual clinical narrative texts. This method will thus help physicians perform retrospective research more easily.
    Journal of Korean Medical Science 01/2015; 30(1):7-15. DOI:10.3346/jkms.2015.30.1.7 · 1.25 Impact Factor