De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset

Article in Journal of Medical Internet Research 14(1):e33 · February 2012
DOI: 10.2196/jmir.2001 · Source: PubMed
Abstract
Background: There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data, and there is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, from claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. The HHP began on April 4, 2011, and ends on April 3, 2013.

Objective: To de-identify the claims data used in the HHP competition and ensure that it meets the requirements of the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.

Methods: We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. Three plausible re-identification attacks that could be executed on these data were identified. For each attack the re-identification probability was evaluated; if it was deemed too high, a new de-identification algorithm was applied to reduce the risk to an acceptable level. We then performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data could be re-identified given an attempted attack.

Results: An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was .0084, below the .05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions.

Conclusions: It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and a methodology for, achieving open data principles for longitudinal health data.
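To make the risk metric concrete, here is a minimal sketch of how per-record re-identification risk is commonly estimated from equivalence-class sizes on quasi-identifiers. The function name, field names, and toy records are hypothetical; the paper's actual evaluation modeled three specific attacks with matching experiments, which this sketch does not reproduce.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Estimate each record's re-identification risk as 1/f, where f is
    the size of its equivalence class on the quasi-identifiers, then
    average across records. A common simplification of the record-level
    metric; not the HHP team's exact attack models."""
    def key(rec):
        return tuple(rec[q] for q in quasi_identifiers)

    class_sizes = Counter(key(rec) for rec in records)
    risks = [1.0 / class_sizes[key(rec)] for rec in records]
    return sum(risks) / len(risks)

# Hypothetical toy claims records, not the actual HHP schema.
records = [
    {"age_group": "70-79", "sex": "F", "days_in_hospital": 3},
    {"age_group": "70-79", "sex": "F", "days_in_hospital": 0},
    {"age_group": "60-69", "sex": "M", "days_in_hospital": 1},
]
risk = reidentification_risk(records, ["age_group", "sex"])
print(f"average risk: {risk:.4f}")  # release only if below the 0.05 threshold
```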
    • "Where |L()| is the number of leaves in the subtree rooted at x.The information loss of each event e is then defined as (El Emam, K., 2012 "
    ABSTRACT: Recent improvements in information technology have increased the demand for collecting and sharing data. However, privacy concerns are unavoidable in such cases. Privacy-preserving data publishing is a research area that tries to generalize data before publishing in order to safeguard sensitive information while retaining the complete data for its required purpose. A database consists of several kinds of attributes: identifier attributes, which exactly identify the record owners and are typically removed from the released data; quasi-identifier (QI) attributes, which could be linked with external information to re-identify individual records of data owners; and sensitive attributes, which must be protected. It is important to anonymize QI attributes so that individual records cannot be re-identified, balancing privacy and data utility. The objective of data anonymization techniques is to modify data such that no sensitive information about individuals can be disclosed from the published data while data distortion is minimized to ensure the usefulness of the data in practice. However, data anonymization affects the utility of data, as generalization and grouping lead to information loss. Usually, generalization and suppression on the values of QI attributes are used to modify the original data. This anonymization process causes information to be either replaced with coarse-grained values or totally removed. In this paper we analyze the information loss of three different datasets with respect to different anonymization levels and attempt to find which level of anonymization gives optimal information loss, so that both privacy and data utility will be high.
    Article · Jan 2016 · Scientific Research and Essays
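The truncated formula in the quote above appears to refer to a standard taxonomy-based loss metric, in which the penalty for generalizing a value up to a node x grows with the number of leaves |L(x)| under x. The sketch below assumes the common formulation loss(x) = (|L(x)| - 1) / (|L(root)| - 1); it is not the exact definition elided in the quote, and all names are illustrative.

```python
class Node:
    """Node in a generalization taxonomy (e.g., diagnosis codes)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def leaves(x):
    """|L(x)|: number of leaves in the subtree rooted at x."""
    if not x.children:
        return 1
    return sum(leaves(c) for c in x.children)

def information_loss(x, root):
    """Assumed loss metric for generalizing a value up to node x:
    (|L(x)| - 1) / (|L(root)| - 1). A leaf loses nothing; generalizing
    to the root (full suppression) loses everything."""
    return (leaves(x) - 1) / (leaves(root) - 1)

# Toy taxonomy: any -> {cardiac -> {c1, c2}, renal -> {r1, r2}}
cardiac = Node("cardiac", [Node("c1"), Node("c2")])
renal = Node("renal", [Node("r1"), Node("r2")])
root = Node("any", [cardiac, renal])
print(information_loss(cardiac, root))  # 1/3: partially generalized
print(information_loss(root, root))     # 1.0: fully suppressed
```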
    • "Where |L()| is the number of leaves in the subtree rooted at x.The information loss of each event e is then defined as (El Emam, K., 2012 "
    Article · Jan 2016 · Scientific research and essays
    • "De-identification refers to a set of methods that can be applied to data to ensure that the probability of assigning a correct identity to a record in the data is very low (El Emam and Fineberg, 2009; El Emam et al., 2011). Recent studies (Bayardo and Agrawal, 2005; Campan and Truta, 2009; El Emam et al., 2012; El Emam and Dankar, 2008; Goryczka et al., 2011; Jurczyk and Xiong, 2009; LeFevre et al., 2005; Parmar et al., 2011; Sacharidis et al., 2010; Sokolova et al., 2012; Sweeney, 2002a, b; Tassa and Gudes, 2012; Truta and Vinay, 2006) indicate that the K-anonymity model provides a formal way of generalizing this concept because K-anonymity provides a measure of privacy protection by preventing the re-identification of data to fewer than a group of K data items. As stated in Sweeney and Samarati (Samarati, 2001;Sweeney, 2002a, b), a data record is K anonymous if and only if it is indistinguishable from its identifying information from at least K-specific records or entities. "
    Article · Mar 2015
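As a quick illustration of the K-anonymity definition quoted above, here is a minimal sketch that checks whether a table satisfies K-anonymity on a set of quasi-identifiers. The function and field names are hypothetical, not from the cited works.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values is
    shared by at least k records, i.e., each record is indistinguishable
    from at least k-1 others on those attributes."""
    groups = Counter(
        tuple(rec[q] for q in quasi_identifiers) for rec in records
    )
    return all(count >= k for count in groups.values())

# Toy table: 2-anonymous on (age_group, zip3), since each QI combination
# appears at least twice.
table = [
    {"age_group": "30-39", "zip3": "451", "diagnosis": "flu"},
    {"age_group": "30-39", "zip3": "451", "diagnosis": "asthma"},
    {"age_group": "40-49", "zip3": "452", "diagnosis": "flu"},
    {"age_group": "40-49", "zip3": "452", "diagnosis": "diabetes"},
]
print(is_k_anonymous(table, ["age_group", "zip3"], 2))  # True
```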