Automated de-identification of free-text medical records

Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
BMC Medical Informatics and Decision Making (Impact Factor: 1.83). 07/2008; 8(1):32. DOI: 10.1186/1472-6947-8-32
Source: PubMed


Text-based patient medical records are a vital resource in medical research. To preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming, and error-prone, necessitating automated methods for large-scale de-identification.
We describe an automated Perl-based de-identification software package that is generally usable on most free-text medical records, such as nursing notes, discharge summaries, and X-ray reports. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI and an extended PHI set that includes doctors' names and the years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes in which real PHI was replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. The gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words, against which the algorithm's false negative rate was evaluated.
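The combination of look-up tables, regular expressions, and heuristics can be illustrated with a minimal sketch. The actual package is written in Perl; this Python version is only a simplified analogue, and the lexicon entries, patterns, and placeholder format below are illustrative assumptions, not the package's actual rules:

```python
import re

# Illustrative name lexicon (assumption: the real software uses much
# larger look-up tables of names, locations, and hospital-specific terms).
NAME_DICT = {"john", "mary", "robert"}

# Illustrative regular expressions for two HIPAA PHI categories.
PHI_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def find_phi(text):
    """Return (category, span, matched_text) tuples for suspected PHI."""
    hits = []
    for category, pattern in PHI_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((category, m.span(), m.group()))
    # Dictionary look-up heuristic: flag capitalized tokens found in the
    # name lexicon (a real system would also weigh clinical-term context).
    for m in re.finditer(r"\b[A-Z][a-z]+\b", text):
        if m.group().lower() in NAME_DICT:
            hits.append(("name", m.span(), m.group()))
    return hits

def scrub(text):
    """Replace each suspected PHI instance with a category placeholder."""
    # Replace from right to left so earlier spans stay valid.
    for category, (start, end), _ in sorted(find_phi(text),
                                            key=lambda h: h[1],
                                            reverse=True):
        text = text[:start] + f"[**{category}**]" + text[end:]
    return text
```

For example, `scrub("Pt John seen 03/04/2008, call 617-555-0123.")` yields `"Pt [**name**] seen [**date**], call [**phone**]."` — each suspected PHI instance replaced by a category tag rather than deleted outright.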
Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, a precision of 0.749, and a fallout of approximately 0.002. On the test corpus, a total of 90 false negatives were found, or 27 per 100,000 words, for an estimated recall of 0.943. Only one full date and one age over 89 were missed; no patient names were missed in either corpus.
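For reference, the three reported metrics follow directly from the confusion-matrix counts. The counts in the example below are illustrative round numbers chosen to reproduce the reported values, not the paper's actual tallies:

```python
def recall(tp, fn):
    # Fraction of true PHI instances that were detected.
    return tp / (tp + fn)

def precision(tp, fp):
    # Fraction of flagged instances that were truly PHI.
    return tp / (tp + fp)

def fallout(fp, tn):
    # Fraction of non-PHI tokens incorrectly flagged (false positive rate).
    return fp / (fp + tn)

# Illustrative counts only: 967 of 1,000 PHI found, 251 false alarms
# per 1,000 flags, 2 false positives per 1,000 non-PHI tokens.
assert abs(recall(967, 33) - 0.967) < 1e-9
assert abs(precision(749, 251) - 0.749) < 1e-9
assert abs(fallout(2, 998) - 0.002) < 1e-9
```

Note the asymmetry the results exhibit: for de-identification, recall (missed PHI) matters far more than precision (over-scrubbed non-PHI), since a false negative leaks patient information while a false positive merely removes a harmless word.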
We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software outperforms a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently general to be customized for text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient on its own to permit public dissemination of medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements to the algorithm.

Available from: Li-wei H Lehman, Sep 14, 2014
    • "Rule-based methods need experts to construct rules or patterns manually to identify PHI entities. For example, Neamatullah et al. [4] proposed a pattern-matching de-identification system for nursing progress notes, which are markedly less structured than discharge summaries, using dictionary look-ups, regular expressions, and heuristics. In nursing progress notes, technical terminology, non-standard abbreviations, ungrammatical statements, misspellings, incorrect punctuation, and capitalization errors occur frequently."
    ABSTRACT: De-identification is a shared task of the 2014 i2b2/UTHealth challenge. The purpose of this task is to remove protected health information (PHI) from medical records. In this paper, we propose a novel de-identifier, WI-deId, based on conditional random fields (CRFs). A preprocessing module, which tokenizes the medical records using regular expressions and an off-the-shelf tokenizer, is introduced, and three groups of features are extracted to train the de-identifier model. The experiment shows that our system is effective in the de-identification of medical records, achieving a micro-F1 of 0.9232 at the i2b2 strict entity evaluation level.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.012 · 2.19 Impact Factor
    • "Firstly, to get an idea of how many PHI instances were included in the whole set of EDT EMRs, we ran the state-of-the-art de-identification toolkit from PhysioNet [10]. The classified PHI instances were reported; details are in Section 4.3."
    ABSTRACT: In clinical NLP, one major barrier to adopting crowdsourcing for NLP annotation is the confidentiality of protected health information (PHI) in clinical narratives. In this paper, we investigated the use of a frequency-based approach to extract sentences without PHI. Our approach is based on the assumption that sentences appearing frequently tend to contain no PHI. Both manual and automatic evaluations of 500 sentences drawn from the 7.9 million sentences with frequencies higher than one found no PHI among them. These promising results suggest the potential of releasing such sentences for obtaining sentence-level NLP annotations via crowdsourcing.
    Studies in health technology and informatics 08/2015; 216:1033-4.
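The frequency-based idea quoted above — keep only sentences whose normalized form recurs across the corpus, on the assumption that patient-specific details make a sentence unique — can be sketched as follows. This is a simplification of that paper's method; the function name, normalization, and threshold are illustrative assumptions:

```python
from collections import Counter

def phi_free_candidates(sentences, min_count=2):
    """Return sentences whose normalized form repeats across the corpus.

    Rationale: a sentence appearing verbatim in many different notes is
    unlikely to contain patient-specific PHI.
    """
    counts = Counter(s.strip().lower() for s in sentences)
    return [s for s, n in counts.items() if n >= min_count]

notes = [
    "Vital signs stable.",                     # boilerplate, recurs
    "vital signs stable.",                     # same after normalization
    "Mr. Smith complained of chest pain.",     # unique, may hold PHI
]
candidates = phi_free_candidates(notes)        # ["vital signs stable."]
```

A real pipeline would still need the manual and automatic checks the paper describes, since frequency alone is a heuristic, not a guarantee of PHI absence.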
    • "However, all these approaches assume that sensitive attributes or patterns are known and do not consider links between attributes. Other proposals have been devoted to the sanitization of free text, mainly in the medical domain (Neamatullah et al., 2008). However, the problem is different in free text and consists essentially in identifying sensitive words based on specialized domain semantics."
    ABSTRACT: We propose an innovative approach, and its implementation as an expert system, to achieve the semi-automatic detection of candidate attributes for scrambling sensitive data. Our approach is based on semantic rules that determine which concepts have to be scrambled, and on a linguistic component that retrieves the attributes that semantically correspond to these concepts. Because attributes cannot be considered independently of each other, we also address the challenging problem of propagating the scrambling process through the entire database. One main contribution of our approach is to provide a semi-automatic process for the detection of sensitive data. The underlying knowledge is made available through production rules, operationalizing the detection of the sensitive data. A validation of our approach using four different databases is provided.
    Information Resources Management Journal 10/2014; 27(4):23-44. DOI:10.4018/irmj.2014100102