Automating the Assignment of Diagnosis Codes to Patient Encounters Using Example-based and Machine Learning Techniques

Division of Biomedical Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.5). 09/2006; 13(5):516-25. DOI: 10.1197/jamia.M2077
Source: PubMed


Human classification of diagnoses is a labor-intensive process that consumes significant resources. Most medical practices use specially trained medical coders to categorize diagnoses for billing and research purposes.
We have developed an automated coding system designed to assign codes to clinical diagnoses. The system uses the notion of certainty to recommend subsequent processing. Codes with the highest certainty are generated by matching the diagnostic text to frequent examples in a database of 22 million manually coded entries. These code assignments are not subject to subsequent manual review. Codes at a lower certainty level are assigned by matching to examples that were previously coded only infrequently. The least certain codes are generated by a naïve Bayes classifier. The latter two types of codes are subsequently reviewed manually.
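The tiered design can be pictured as a simple routing function: try an exact match against the coded archive first, use the match frequency to decide whether manual review is needed, and fall back to the statistical classifier otherwise. The sketch below is a minimal illustration under assumed names (the frequency cutoff, the example_db structure, and the bayes_classifier interface are placeholders, not the authors' implementation):

```python
from collections import Counter

# Hypothetical cutoff separating "frequent" from "infrequent" examples;
# the paper does not publish the exact threshold used in production.
FREQUENT_CUTOFF = 100

def assign_code(diagnosis_text, example_db, bayes_classifier):
    """Route one diagnosis through the three certainty tiers.

    example_db maps normalized diagnostic text to a Counter of codes
    previously assigned to that exact text in the manually coded archive.
    """
    key = diagnosis_text.strip().lower()
    history = example_db.get(key)

    if history:
        code, count = history.most_common(1)[0]
        if count >= FREQUENT_CUTOFF:
            # Highest certainty: frequent exact match, no manual review.
            return code, "auto"
        # Lower certainty: infrequent match, routed to a verifier.
        return code, "review"

    # Least certain: fall back to a statistical classifier
    # (a naive Bayes model in the paper), also routed to a verifier.
    return bayes_classifier.predict(diagnosis_text), "review"

# Example lookup-table entry (invented counts and codes):
# example_db = {"type 2 diabetes mellitus": Counter({"250.00": 1500, "250.02": 3})}
```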
Standard information retrieval accuracy measures of precision, recall, and f-measure were used, and both micro- and macro-averaged results were computed. At least 48% of all EMR problem list entries at the Mayo Clinic can be automatically classified with macro-averaged 98.0% precision, 98.3% recall, and a 98.2% f-score. An additional 34% of the entries are classified with macro-averaged 90.1% precision, 95.6% recall, and a 93.1% f-score. The remaining 18% of the entries are classified with a macro-averaged score of 58.5%.
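For reference, macro-averaging computes precision, recall, and F-measure per code and then averages the per-code scores, whereas micro-averaging pools the raw counts across codes before computing the metrics once. A minimal sketch (illustrative only, not the paper's evaluation code):

```python
def per_code_prf(tp, fp, fn):
    """Precision, recall, and F1 for a single code from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_average(counts):
    """Average the per-code metrics, weighting every code equally."""
    scores = [per_code_prf(tp, fp, fn) for tp, fp, fn in counts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

def micro_average(counts):
    """Pool the raw counts across all codes, then compute the metrics once."""
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return per_code_prf(tp, fp, fn)

# counts is a list of (tp, fp, fn) tuples, one per diagnosis code.
print(macro_average([(90, 2, 3), (40, 10, 5)]))
print(micro_average([(90, 2, 3), (40, 10, 5)]))
```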
Over two thirds of all diagnoses are coded automatically with high accuracy. The system has been successfully implemented at the Mayo Clinic, which resulted in a reduction of staff engaged in manual coding from thirty-four coders to seven verifiers.

    • "This approach is feasible for small code set but is questionable in reallife settings where thousands of codes need to be considered. Similar to our scheme, Pakhomov et al. [19] is the first work that attempts to improve the coding performance by combing the advantages of rule-based and machine learning approaches. It describes Autocoder, an automatic encoding system implemented at Mayo clinic. "
    ABSTRACT: The vocabulary gap between health seekers and providers has hindered cross-system operability and inter-user reusability. To bridge this gap, this paper presents a novel scheme to code medical records by jointly utilizing local mining and global learning approaches, which are tightly linked and mutually reinforcing. Local mining attempts to code an individual medical record by independently extracting the medical concepts from the record itself and then mapping them to authenticated terminologies. A corpus-aware terminology vocabulary is naturally constructed as a byproduct and is used as the terminology space for global learning. The local mining approach, however, may suffer from information loss and lower precision, caused by the absence of key medical concepts and the presence of irrelevant ones. Global learning, on the other hand, works towards enhancing the local medical coding by collaboratively discovering missing key terminologies and keeping out irrelevant terminologies through analysis of social neighbors. Comprehensive experiments validate the proposed scheme and each of its components. Practically, this unsupervised scheme holds potential for large-scale data.
    IEEE Transactions on Knowledge and Data Engineering 02/2015; 27(2):396-409. DOI:10.1109/TKDE.2014.2330813 · 2.07 Impact Factor
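A rough sketch of the two-stage idea described in the abstract above; the substring-based concept matching, the neighbor structure, and the support threshold are simplified assumptions for illustration, not the authors' actual local mining and global learning algorithms:

```python
def local_mining(record_text, terminology):
    """Map concepts mentioned in one medical record to authenticated terms
    by simple substring matching (a stand-in for real concept extraction)."""
    text = record_text.lower()
    return {term for term in terminology if term in text}

def global_learning(local_terms, neighbor_terms, min_support=0.5):
    """Refine the locally mined terms using records of social neighbors:
    add terms most neighbors share, drop terms no neighbor supports."""
    if not neighbor_terms:
        return local_terms
    support = {}
    for terms in neighbor_terms:
        for term in terms:
            support[term] = support.get(term, 0) + 1
    threshold = min_support * len(neighbor_terms)
    widely_shared = {t for t, c in support.items() if c >= threshold}
    locally_supported = {t for t in local_terms if support.get(t, 0) > 0}
    return locally_supported | widely_shared

# Toy usage with an invented terminology and neighbor records.
vocab = {"asthma", "bronchitis", "hypertension"}
local = local_mining("wheezing and asthma attacks at night", vocab)
neighbors = [{"asthma", "bronchitis"}, {"asthma"}, {"bronchitis"}]
print(global_learning(local, neighbors))  # locally mined "asthma" plus the widely shared "bronchitis"
```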
    • "Clinical documentations in computer-based records are found to be more complete and appropriate for clinical decisions than those in paper-based records [18]. Likewise, automated coding and classification encompasses a variety of computerbased approaches, that are faster, reduce error rates, and are more efficient and accurate [4] [19] [20] [21]. Similarly, improvement in clinical documentation will be necessary to ensure complete automated coding [22] "
    ABSTRACT: Background: Clinical coding is an integral part of health information management (HIM) practice which provides valuable data for healthcare quality evaluation, health resource allocation, health services research, medical billing, public health programming, and Case-Mix/DRG funding. The International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) is a veritable tool for the effectiveness of clinical coding practices. Objective: This study determined implementation levels of ICD-10 as well as ICD-10-PCS and clinical coding practices in both public and for-profit hospitals in Nigeria. Methods: We used Chi-square (χ²) and Cramér's V (φc) to assess the level of association between type of workplace and implementation of ICD-10 and clinical coding practices. Statistical significance was set at .05. Results: The study found nationwide implementation of ICD-10 (179, 88.2%) and fair adoption of its procedure counterpart (79, 38.9%). Most hospitals in Nigeria, especially for-profit facilities (3, 100%) and tertiary healthcare settings (148, 93.1%), employed HIM professionals (214, 91.5%) to manage their clinical coding processes. Conversely, the study observed that the challenges confronting clinical coding processes were enormous. Notable among these were absence of automation (70, 34.5%), lack of political will (51, 48.1%), inadequate clinical coders (153, 74.4%), and suboptimal documentation (186, 91.6%). Suggestions to improve clinical coding practices range from continuing professional coding education (33, 10.3%) to initiation of a Nigerian modification of the ICD such that ICD-10 would become ICD-10-NGM (1, 0.3%). Conclusion: Most healthcare systems in Nigeria have implemented ICD-10 for coding and classification of diagnoses and procedures, and the process is being managed by the right workforce (i.e., HIM professionals), which bodes well for effectiveness. However, lack of political will, an inadequate and unmotivated workforce, and suboptimal clinical documentation were among the challenges confronting the practice in Nigeria. Therefore, this study suggests advocacy and coding education with a view to changing the orientation of all stakeholders and sensitizing relevant authorities to the benefits of clinical coding practices in order to maximize its outcome and, in effect, improve public health in the country. Keywords: Automated Coding, Clinical Coding, Clinical Documentation, Data Quality, Discharge Summary, Health Information Technology, Health Information Management Professionals, ICD-10
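As an illustration of the association test named in the Methods above, chi-square and Cramér's V can be computed from a contingency table as follows; the counts below are invented for the example and are not the study's data:

```python
import math
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = workplace type (public, for-profit),
# columns = ICD-10 implemented (yes, no). Counts are made up.
table = np.array([[150, 26],
                  [29, 3]])

chi2, p_value, dof, expected = chi2_contingency(table)

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = table.sum()
cramers_v = math.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, Cramer's V = {cramers_v:.3f}")
```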
    • "One pertinent example is the automatic categorization of informally written medical diagnoses, followed by the extraction of epidemiological information or even terms and structures needed to formulate guiding questions as a heuristic tool for helping doctors. Vector space models including LSA have been successfully used to this end (Lee et al., 2006; Pakhomov et al., 2006). Nonetheless, results from this type of models are at the mercy of the vectorial dynamics involved and the representational bias of some terms. "
    ABSTRACT: There is currently widespread interest in indexing and extracting taxonomic information from large text collections. An example is the automatic categorization of informally written medical or psychological diagnoses, followed by the extraction of epidemiological information or even terms and structures needed to formulate guiding questions as a heuristic tool for helping doctors. Vector space models have been successfully used to this end (Lee, Cimino, Zhu, Sable, Shanker, Ely & Yu, 2006; Pakhomov, Buntrock & Chute, 2006). In this study we use a computational model known as Latent Semantic Analysis (LSA) on a diagnostic corpus with the aim of retrieving definitions (in the form of lists of semantic neighbors) of common structures it contains (e.g. "storm phobia", "dog phobia") or less common structures that might be formed by logical combinations of categories and diagnostic symptoms (e.g. "gun personality" or "germ personality"). In the quest to bring definitions into line with the meaning of structures and make them in some way representative, various problems commonly arise while recovering content using vector space models. We propose some approaches that bypass these problems, such as Kintsch's (2001) predication algorithm and some corrections to the way lists of neighbors are obtained, which have already been tested on semantic spaces in a non-specific domain (Jorge-Botana, León, Olmos & Hassan-Montero, under review). The results support the idea that the predication algorithm may also be useful for extracting more precise meanings of certain structures from scientific corpora, and that the introduction of some corrections based on vector length may increase its efficiency on non-representative terms.
    The Spanish Journal of Psychology 11/2009; 12(2):424-40. DOI:10.1017/S1138741600001815 · 0.74 Impact Factor
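A compact sketch of LSA-style neighbor retrieval on a toy corpus; the documents, the number of latent dimensions, and the query are placeholders, and Kintsch's predication algorithm and the vector-length corrections discussed in the abstract are not reproduced here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy diagnostic snippets standing in for the diagnostic corpus.
docs = [
    "patient reports intense fear of storms and thunder",
    "dog phobia with avoidance of parks and pets",
    "compulsive hand washing driven by fear of germs",
    "panic attacks triggered by crowded places",
]

# Build a term-document matrix and reduce it to a low-rank latent space.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)

# Project a query structure ("storm phobia") into the same space and rank
# the documents by cosine similarity to obtain its semantic neighbors.
query_vec = lsa.transform(tfidf.transform(["storm phobia"]))
sims = cosine_similarity(query_vec, doc_vectors)[0]
for idx in sims.argsort()[::-1]:
    print(f"{sims[idx]:.2f}  {docs[idx]}")
```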