-
ABSTRACT: OBJECTIVE: To employ machine learning methods to predict the eventual therapeutic response of breast cancer patients after a single cycle of neoadjuvant chemotherapy (NAC). MATERIALS AND METHODS: Quantitative dynamic contrast-enhanced MRI and diffusion-weighted MRI data were acquired on 28 patients before and after one cycle of NAC. A total of 118 semiquantitative and quantitative parameters were derived from these data and combined with 11 clinical variables. We used Bayesian logistic regression in combination with feature selection using a machine learning framework for predictive model building. RESULTS: The best predictive models using feature selection obtained an area under the curve of 0.86 and an accuracy of 0.86, with a sensitivity of 0.88 and a specificity of 0.82. DISCUSSION: With the numerous options for NAC available, development of a method to predict response early in the course of therapy is needed. Unfortunately, by the time most patients are found not to be responding, their disease may no longer be surgically resectable, and this situation could be avoided by the development of techniques to assess response earlier in the treatment regimen. The method outlined here is one possible solution to this important clinical problem. CONCLUSIONS: Predictive modeling approaches based on machine learning using readily available clinical and quantitative MRI data show promise in distinguishing breast cancer responders from non-responders after the first cycle of NAC.
Journal of the American Medical Informatics Association 04/2013; · 3.61 Impact Factor
-
ABSTRACT: OBJECTIVE: To create a computable MEDication Indication resource (MEDI) to support primary and secondary use of electronic medical records (EMRs). MATERIALS AND METHODS: We processed four public medication resources, RxNorm, Side Effect Resource (SIDER) 2, MedlinePlus, and Wikipedia, to create MEDI. We applied natural language processing and ontology relationships to extract indications for prescribable, single-ingredient medication concepts and all ingredient concepts as defined by RxNorm. Indications were coded as Unified Medical Language System (UMLS) concepts and International Classification of Diseases, 9th edition (ICD9) codes. A total of 689 extracted indications were randomly selected for manual review for accuracy using dual-physician review. We identified a subset of medication-indication pairs that optimizes recall while maintaining high precision. RESULTS: MEDI contains 3112 medications and 63 343 medication-indication pairs. Wikipedia was the largest resource, with 2608 medications and 34 911 pairs. For each resource, estimated precision and recall, respectively, were 94% and 20% for RxNorm, 75% and 33% for MedlinePlus, 67% and 31% for SIDER 2, and 56% and 51% for Wikipedia. The MEDI high-precision subset (MEDI-HPS) includes indications found within either RxNorm or at least two of the three other resources. MEDI-HPS contains 13 304 unique indication pairs regarding 2136 medications. The mean±SD number of indications for each medication in MEDI-HPS is 6.22±6.09. The estimated precision of MEDI-HPS is 92%. CONCLUSIONS: MEDI is a publicly available, computable resource that links medications with their indications as represented by concepts and billing codes. MEDI may benefit clinical EMR applications and reuse of EMR data for research.
Journal of the American Medical Informatics Association 04/2013; · 3.61 Impact Factor
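The MEDI-HPS selection rule stated in the abstract above (keep an indication if it is found in RxNorm or in at least two of the three other resources) can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name, the dictionary layout, and the toy medication-indication pairs are all hypothetical.

```python
# Hypothetical sketch of the MEDI-HPS rule: RxNorm OR >= 2 of the other 3 resources.
OTHER_RESOURCES = ("MedlinePlus", "SIDER 2", "Wikipedia")


def high_precision_subset(pairs_by_resource):
    """Return the (medication, indication) pairs that satisfy the rule.

    pairs_by_resource maps a resource name to a set of (medication, indication)
    tuples; this layout is an assumption made for the example.
    """
    rxnorm_pairs = pairs_by_resource.get("RxNorm", set())
    others = [pairs_by_resource.get(name, set()) for name in OTHER_RESOURCES]

    selected = set(rxnorm_pairs)  # every RxNorm pair qualifies on its own
    for pair in set().union(*others):
        support = sum(pair in resource for resource in others)
        if support >= 2:  # found in at least two of the three other resources
            selected.add(pair)
    return selected


# Toy data, invented for illustration only.
toy = {
    "RxNorm": {("drug_a", "indication_x")},
    "MedlinePlus": {("drug_b", "indication_y"), ("drug_c", "indication_z")},
    "SIDER 2": {("drug_b", "indication_y")},
    "Wikipedia": {("drug_c", "indication_z"), ("drug_d", "indication_w")},
}
subset = high_precision_subset(toy)
```

In the toy data, the pair supported only by Wikipedia is excluded, which mirrors how the rule trades recall for precision.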
-
ABSTRACT: OBJECTIVE: To develop a comprehensive temporal information extraction system that can identify events, temporal expressions, and their temporal relations in clinical text. This project was part of the 2012 i2b2 clinical natural language processing (NLP) challenge on temporal information extraction. MATERIALS AND METHODS: The 2012 i2b2 NLP challenge organizers manually annotated 310 clinic notes according to a defined annotation guideline: a training set of 190 notes and a test set of 120 notes. All participating systems were developed on the training set and evaluated on the test set. Our system consists of three modules: event extraction, temporal expression extraction, and temporal relation (also called Temporal Link, or 'TLink') extraction. The TLink extraction module contains three individual classifiers for TLinks: (1) between events and section times, (2) within a sentence, and (3) across different sentences. The performance of our system was evaluated using scripts provided by the i2b2 organizers. Primary measures were micro-averaged Precision, Recall, and F-measure. RESULTS: Our system was among the top ranked. It achieved F-measures of 0.8659 for temporal expression extraction (ranked fourth), 0.6278 for end-to-end TLink track (ranked first), and 0.6932 for TLink-only track (ranked first) in the challenge. We subsequently investigated different strategies for TLink extraction, and were able to marginally improve performance with an F-measure of 0.6943 for TLink-only track.
Journal of the American Medical Informatics Association 04/2013; · 3.61 Impact Factor
-
ABSTRACT: OBJECTIVE: To evaluate the validity of, characterize the usage of, and propose potential research applications for International Classification of Diseases, Ninth Revision (ICD-9) tobacco codes in clinical populations. MATERIALS AND METHODS: Using data on cancer cases and cancer-free controls from Vanderbilt's biorepository, BioVU, we evaluated the utility of ICD-9 tobacco use codes to identify ever-smokers in general and high smoking prevalence (lung cancer) clinic populations. We assessed potential biases in documentation, and performed temporal analysis relating transitions between smoking codes to smoking cessation attempts. We also examined the suitability of these codes for use in genetic association analyses. RESULTS: ICD-9 tobacco use codes can identify smokers in a general clinic population (specificity of 1, sensitivity of 0.32), and there is little evidence of documentation bias. Frequency of code transitions between 'current' and 'former' tobacco use was significantly correlated with initial success at smoking cessation (p<0.0001). Finally, code-based smoking status assignment is a comparable covariate to text-based smoking status for genetic association studies. DISCUSSION: Our results support the use of ICD-9 tobacco use codes for identifying smokers in a clinical population. Furthermore, with some limitations, these codes are suitable for adjustment of smoking status in genetic studies utilizing electronic health records. CONCLUSIONS: Researchers should not be deterred by the unavailability of full-text records to determine smoking status if they have ICD-9 code histories.
Journal of the American Medical Informatics Association 02/2013; · 3.61 Impact Factor
-
ABSTRACT: OBJECTIVES: To assess whether active learning strategies can be integrated with supervised word sense disambiguation (WSD) methods, thus reducing the number of annotated samples while maintaining or improving the quality of disambiguation models. METHODS: We developed support vector machine (SVM) classifiers to disambiguate 197 ambiguous terms and abbreviations in the MSH WSD collection. Three different uncertainty sampling-based active learning algorithms were implemented with the SVM classifiers and were compared with a passive learner (PL) based on random sampling. For each ambiguous term and each learning algorithm, a learning curve that plots the accuracy computed from the test set as a function of the number of annotated samples used in the model was generated. The area under the learning curve (ALC) was used as the primary metric for evaluation. RESULTS: Our experiments demonstrated that active learners (ALs) significantly outperformed the PL, showing better performance for 177 out of 197 (89.8%) WSD tasks. Further analysis showed that to achieve an average accuracy of 90%, the PL needed 38 annotated samples, while the ALs needed only 24, a 37% reduction in annotation effort. Moreover, we analyzed cases where active learning algorithms did not achieve superior performance and identified three causes: (1) poor models in the early learning stage; (2) easy WSD cases; and (3) difficult WSD cases, which provide useful insight for future improvements. CONCLUSIONS: This study demonstrated that integrating active learning strategies with supervised WSD methods could effectively reduce annotation cost and improve the disambiguation models.
Journal of the American Medical Informatics Association 01/2013; · 3.61 Impact Factor
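The abstract above compares uncertainty sampling-based active learners with random sampling but does not name the three variants. As a hedged illustration of the general idea, the sketch below implements least-confidence sampling, one common form of uncertainty sampling; it is not claimed to be one of the algorithms the authors used. The function name and the probability values are hypothetical.

```python
def least_confident(pool_probabilities, batch_size=1):
    """Pick the unlabeled samples whose top predicted class is least certain.

    pool_probabilities holds one list of class probabilities per unlabeled
    sample (for example, from a classifier with calibrated outputs).
    Returns the indices of the samples to send for annotation next.
    """
    confidence = [max(probs) for probs in pool_probabilities]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:batch_size]


# Invented probabilities for three unlabeled instances of an ambiguous term.
pool = [[0.90, 0.10], [0.55, 0.45], [0.70, 0.30]]
next_to_annotate = least_confident(pool, batch_size=2)

# Relative annotation saving reported in the abstract: 38 vs 24 samples.
reduction = (38 - 24) / 38
```

A passive learner would instead draw indices at random from the pool; the saving computed on the last line rounds to the 37% quoted in the abstract.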
-
International Journal of Computational Biology and Drug Design 01/2013; 6(1-2):1-4.
-
ABSTRACT: OBJECTIVE: Medication safety requires that each drug be monitored throughout its market life, as early detection of adverse drug reactions (ADRs) can lead to alerts that prevent patient harm. Recently, electronic medical records (EMRs) have emerged as a valuable resource for pharmacovigilance. This study examines the use of retrospective medication orders and inpatient laboratory results documented in the EMR to identify ADRs. METHODS: Using 12 years of EMR data from Vanderbilt University Medical Center (VUMC), we designed a study to correlate abnormal laboratory results with specific drug administrations by comparing the outcomes of a drug-exposed group and a matched unexposed group. We assessed the relative merits of six pharmacovigilance measures used in spontaneous reporting systems (SRSs): proportional reporting ratio (PRR), reporting odds ratio (ROR), Yule's Q (YULE), the χ² test (CHI), Bayesian confidence propagation neural networks (BCPNN), and a gamma Poisson shrinker (GPS). RESULTS: We systematically evaluated the methods on two independently constructed reference standard datasets of drug-event pairs. The dataset of Yoon et al contained 470 drug-event pairs (10 drugs and 47 laboratory abnormalities). Using VUMC's EMR, we created another dataset of 378 drug-event pairs (nine drugs and 42 laboratory abnormalities). Evaluation on our reference standard showed that CHI, ROR, PRR, and YULE all had the same F score (62%). When the reference standard of Yoon et al was used, ROR had the best F score of 68%, with 77% precision and 61% recall. CONCLUSIONS: Results suggest that EMR-derived laboratory measurements and medication orders can help to validate previously reported ADRs, and detect new ADRs.
Journal of the American Medical Informatics Association 11/2012; · 3.61 Impact Factor
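Three of the measures named in the abstract above (PRR, ROR, and Yule's Q) have standard textbook definitions on a 2×2 contingency table. The sketch below computes them for a drug-exposed versus unexposed comparison. It is an illustration only: the cell layout, the function name, and the counts are assumptions, and the study's actual matching and counting procedure is not described in enough detail here to reproduce.

```python
def disproportionality(a, b, c, d):
    """Compute PRR, ROR, and Yule's Q from a 2x2 table.

    Assumed cell layout (hypothetical, for illustration):
      a = exposed,   abnormal lab result      b = exposed,   no abnormal result
      c = unexposed, abnormal lab result      d = unexposed, no abnormal result
    All four cells are assumed to be non-zero.
    """
    prr = (a / (a + b)) / (c / (c + d))      # proportional reporting ratio
    ror = (a * d) / (b * c)                  # reporting odds ratio
    yule_q = (a * d - b * c) / (a * d + b * c)  # Yule's Q, in [-1, 1]
    return {"PRR": prr, "ROR": ror, "YULE": yule_q}


# Invented counts: 20 of 100 exposed and 10 of 100 unexposed patients
# show the laboratory abnormality.
scores = disproportionality(a=20, b=80, c=10, d=90)
```

Yule's Q is a rescaling of the odds ratio, Q = (ROR - 1) / (ROR + 1), so the two measures rank drug-event pairs identically, which is consistent with several measures reaching the same F score in the abstract.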
-
ABSTRACT: OBJECTIVE: Adverse drug reaction (ADR) is one of the major causes of failure in drug development. Severe ADRs that go undetected until the post-marketing phase of a drug often lead to patient morbidity. Accurate prediction of potential ADRs is required in the entire life cycle of a drug, including early stages of drug design, different phases of clinical trials, and post-marketing surveillance. METHODS: Many studies have utilized either chemical structures or molecular pathways of the drugs to predict ADRs. Here, the authors propose a machine-learning-based approach for ADR prediction by integrating the phenotypic characteristics of a drug, including indications and other known ADRs, with the drug's chemical structures and biological properties, including protein targets and pathway information. A large-scale study was conducted to predict 1385 known ADRs of 832 approved drugs, and five machine-learning algorithms for this task were compared. RESULTS: This evaluation, based on a fivefold cross-validation, showed that the support vector machine algorithm outperformed the others. Of the three types of information, phenotypic data were the most informative for ADR prediction. When biological and phenotypic features were added to the baseline chemical information, the ADR prediction model achieved significant improvements in area under the curve (from 0.9054 to 0.9524), precision (from 43.37% to 66.17%), and recall (from 49.25% to 63.06%). Most importantly, the proposed model successfully predicted the ADRs associated with withdrawal of rofecoxib and cerivastatin. CONCLUSION: The results suggest that phenotypic information on drugs is valuable for ADR prediction. Moreover, they demonstrate that different models that combine chemical, biological, or phenotypic information can be built from approved drugs, and they have the potential to detect clinically important ADRs in both preclinical and post-marketing phases.
Journal of the American Medical Informatics Association 06/2012; 19(e1):e28-e35. · 3.61 Impact Factor
-
ABSTRACT: BACKGROUND: Extraction of clinical information such as medications or problems from clinical text is an important task of clinical natural language processing (NLP). Rule-based methods are often used in clinical NLP systems because they are easy to adapt and customize. Recently, supervised machine learning methods have proven to be effective in clinical NLP as well. However, combining different classifiers to further improve the performance of clinical entity recognition systems has not been investigated extensively. Combining classifiers into an ensemble classifier presents both challenges and opportunities to improve performance in such NLP tasks. METHODS: We investigated ensemble classifiers that used different voting strategies to combine outputs from three individual classifiers: a rule-based system, a support vector machine (SVM) based system, and a conditional random field (CRF) based system. Three voting methods were proposed and evaluated using the annotated data sets from the 2009 i2b2 NLP challenge: simple majority, local SVM-based voting, and local CRF-based voting. RESULTS: Evaluation on 268 manually annotated discharge summaries from the i2b2 challenge showed that the local CRF-based voting method achieved the best F-score of 90.84% (94.11% Precision, 87.81% Recall) for 10-fold cross-validation. We then compared our systems with the first-ranked system in the challenge by using the same training and test sets. Our system based on majority voting achieved a better F-score of 89.65% (93.91% Precision, 85.76% Recall) than the previously reported F-score of 89.19% (93.78% Precision, 85.03% Recall) by the first-ranked system in the challenge. CONCLUSIONS: Our experimental results using the 2009 i2b2 challenge datasets showed that ensemble classifiers that combine individual classifiers into a voting system could achieve better performance than a single classifier in recognizing medication information from clinical text. 
These results suggest that simple, easily implemented strategies such as majority voting have the potential to significantly improve clinical entity recognition.
BMC Medical Informatics and Decision Making 05/2012; 12(1):36. · 1.48 Impact Factor
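The simple majority voting strategy mentioned in the abstract above can be sketched as a per-token vote over the label sequences produced by three classifiers. This is a minimal illustration, not the authors' system: the label scheme, the function name, and the tie-breaking rule (fall back to one designated classifier when all three disagree) are assumptions.

```python
from collections import Counter


def majority_vote(predictions, fallback_index=0):
    """Combine per-token labels from several classifiers by simple majority.

    predictions is a list of equal-length label sequences, one per classifier.
    When no label wins a strict majority, the label from the classifier at
    fallback_index is used (a hypothetical tie-breaking choice).
    """
    voted = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        if count > len(labels) / 2:
            voted.append(label)
        else:
            voted.append(labels[fallback_index])
    return voted


# Invented outputs for a three-token span from the three system types.
rule_based = ["B-DRUG", "O", "B-DOSE"]
svm_based = ["B-DRUG", "O", "O"]
crf_based = ["O", "O", "I-DOSE"]
combined = majority_vote([rule_based, svm_based, crf_based])
```

The local SVM-based and CRF-based voting methods in the abstract replace this fixed rule with a learned combiner, which the sketch does not attempt to reproduce.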
-
ABSTRACT: Antipsychotic drugs are tranquilizing psychiatric medications primarily used in the treatment of schizophrenia and similar severe mental disorders. So far, most of these drugs have been discovered without much knowledge of the molecular mechanisms of their actions. The large amount of pharmacogenetics, pharmacometabolomics, and pharmacoproteomics data now available for many drugs makes it possible to systematically explore the molecular mechanisms underlying drug actions. In this study, we applied a unique network-based approach to investigate antipsychotic drugs and their targets. We first retrieved 43 antipsychotic drugs, 42 unique target genes, and 46 adverse drug interactions from the DrugBank database and then generated a drug-gene network and a drug-drug interaction network. Through drug-gene network analysis, we found that seven atypical antipsychotic drugs tended to form two clusters that could be defined by drugs with different target receptor profiles. In the drug-drug interaction network, we found that three drugs (zuclopenthixol, ziprasidone, and thiothixene) tended to have more adverse drug interactions than others, while clozapine had fewer adverse drug interactions. This investigation indicated that these antipsychotics might have different molecular mechanisms underlying the drug actions. This pilot network-assisted investigation of antipsychotics demonstrates that network-based analysis is useful for uncovering the molecular actions of antipsychotics.
Chemistry & Biodiversity 05/2012; 9(5):900-10. · 1.80 Impact Factor
-
ABSTRACT: Much epidemiologic information resides in the literature, which is not in a computable format. To extract information and build knowledge bases of epidemiologic studies, we developed a system to extract noun phrases about epidemiologic exposures and outcomes. The system consists of two components: a natural language processing (NLP) engine and a machine learning (ML) based classifier. Four ML algorithms were applied and compared over different feature sets. To evaluate the performance of the system, we manually constructed an annotated dataset. The system achieved the highest F-measure of 82.0% for extracting exposure terms, and 70% for extracting outcome terms.
International Journal of Data Mining and Bioinformatics 01/2012; 6(4):447-59. · 0.43 Impact Factor
-
ABSTRACT: Semantic lexicons that link words and phrases to specific semantic types such as diseases are valuable assets for clinical natural language processing (NLP) systems. Although terminological terms with predefined semantic types can be generated easily from existing knowledge bases such as the Unified Medical Language System (UMLS), they are often limited and do not have good coverage for narrative clinical text. In this study, we developed a method for building semantic lexicons from a clinical corpus. It extracts candidate semantic terms using a conditional random field (CRF) classifier and then selects terms using the C-Value algorithm. We applied the method to a corpus containing 10 years of discharge summaries from Vanderbilt University Hospital (VUH) and extracted 44,957 new terms for three semantic groups: Problem, Treatment, and Test. A manual analysis of 200 randomly selected terms not found in the UMLS demonstrated that 59% of them were meaningful new clinical concepts and 25% were lexical variants of existing concepts in the UMLS. Furthermore, we compared the effectiveness of corpus-derived and UMLS-derived semantic lexicons in the concept extraction task of the 2010 i2b2 clinical NLP challenge. Our results showed that the classifier with corpus-derived semantic lexicons as features achieved a better performance (F-score 82.52%) than that with UMLS-derived semantic lexicons as features (F-score 82.04%). We conclude that such corpus-based methods are effective for generating semantic lexicons, which may improve named entity recognition tasks and may aid in augmenting synonymy within existing terminologies.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2012; 2012:409-16.
-
ABSTRACT: Clinical Natural Language Processing (NLP) systems extract clinical information from narrative clinical texts in many settings. Previous research mentions the challenges of handling abbreviations in clinical texts, but provides little insight into how well current NLP systems recognize and interpret abbreviations. In this paper, we compared the performance of three existing clinical NLP systems in handling abbreviations: MetaMap, MedLEE, and cTAKES. The evaluation used an expert-annotated gold standard set of clinical documents (derived from 32 de-identified patient discharge summaries) containing 1,112 abbreviations. The existing NLP systems achieved suboptimal performance in abbreviation identification, with F-scores ranging from 0.165 to 0.601. MedLEE achieved the best F-score of 0.601 for all abbreviations and 0.705 for clinically relevant abbreviations. This study suggested that accurate identification of clinical abbreviations is a challenging task and that more advanced abbreviation recognition modules might improve existing clinical NLP systems.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2012; 2012:997-1003.
-
Mei Liu,
Anushi Shah,
Min Jiang,
Neeraja B Peterson,
Qi Dai,
Melinda C Aldrich,
Qingxia Chen,
Erica A Bowton,
Hongfang Liu,
Joshua C Denny, Hua Xu
ABSTRACT: Electronic Medical Records (EMRs) are valuable resources for clinical observational studies. Smoking status of a patient is one of the key factors for many diseases, but it is often embedded in narrative text. Natural language processing (NLP) systems have been developed for this specific task, such as the smoking status detection module in the clinical Text Analysis and Knowledge Extraction System (cTAKES). This study examined the transportability of the smoking module in cTAKES to the Vanderbilt University Hospital's EMR data. Our evaluation demonstrated that a modest amount of customization is necessary to achieve desirable performance. We modified the system by filtering notes, annotating new data for training the machine learning classifier, and adding rules to the rule-based classifiers. Our results showed that the customized module achieved significantly higher F-measures at all levels of classification (i.e., sentence, document, patient) compared to the direct application of the cTAKES module to the Vanderbilt data.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2012; 2012:577-86.
-
ABSTRACT: Drug responses vary greatly among individuals due to human genetic variations; the study of this variation is known as pharmacogenomics (PGx). Much of the PGx knowledge has been embedded in biomedical literature and there is a growing interest in developing text mining approaches to extract such knowledge. In this paper, we present a study to rank candidate gene-drug relations using a Latent Dirichlet Allocation (LDA) model. Our approach consists of three steps: 1) recognize gene and drug entities in MEDLINE abstracts; 2) extract candidate gene-drug pairs based on different levels of co-occurrence, including abstract level, sentence level, and phrase level; and 3) rank candidate gene-drug pairs using multiple different methods including term frequency, Chi-square test, Mutual Information (MI), a reported Kullback-Leibler (KL) distance based on topics derived from LDA (LDA-KL), and a newly defined probabilistic KL distance based on LDA (LDA-PKL). We systematically evaluated these methods by using a gold standard data set of gene-drug relations derived from PharmGKB. Our results showed that the proposed LDA-PKL method achieved better Mean Average Precision (MAP) than any other method, suggesting its promise for ranking and detecting PGx relations.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 01/2012;
-
ABSTRACT: Understanding drug bioactivities is crucial for early-stage drug discovery, toxicology studies and clinical trials. Network pharmacology is a promising approach to better understand the molecular mechanisms of drug bioactivities. With a dramatic increase of rich data sources that document drugs' structural, chemical, and biological activities, it is necessary to develop an automated tool to construct a drug-target network for candidate drugs, thus facilitating the drug discovery process.
We designed a computational workflow to construct drug-target networks from different knowledge bases including DrugBank, PharmGKB, and the PINA database. To automatically implement the workflow, we created a web-based tool called DTome (Drug-Target interactome tool), which is comprised of a database schema and a user-friendly web interface. The DTome tool utilizes web-based queries to search candidate drugs and then construct a DTome network by extracting and integrating four types of interactions. The four types are adverse drug interactions, drug-target interactions, drug-gene associations, and target-/gene-protein interactions. Additionally, we provided a detailed network analysis and visualization process to illustrate how to analyze and interpret the DTome network. The DTome tool is publicly available at http://bioinfo.mc.vanderbilt.edu/DTome.
As demonstrated with the antipsychotic drug clozapine, the DTome tool was effective and promising for the investigation of relationships among drugs, drugs with adverse interactions, primary drug targets, drug-associated genes, and proteins directly interacting with targets or genes. The resultant DTome network provides researchers with direct insights into their drug(s) of interest, such as the molecular mechanisms of drug actions. We believe such a tool can facilitate identification of drug targets and adverse drug interactions.
BMC Bioinformatics 01/2012; 13 Suppl 9:S7. · 2.75 Impact Factor
-
ABSTRACT: Supervised machine learning methods for clinical natural language processing (NLP) research require a large number of annotated samples, which are very expensive to build because of the involvement of physicians. Active learning, an approach that actively samples from a large pool, provides an alternative solution. Its major goal in classification is to reduce the annotation effort while maintaining the quality of the predictive model. However, few studies have investigated its uses in clinical NLP. This paper reports an application of active learning to a clinical text classification task: to determine the assertion status of clinical concepts. The annotated corpus for the assertion classification task in the 2010 i2b2/VA Clinical NLP Challenge was used in this study. We implemented several existing and newly developed active learning algorithms and assessed their uses. The outcome is reported in the global ALC score, based on the Area under the average Learning Curve of the AUC (Area Under the Curve) score. Results showed that when the same number of annotated samples was used, active learning strategies could generate better classification models (best ALC-0.7715) than the passive learning method (random sampling) (ALC-0.7411). Moreover, to achieve the same classification performance, active learning strategies required fewer samples than the random sampling method. For example, to achieve an AUC of 0.79, the random sampling method used 32 samples, while our best active learning algorithm required only 12 samples, a reduction of 62.5% in manual annotation effort.
Journal of Biomedical Informatics 11/2011; 45(2):265-72. · 1.79 Impact Factor
-
Kelly A Birdwell,
Ben Grady,
Leena Choi, Hua Xu,
Aihua Bian,
Josh C Denny,
Min Jiang,
Gayle Vranic,
Melissa Basford,
James D Cowan,
Danielle M Richardson,
Melanie P Robinson,
Talat Alp Ikizler,
Marylyn D Ritchie,
Charles Michael Stein,
David W Haas
ABSTRACT: Tacrolimus, an immunosuppressive drug widely prescribed in kidney transplantation, requires therapeutic drug monitoring due to its marked interindividual pharmacokinetic variability and narrow therapeutic index. Previous studies have established that CYP3A5 rs776746 is associated with tacrolimus clearance, blood concentration, and dose requirement. The importance of other drug absorption, distribution, metabolism, and elimination (ADME) gene variants has not been well characterized.
We used novel DNA biobank and electronic medical record resources to identify ADME variants associated with tacrolimus dose requirement. Broad ADME genotyping was performed on 446 kidney transplant recipients, who had been dosed to a steady state with tacrolimus. The cohort was obtained from Vanderbilt's DNA biobank, BioVU, which contains linked deidentified electronic medical record data. Genotyping included Affymetrix drug-metabolizing enzymes and transporters Plus (1936 polymorphisms), custom Sequenom Massarray iPLEX Gold assay (95 polymorphisms), and ancestry-informative markers. The primary outcome was tacrolimus dose requirement defined as blood concentration to dose ratio.
In analyses, which adjusted for race and other clinical factors, we replicated the association of tacrolimus blood concentration to dose ratio with CYP3A5 rs776746 (P=7.15×10), and identified associations with nine variants in linkage disequilibrium with rs776746, including eight CYP3A4 variants. No NR1I2 variants were significantly associated. Age, weight, and hemoglobin were also significantly associated with the outcome. In final models, rs776746 explained 39% of variability in dose requirement and 46% was explained by the model containing clinical covariates.
This study highlights the utility of DNA biobanks and electronic medical records for tacrolimus pharmacogenomic research.
Pharmacogenetics and Genomics 11/2011; 22(1):32-42. · 3.48 Impact Factor
-
ABSTRACT: Semantic-based sublanguage grammars have been shown to be an efficient method for medical language processing. However, given the complexity of the medical domain, parsers using such grammars inevitably encounter ambiguous sentences, which could be interpreted by different groups of production rules and consequently result in two or more parse trees. One possible solution, which has not been extensively explored previously, is to augment productions in medical sublanguage grammars with probabilities to resolve the ambiguity. In this study, we associated probabilities with production rules in a semantic-based grammar for medication findings and evaluated its performance on reducing parsing ambiguity. Using the existing data set from 2009 i2b2 NLP (Natural Language Processing) challenge for medication extraction, we developed a semantic-based CFG (Context Free Grammar) for parsing medication sentences and manually created a Treebank of 4564 medication sentences from discharge summaries. Using the Treebank, we derived a semantic-based PCFG (Probabilistic Context Free Grammar) for parsing medication sentences. Our evaluation using a 10-fold cross validation showed that the PCFG parser dramatically improved parsing performance when compared to the CFG parser.
Journal of Biomedical Informatics 08/2011; 44(6):1068-75. · 1.79 Impact Factor
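The abstract above says a PCFG was derived from a manually built Treebank but does not state the estimation method. A common way to attach probabilities to productions is relative-frequency (maximum likelihood) estimation, sketched below in Python as an assumption rather than a description of the authors' procedure. The nonterminal names and the observed rules are invented for the example.

```python
from collections import Counter


def estimate_rule_probabilities(observed_rules):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs).

    observed_rules is a list of (lhs, rhs) productions read off treebank
    parse trees, where rhs is a tuple of symbols.
    """
    rule_counts = Counter(observed_rules)
    lhs_counts = Counter(lhs for lhs, _ in observed_rules)
    return {
        rule: count / lhs_counts[rule[0]]
        for rule, count in rule_counts.items()
    }


# Hypothetical productions from a tiny medication treebank.
rules = [
    ("MED", ("DRUG", "DOSE")),
    ("MED", ("DRUG", "DOSE")),
    ("MED", ("DRUG",)),
    ("DOSE", ("NUM", "UNIT")),
]
probabilities = estimate_rule_probabilities(rules)
```

With such estimates, a parser facing two competing parse trees for an ambiguous sentence can prefer the tree whose productions have the higher product of probabilities.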
-
Hua Xu,
Min Jiang,
Matt Oetjens,
Erica A Bowton,
Andrea H Ramirez,
Janina M Jeff,
Melissa A Basford,
Jill M Pulley,
James D Cowan,
Xiaoming Wang,
Marylyn D Ritchie,
Daniel R Masys,
Dan M Roden,
Dana C Crawford,
Joshua C Denny
ABSTRACT: DNA biobanks linked to comprehensive electronic health records systems are potentially powerful resources for pharmacogenetic studies. This study sought to develop natural-language-processing algorithms to extract drug-dose information from clinical text, and to assess the capabilities of such tools to automate the data-extraction process for pharmacogenetic studies.
A manually validated warfarin pharmacogenetic study identified a cohort of 1125 patients with a stable warfarin dose, in which 776 patients were managed by Coumadin Clinic physicians, and the remaining 349 patients were managed by their providers. The authors developed two algorithms to extract weekly warfarin doses from both data sets: a regular expression-based program for semistructured Coumadin Clinic notes; and an advanced weekly dose calculator based on an existing medication information extraction system (MedEx) for narrative providers' notes. The authors then conducted an association analysis between an automatically extracted stable weekly dose of warfarin and four genetic variants of VKORC1 and CYP2C9 genes. The performance of the weekly dose-extraction program was evaluated by comparing it with a gold standard containing manually curated weekly doses. Precision, recall, F-measure, and overall accuracy were reported. Associations between known variants in VKORC1 and CYP2C9 and warfarin stable weekly dose were assessed using linear regression adjusted for age, gender, and body mass index.
The authors' evaluation showed that the MedEx-based system could determine patients' warfarin weekly doses with 99.7% recall, 90.8% precision, and 93.8% accuracy. Using the automatically extracted weekly doses of warfarin, the authors successfully replicated the previous known associations between warfarin stable dose and genetic variants in VKORC1 and CYP2C9.
Journal of the American Medical Informatics Association 07/2011; 18(4):387-91. · 3.61 Impact Factor