Active learning for clinical text classification: is it better than random sampling?
ABSTRACT This study explores active learning algorithms as a way to reduce the requirements for large training sets in medical text classification tasks.
Three existing active learning algorithms (distance-based (DIST), diversity-based (DIV), and a combination of both (CMB)) were used to classify text from five datasets. The performance of these algorithms was compared to that of passive learning on the five datasets. We then conducted a novel investigation of the interaction between dataset characteristics and the performance results.
Classification accuracy and area under receiver operating characteristics (ROC) curves for each algorithm at different sample sizes were generated. The performance of active learning algorithms was compared with that of passive learning using a weighted mean of paired differences. To determine why the performance varies on different datasets, we measured the diversity and uncertainty of each dataset using relative entropy and correlated the results with the performance differences.
The DIST and CMB algorithms performed better than passive learning. With a statistical significance level set at 0.05, DIST outperformed passive learning in all five datasets, while CMB was found to be better than passive learning in four datasets. We found strong correlations between the dataset diversity and the DIV performance, as well as the dataset uncertainty and the performance of the DIST algorithm.
For medical text classification, appropriate active learning algorithms can yield performance comparable to that of passive learning with considerably smaller training sets. In particular, our results suggest that DIV performs better on data with higher diversity and DIST on data with lower uncertainty.
[Show abstract] [Hide abstract]
ABSTRACT: Natural language processing (NLP) applications typically use regular expressions that have been developed manually by human experts. Our goal is to automate both the creation and utilization of regular expressions in text classification. We designed a novel regular expression discovery (RED) algorithm and implemented two text classifiers based on RED. The RED+ALIGN classifier combines RED with an alignment algorithm, and RED+SVM combines RED with a support vector machine (SVM) classifier. Two clinical datasets were used for testing and evaluation: the SMOKE dataset, containing 1091 text snippets describing smoking status; and the PAIN dataset, containing 702 snippets describing pain status. We performed 10-fold cross-validation to calculate accuracy, precision, recall, and F-measure metrics. In the evaluation, an SVM classifier was trained as the control. The two RED classifiers achieved 80.9-83.0% in overall accuracy on the two datasets, which is 1.3-3% higher than SVM's accuracy (p<0.001). Similarly, small but consistent improvements have been observed in precision, recall, and F-measure when RED classifiers are compared with SVM alone. More significantly, RED+ALIGN correctly classified many instances that were misclassified by the SVM classifier (8.1-10.3% of the total instances and 43.8-53.0% of SVM's misclassifications). Machine-generated regular expressions can be effectively used in clinical text classification. The regular expression-based classifier can be combined with other classifiers, like SVM, to improve classification performance.Journal of the American Medical Informatics Association 02/2014; 21(5). DOI:10.1136/amiajnl-2013-002411 · 3.93 Impact Factor
[Show abstract] [Hide abstract]
ABSTRACT: Observational research using data from electronic health records (EHR) is a rapidly growing area, which promises both increased sample size and data richness - therefore unprecedented study power. However, in many medical domains, large amounts of potentially valuable data are contained within the free text clinical narrative. Manually reviewing free text to obtain desired information is an inefficient use of researcher time and skill. Previous work has demonstrated the feasibility of applying Natural Language Processing (NLP) to extract information. However, in real world research environments, the demand for NLP skills outweighs supply, creating a bottleneck in the secondary exploitation of the EHR. To address this, we present TextHunter, a tool for the creation of training data, construction of concept extraction machine learning models and their application to documents. Using confidence thresholds to ensure high precision (>90%), we achieved recall measurements as high as 99% in real world use cases.American Medical Informatics Association (AMIA) 2014, Washington, USA; 11/2014
[Show abstract] [Hide abstract]
ABSTRACT: Objective This paper presents an automated system for classifying the results of imaging examinations (CT, MRI, positron emission tomography) into reportable and non-reportable cancer cases. This system is part of an industrial-strength processing pipeline built to extract content from radiology reports for use in the Victorian Cancer Registry. Materials and methods In addition to traditional supervised learning methods such as conditional random fields and support vector machines, active learning (AL) approaches were investigated to optimize training production and further improve classification performance. The project involved two pilot sites in Victoria, Australia (Lake Imaging (Ballarat) and Peter MacCallum Cancer Centre (Melbourne)) and, in collaboration with the NSW Central Registry, one pilot site at Westmead Hospital (Sydney). Results The reportability classifier performance achieved 98.25% sensitivity and 96.14% specificity on the cancer registry's held-out test set. Up to 92% of training data needed for supervised machine learning can be saved by AL. Discussion AL is a promising method for optimizing the supervised training production used in classification of radiology reports. When an AL strategy is applied during the data selection process, the cost of manual classification can be reduced significantly. Conclusions The most important practical application of the reportability classifier is that it can dramatically reduce human effort in identifying relevant reports from the large imaging pool for further investigation of cancer. The classifier is built on a large real-world dataset and can achieve high performance in filtering relevant reports to support cancer registries.Journal of the American Medical Informatics Association 05/2014; 21(5). DOI:10.1136/amiajnl-2013-002516 · 3.93 Impact Factor