Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection

Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research, Food and Drug Administration, Rockville, Maryland 20852, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.5). 06/2011; 18(5):631-8. DOI: 10.1136/amiajnl-2010-000022
Source: PubMed


The US Vaccine Adverse Event Reporting System (VAERS) collects spontaneous reports of adverse events following vaccination. Medical officers review the reports and often apply standardized case definitions, such as those developed by the Brighton Collaboration. Our objective was to demonstrate a multi-level text mining approach for automated text classification of VAERS reports that could potentially reduce human workload.
We selected 6034 VAERS reports for H1N1 vaccine that medical officers had classified as potentially positive (N(pos)=237) or negative for anaphylaxis. We created a categorized corpus of text files that included the class label and the symptom text field of each report. A validation set of 1100 labeled text files was also used. Text mining techniques were applied to extract three feature sets: important keywords, low-level patterns, and high-level patterns. A rule-based classifier processed the high-level feature representation, while several machine learning classifiers were trained on the remaining two feature representations.
Classifiers' performance was evaluated by macro-averaging recall, precision, and F-measure, and Friedman's test; misclassification error rate analysis was also performed.
The rule-based classifier, boosted trees, and weighted support vector machines performed well in terms of macro-recall, although at the expense of a higher mean misclassification error rate. The rule-based classifier performed very well in terms of average sensitivity and specificity (79.05% and 94.80%, respectively).
Our validated results showed the possibility of developing effective medical text classifiers for VAERS reports by combining text mining with informative feature selection; this strategy has the potential to reduce reviewer workload considerably.
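
As an illustration of the macro-averaged recall, precision, and F-measure named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the function name `macro_metrics`, the toy labels, and the choice to compute macro-F as the mean of per-class F-measures (rather than from the macro-averaged precision and recall) are assumptions made for illustration only.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (macro_recall, macro_precision, macro_f) for two label lists.

    Each metric is computed per class and then averaged with equal class
    weight, so the rare positive class counts as much as the negative class.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    recalls, precisions, f_scores = [], [], []
    for c in labels:
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        # Assumption: per-class F is the harmonic mean of precision and recall.
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        precisions.append(precision)
        f_scores.append(f)

    n = len(labels)
    return sum(recalls) / n, sum(precisions) / n, sum(f_scores) / n


# Hypothetical toy labels, not VAERS data.
y_true = ["pos", "pos", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "neg", "pos"]
macro_recall, macro_precision, macro_f = macro_metrics(y_true, y_pred)
print(macro_recall, macro_precision, macro_f)
```

Because each class contributes equally to the average, a classifier that labels every report as negative would score poorly on macro-recall even though its overall error rate on an imbalanced corpus would look low, which is one reason macro-averaging suits a data set with few positive reports.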

  • Source
    • "Weights can also be applied to the variables. The Brighton Collaboration case definition has been implemented as a rule-based program and using features representing the clinical concepts extracted from narrative case descriptions, this program has worked reasonably well to classify cases [46] "
    ABSTRACT: Safety of medical products is a major public health concern. We present a critical discussion of the currently used analytical tools for mining spontaneous reporting systems (SRS) to identify safety signals after use of medical products. We introduce a pattern discovery framework for the analysis of SRS. The terminology ‘pattern discovery’ is borrowed from the engineering and artificial intelligence literature and signifies that the basis of the proposed framework is the medical case, formalizing the cognitive paradigm known to clinicians who evaluate individual patients and individual case safety reports submitted to SRS. The fundamental contribution of this approach is a strong probabilistic component that may account for selection and other biases and facilitates rigorous modeling and inference. We discuss somewhat in depth the concept of signal in pharmacovigilance and connect it with the concept of a pattern; we illustrate this conceptual framework using the example of anaphylaxis. Finally, we propose a research agenda in statistics, informatics, and pharmacovigilance practices needed to advance the pattern discovery framework in both the short and long terms.
    Statistical Analysis and Data Mining 10/2014; 7(5). DOI:10.1002/sam.11233
  • Source
    • "Analysis of biomedical literature for safety signal detection is challenging and labor intensive due to unstructured nature. Therefore, natural-language processing (NLP) techniques recently developed for extracting ADE-related information or direct/indirect drug interactions have gained large popularity[35]-[37]. "
    Chemistry and Chemical Engineering; 06/2014
  • Source
    • "Botsis et al [19] identified cases of anaphylaxis to influenza vaccine from a database of adverse events using a rule-based classifier that incorporated keywords of anaphylaxis as part of the feature set with sensitivity of 79% and specificity of 94%. The standard machine learning classifiers resulted in F-measures ranging from 0.70 to 0.81 [19]. In 2009, Denecke and Baehr [20] utilized additional document metadata such as keywords, titles, journal, and conference information to achieve favorable results for a document classifier [20]. "
    ABSTRACT: The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques to improve study retrieval in the context of the dbGaP database. We utilized standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) trained on dbGaP study text and incorporated n-gram features and study metadata to identify heart, lung, and blood studies. We used the χ² feature selection algorithm to identify features that contributed most to classification performance and experimented with dbGaP associated PubMed papers as a proxy for topicality. Classifier performance was favorable in comparison to keyword-based search results. It was determined that text categorization is a useful complement to document retrieval techniques in the dbGaP.
    Biomedical Informatics Insights 07/2013; 6:35-45. DOI:10.4137/BII.S11987