Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing

Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, United States.
Journal of Medical Internet Research (Impact Factor: 3.43). 04/2013; 15(4):e73. DOI: 10.2196/jmir.2426
Source: PubMed


A high-quality gold standard is vital for supervised, machine learning-based, clinical natural language processing (NLP) systems. In clinical NLP projects, expert annotators traditionally create the gold standard. However, traditional annotation is expensive and time-consuming. To reduce the cost of annotation, general NLP projects have turned to crowdsourcing based on Web 2.0 technology, which involves submitting smaller subtasks to a coordinated marketplace of workers on the Internet. Many studies have been conducted in the area of crowdsourcing, but only a few have focused on tasks in the general NLP field and only a handful in the biomedical domain, usually based upon very small pilot sample sizes. In addition, the quality of crowdsourced biomedical NLP corpora was never exceptional when compared with traditionally developed gold standards. Previously reported results on a medical named entity annotation task showed an F-measure agreement of 0.68 between crowdsourced and traditionally developed corpora.
Building upon previous work in general crowdsourcing research, this study investigated the usability of crowdsourcing in the clinical NLP domain, with special emphasis on achieving high agreement between crowdsourced and traditionally developed corpora.
To build the gold standard for evaluating the crowdsourcing workers' performance, 1042 clinical trial announcements (CTAs) from the ClinicalTrials.gov website were randomly selected and double annotated for medication names, medication types, and linked attributes. For the experiments, we used CrowdFlower, an Amazon Mechanical Turk-based crowdsourcing platform. We calculated sensitivity, precision, and F-measure to evaluate the quality of the crowd's work, and we applied the chi-square test (P<.001) to detect differences between the crowdsourced and traditionally developed annotations.
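To make the evaluation concrete, below is a minimal Python sketch of these metrics, assuming each corpus is represented as a set of exact-match annotation spans; the span representation and the chi-square contingency-table layout are illustrative assumptions, not the study's released code.

```python
# Minimal sketch of the evaluation metrics described above. Each corpus is
# assumed to be a set of (doc_id, start, end, label) annotation spans; this
# representation and the contingency-table layout are illustrative assumptions,
# not the study's released code.
from scipy.stats import chi2_contingency

def annotation_metrics(gold, crowd):
    """Exact-match precision, sensitivity (recall), and F-measure."""
    tp = len(gold & crowd)                      # spans both corpora contain
    fp = len(crowd - gold)                      # crowd-only spans
    fn = len(gold - crowd)                      # gold-only spans (missed)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                 if precision + sensitivity else 0.0)
    return precision, sensitivity, f_measure

def chi_square_p(gold, crowd):
    """Chi-square test on matched vs unmatched span counts for each corpus."""
    tp = len(gold & crowd)
    table = [[tp, len(crowd - gold)],           # crowd: matched, unmatched
             [tp, len(gold - crowd)]]           # gold:  matched, unmatched
    _, p_value, _, _ = chi2_contingency(table)
    return p_value
```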
The agreement between the crowd's annotations and the traditionally generated corpora was high for (1) annotations (F-measure 0.87 for medication names; 0.73 for medication types) and (2) correction of previous annotations (0.90 for medication names; 0.76 for medication types), and excellent for (3) linking medications with their attributes (0.96). Simple voting provided the best judgment aggregation approach. There was no statistically significant difference between the crowdsourced and traditionally generated corpora. Our results showed a 27.9% improvement over previously reported results on the medication named entity annotation task.
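The simple-voting aggregation can be illustrated with a short, hedged sketch: each annotation unit keeps whichever label the majority of workers assigned. The function and identifiers below are illustrative, not part of the released CrowdFlower infrastructure code.

```python
# A minimal sketch of "simple voting" judgment aggregation: each unit
# (e.g., a candidate medication mention) keeps the label that the majority
# of workers chose. Names and the unit identifier are illustrative only.
from collections import Counter, defaultdict

def aggregate_by_simple_voting(judgments):
    """judgments: iterable of (unit_id, worker_id, label) tuples."""
    votes = defaultdict(Counter)
    for unit_id, _worker_id, label in judgments:
        votes[unit_id][label] += 1
    # most_common(1) returns the winning (label, count) pair for each unit
    return {unit_id: counts.most_common(1)[0][0]
            for unit_id, counts in votes.items()}

# Example: three workers judge whether one mention is a medication name;
# the majority label wins.
example = [("cta_017_mention_3", "w1", "medication"),
           ("cta_017_mention_3", "w2", "medication"),
           ("cta_017_mention_3", "w3", "not_medication")]
print(aggregate_by_simple_voting(example))   # {'cta_017_mention_3': 'medication'}
```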
This study offers three contributions. First, we showed that crowdsourcing is a feasible, inexpensive, fast, and practical approach to collecting high-quality annotations for clinical text (when protected health information is excluded). We believe that well-designed user interfaces and a rigorous quality control strategy for entity annotation and linking were critical to the success of this work. Second, as a further contribution to the Internet-based crowdsourcing field, we will publicly release the JavaScript and CrowdFlower Markup Language infrastructure code that is necessary to utilize CrowdFlower's quality control and crowdsourcing interfaces for named entity annotation. Finally, to spur future research, we will release the CTA annotations generated by both the traditional and the crowdsourced approaches.

Cited by:
    • "These techniques are being leveraged to collect data on drug interactions, drug adverse events and other complex relations reported in the literature. Other researchers (16) have experimented with crowdsourcing to create gold standard datasets for evaluation of biomedical curation tasks. "
    ABSTRACT: Background: This article describes capture of biological information using a hybrid approach that combines natural language processing to extract biological entities and crowdsourcing with annotators recruited via Amazon Mechanical Turk to judge correctness of candidate biological relations. These techniques were applied to extract gene–mutation relations from biomedical abstracts with the goal of supporting production-scale capture of gene–mutation–disease findings as an open source resource for personalized medicine. Results: The hybrid system could be configured to provide good performance for gene–mutation extraction (precision ∼82%; recall ∼70% against an expert-generated gold standard) at a cost of $0.76 per abstract. This demonstrates that crowd labor platforms such as Amazon Mechanical Turk can be used to recruit quality annotators, even in an application requiring subject matter expertise; aggregated Turker judgments for gene–mutation relations exceeded 90% accuracy. Over half of the precision errors were due to mismatches against the gold standard hidden from annotator view (e.g. incorrect EntrezGene identifier or incorrect mutation position extracted), or incomplete task instructions (e.g. the need to exclude nonhuman mutations). Conclusions: The hybrid curation model provides a readily scalable, cost-effective approach to curation, particularly if coupled with expert human review to filter precision errors. We plan to generalize the framework and make it available as open source software. Database URL: http://www.mitre.org/publications/technical-papers/hybrid-curation-of-gene-mutation-relations-combining-automated (A short arithmetic sketch based on the precision, recall, and cost figures reported here follows this entry.)
    Database: The Journal of Biological Databases and Curation 01/2014; 2014. DOI:10.1093/database/bau094 · 3.37 Impact Factor
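For orientation, here is a small worked-arithmetic sketch (Python) of how the figures quoted in the entry above combine: the harmonic mean of the reported precision and recall gives the F-measure, and the per-abstract cost scales linearly with corpus size. The corpus size below is a made-up illustration, not a figure from the paper.

```python
# Worked arithmetic for the hybrid-curation figures quoted above.
# Precision, recall, and cost per abstract come from the cited abstract;
# the corpus size (n_abstracts) is a hypothetical number for illustration.
precision = 0.82
recall = 0.70
f_measure = 2 * precision * recall / (precision + recall)
print(f"F-measure: {f_measure:.3f}")        # ~0.755

cost_per_abstract = 0.76                    # USD, as reported
n_abstracts = 10_000                        # hypothetical corpus size
print(f"Estimated curation cost: ${cost_per_abstract * n_abstracts:,.2f}")
```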
    • "This makes crowdsourcing a feasible approach for collecting a large normative free-text dataset, such as is needed for an automated natural language scoring method [6]. Unlike previous applications of crowdsourcing to medical natural language processing (eg [24]), our method does not use worker qualification tests or “gold standard” responses created by experts to screen out low-quality answers. Instead, the large volume of free-text data compensates for potential inconsistency in the quality. "
    ABSTRACT: BACKGROUND: Crowdsourcing has become a valuable method for collecting medical research data. This approach, recruiting through open calls on the Web, is particularly useful for assembling large normative datasets. However, it is not known how natural language datasets collected over the Web differ from those collected under controlled laboratory conditions. OBJECTIVE: To compare the natural language responses obtained from a crowdsourced sample of participants with responses collected in a conventional laboratory setting from participants recruited according to specific age and gender criteria. METHODS: We collected natural language descriptions of 200 half-minute movie clips from Amazon Mechanical Turk workers (crowdsourced) and 60 participants recruited from the community (lab-sourced). Crowdsourced participants responded to as many clips as they wanted and typed their responses, whereas lab-sourced participants gave spoken responses to 40 clips, and their responses were transcribed. The content of the responses was evaluated using a take-one-out procedure, which compared responses to other responses to the same clip and to other clips, with a comparison of the average number of shared words. RESULTS: In contrast to the 13 months of recruiting that was required to collect normative data from 60 lab-sourced participants (with specific demographic characteristics), only 34 days were needed to collect normative data from 99 crowdsourced participants (contributing a median of 22 responses). The majority of crowdsourced workers were female, and the median age was 35 years, lower than the lab-sourced median of 62 years but similar to the median age of the US population. The responses contributed by the crowdsourced participants were longer on average, that is, 33 words compared to 28 words (P<.001), and they used a less varied vocabulary. However, there was strong similarity in the words used to describe a particular clip between the two datasets, as a cross-dataset count of shared words showed (P<.001). Within both datasets, responses contained substantial relevant content, with more words in common with responses to the same clip than to other clips (P<.001). There was evidence that responses from female and older crowdsourced participants had more shared words (P=.004 and P=.01, respectively), whereas younger participants had higher numbers of shared words in the lab-sourced population (P=.01). CONCLUSIONS: Crowdsourcing is an effective approach to quickly and economically collect a large reliable dataset of normative natural language responses. (A minimal sketch of the take-one-out shared-word comparison follows this entry.)
    Journal of Medical Internet Research 05/2013; 15(5):e100. DOI:10.2196/jmir.2620 · 3.43 Impact Factor
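Below is a minimal sketch of a take-one-out shared-word comparison like the one described in the entry above; the tokenization, data layout, and averaging choices are assumptions for illustration, not taken from the paper.

```python
# Minimal sketch of a take-one-out shared-word comparison: each response is
# scored against the other responses to the same clip and against responses
# to other clips. responses: dict mapping clip_id -> list of free-text strings.
from statistics import mean

def shared_words(a, b):
    """Number of distinct words the two responses have in common."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def take_one_out(responses):
    same_clip_scores, other_clip_scores = [], []
    for clip_id, texts in responses.items():
        for i, text in enumerate(texts):
            # Compare against the other responses to the same clip...
            same = [shared_words(text, other)
                    for j, other in enumerate(texts) if j != i]
            # ...and against responses to every other clip.
            other = [shared_words(text, o)
                     for cid, others in responses.items() if cid != clip_id
                     for o in others]
            if same:
                same_clip_scores.append(mean(same))
            if other:
                other_clip_scores.append(mean(other))
    # Higher same-clip averages than other-clip averages indicate relevant content.
    return mean(same_clip_scores), mean(other_clip_scores)
```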
  • ABSTRACT: Prescription opioid diversion and abuse are major public health issues in the United States and internationally. Street prices of diverted prescription opioids can provide an indicator of drug availability, demand, and abuse potential, but these data can be difficult to collect. Crowdsourcing is a rapid and cost-effective way to gather information about sales transactions. We sought to determine whether crowdsourcing can provide accurate measurements of the street price of diverted prescription opioid medications. Our objective was to assess the possibility of crowdsourcing black market drug price data through cross-validation with law enforcement officer reports. Using a crowdsourcing research website (StreetRx), we solicited data about the price that site visitors paid for diverted prescription opioid analgesics during the first half of 2012. These results were compared with a survey of law enforcement officers in the Researched Abuse, Diversion, and Addiction-Related Surveillance (RADARS) System, and actual transaction prices on a "dark Internet" marketplace (Silk Road). Geometric means and 95% confidence intervals were calculated for comparing prices per milligram of drug in US dollars. In a secondary analysis, we compared prices per milligram of morphine equivalent using standard equianalgesic dosing conversions. A total of 954 price reports were obtained from crowdsourcing, 737 from law enforcement, and 147 from the online marketplace. Correlations between the 3 data sources were highly linear, with Spearman rho of 0.93 (P<.001) between crowdsourced and law enforcement, and 0.98 (P<.001) between crowdsourced and online marketplace. On StreetRx, the mean prices per milligram were US$3.29 hydromorphone, US$2.13 buprenorphine, US$1.57 oxymorphone, US$0.97 oxycodone, US$0.96 methadone, US$0.81 hydrocodone, US$0.52 morphine, and US$0.05 tramadol. The only significant difference between data sources was morphine, with a Drug Diversion price of US$0.67/mg (95% CI 0.59-0.75) and a Silk Road price of US$0.42/mg (95% CI 0.37-0.48). Street prices generally followed clinical equianalgesic potency. Crowdsourced data provide a valid estimate of the street price of diverted prescription opioids. The (ostensibly free) black market was able to accurately predict the relative pharmacologic potency of opioid molecules. (A brief sketch of the geometric mean, confidence interval, and rank correlation calculations follows this entry.)
    Journal of Medical Internet Research 08/2013; 15(8):e178. DOI:10.2196/jmir.2810 · 3.43 Impact Factor
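As a final illustration, here is a short sketch of the summary statistics named in the entry above: a geometric mean with a 95% confidence interval computed on the log scale, and a Spearman rank correlation between two price series. The price arrays are made-up examples, not study data.

```python
# Minimal sketch of the summary statistics described above: geometric mean
# with a 95% CI (computed on the log scale) and a Spearman rank correlation
# between two price series. The price arrays below are made-up illustrations.
import numpy as np
from scipy import stats

def geometric_mean_ci(prices_per_mg, confidence=0.95):
    logs = np.log(np.asarray(prices_per_mg, dtype=float))
    center, sem = logs.mean(), stats.sem(logs)
    lo, hi = stats.t.interval(confidence, len(logs) - 1, loc=center, scale=sem)
    return np.exp(center), (np.exp(lo), np.exp(hi))

# Hypothetical per-milligram prices (USD) for one drug from two sources:
crowdsourced = [0.90, 1.10, 0.95, 1.05, 0.85, 1.20]
law_enforcement = [0.95, 1.00, 1.05, 1.10, 0.80, 1.15]

gm, (lo, hi) = geometric_mean_ci(crowdsourced)
rho, p = stats.spearmanr(crowdsourced, law_enforcement)
print(f"Geometric mean ${gm:.2f}/mg (95% CI {lo:.2f}-{hi:.2f}); rho={rho:.2f}, P={p:.3f}")
```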