Article

Automatic vs. manual curation of a multi-source chemical dictionary: The impact on text mining

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.
Journal of Cheminformatics (Impact Factor: 4.55). 03/2010; 2(1):4. DOI: 10.1186/1758-2946-2-4
Source: PubMed

ABSTRACT

Background
Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text, based on a number of publicly available databases, and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.
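To make the dictionary-based identification step concrete, here is a minimal sketch of longest-match dictionary lookup with a stop-word filter, in the spirit of the rule-based filtering described above; the term list, identifiers, and stop words are illustrative only and do not reproduce the actual Chemlist processing steps or disambiguation rules.

```python
import re

# Illustrative entries only; the real dictionaries contain tens to hundreds of thousands of names.
dictionary = {
    "acetylsalicylic acid": "CHEM:0001",
    "aspirin": "CHEM:0001",
    "lead": "CHEM:0002",
}
stop_words = {"lead"}  # ambiguous tokens suppressed by rule-based filtering

def find_chemicals(text):
    """Return (matched text, identifier, start, end) for dictionary terms found in text."""
    hits = []
    taken = set()  # character offsets already covered by a longer match
    # Try longer names first so multi-word terms win over their substrings.
    for term in sorted(dictionary, key=len, reverse=True):
        if term in stop_words:
            continue  # dropped by the stop-word filter
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            span = range(match.start(), match.end())
            if taken.isdisjoint(span):
                taken.update(span)
                hits.append((match.group(), dictionary[term], match.start(), match.end()))
    return sorted(hits, key=lambda h: h[2])

print(find_chemicals("Acetylsalicylic acid (aspirin) may lead to gastric irritation."))
```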

Results
We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary, with ca. 80 k names, was only a third to a quarter of the size of Chemlist, which contains around 300 k names. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation, and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation, and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation.

Conclusions
We conclude the following: (1) the ChemSpider dictionary achieved the best precision, but the Chemlist dictionary had a higher recall and the best F-score; (2) rule-based filtering and disambiguation are necessary to achieve a high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.
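For reference, the F-scores behind conclusion (1) follow directly from the precision and recall reported in the Results (F1 = 2PR/(P+R)); the short calculation below uses only those reported numbers.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# After filtering and disambiguation (values reported in the Results above):
print(round(f1(0.87, 0.19), 2))  # ChemSpider dictionary: ~0.31
print(round(f1(0.67, 0.40), 2))  # Chemlist dictionary:   ~0.50
```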

Hettne et al. Journal of Cheminformatics 2010, 2:4
http://www.jcheminf.com/content/2/1/4
Open Access
CORRECTION
© 2010 Hettne et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Correction: Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining
Kristina M Hettne*¹,², Antony J Williams³, Erik M van Mulligen¹, Jos Kleinjans², Valery Tkachenko³ and Jan A Kors¹
Correction
In 'Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining' (Hettne et al. Journal of Cheminformatics 2010, 2:3) [1], the name of the automatically curated dictionary is identified as 'Chemlist'. CHEMLIST is a trademark that the American Chemical Society has used for many years to identify its Regulated Chemicals Listing (CAS) database. To avoid future confusion, the 'Chemlist' dictionary mentioned in this article has been renamed to 'Jochem'.
Author Details
¹Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands; ²Department of Health Risk Analysis and Toxicology, Maastricht University, Maastricht, The Netherlands; ³Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC 27587, USA
References
1. Hettne KM, Williams AJ, van Mulligen EM, Kleinjans J, Tkachenko V, Kors JA: Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining. J Cheminform 2010, 2:3.
doi: 10.1186/1758-2946-2-4
Cite this article as: Hettne et al.: Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining. Journal of Cheminformatics 2010, 2:4.
Received: 1 June 2010 Accepted: 3 June 2010 Published: 3 June 2010
This article is available from: http://www.jcheminf.com/content/2/1/4
* Correspondence: k.hettne@erasmusmc.nl
¹Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
Full list of author information is available at the end of the article
  • "Text mining from the scientific literature has been considered promising for creating and updating structured databases of biomedical knowledge [1], but it often falls short and, currently, manual curation by experts is still the standard practice for this task [2-5]. Some argue that text mining or natural language processing (NLP) becomes unnecessary when researchers report results following a standardized template [6]."
    ABSTRACT: Background: Numerous publicly available biomedical databases derive data by curation from the literature. The curated data can be useful as training examples for information extraction, but curated data usually lack the exact mentions and their locations in the text required for supervised machine learning. This paper describes a general approach to information extraction using curated data as training examples. The idea is to formulate the problem as cost-sensitive learning from noisy labels, where the cost is estimated by a committee of weak classifiers that consider both curated data and the text. Results: We test the idea on two information extraction tasks of Genome-Wide Association Studies (GWAS). The first task is to extract target phenotypes (diseases or traits) of a study and the second is to extract ethnicity backgrounds of study subjects for different stages (initial or replication). Experimental results show that our approach can achieve a Precision-at-2 (P@2) of 87% for disease/trait extraction, and an F1-score of 0.83 for stage-ethnicity extraction, both outperforming their cost-insensitive baseline counterparts. Conclusions: The results show that curated biomedical databases can potentially be reused as training examples to train information extractors without expert annotation or refinement, opening an unprecedented opportunity of using "big data" in biomedical text mining.
    Preview · Article · Dec 2016 · BMC Bioinformatics
  • "It contains structures, properties and associated information for compounds gathered from more than 470 data sources. The information in the database is validated automatically by robot software, and manually by annotators and crowdsourcing [26,28,29]. We only used the subset of compounds that were manually validated."
    ABSTRACT: Background: The past decade has seen an upsurge in the number of publications in chemistry. The ever-swelling volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The BioCreative CHEMDNER challenge invites the development of systems for the automatic recognition of chemicals in text (CEM task) and for ranking the recognized compounds at the document level (CDI task). We investigated an ensemble approach where dictionary-based named entity recognition is used along with grammar-based recognizers to extract compounds from text. We assessed the performance of ten different commercial and publicly available lexical resources using an open source indexing system (Peregrine), in combination with three different chemical compound recognizers and a set of regular expressions to recognize chemical database identifiers. The effect of different stop-word lists, case-sensitivity matching, and use of chunking information was also investigated. We focused on lexical resources that provide chemical structure information. To rank the different compounds found in a text, we used a term confidence score based on the normalized ratio of the term frequencies in chemical and non-chemical journals. Results: The use of stop-word lists greatly improved the performance of the dictionary-based recognition, but there was no additional benefit from using chunking information. A combination of ChEBI and HMDB as lexical resources, the LeadMine tool for grammar-based recognition, and the regular expressions, outperformed any of the individual systems. On the test set, the F-scores were 77.8% (recall 71.2%, precision 85.8%) for the CEM task and 77.6% (recall 71.7%, precision 84.6%) for the CDI task. Missed terms were mainly due to tokenization issues, poor recognition of formulas, and term conjunctions. Conclusions: We developed an ensemble system that combines dictionary-based and grammar-based approaches for chemical named entity recognition, outperforming any of the individual systems that we considered. The system is able to provide structure information for most of the compounds that are found. Improved tokenization and better recognition of specific entity types is likely to further improve system performance.
    Full-text · Article · Jan 2015 · Journal of Cheminformatics (a frequency-ratio confidence score of this kind is sketched after this citation list)
  • "Phenotype ontologies are also available for multiple species and are widely used for the annotation of the abnormalities observed in mutagenesis experiments [19-21] as well as for the characterization of diseases and drug effects [22]. Further domains covered comprise chemical entities to annotate drugs and their biological activities [23], structures, and pharmaceutical applications [23,24] for data interoperability [25], and ontologies for experimental settings, e.g., the BioAssay Ontology [26], the Experimental Factor Ontology [27], the eagle-i ontology [28] and the Ontology of Biomedical Investigations [29], capture the biomedical metadata to characterize experiments. Similarly, ontologies for environmental conditions denote data samples and their surroundings upon their encounter [27,30]."
    ABSTRACT: Over the past 15 years, the biomedical research community has increased its efforts to produce ontologies encoding biomedical knowledge, and to provide the corresponding infrastructure to maintain them. As ontologies are becoming a central part of biological and biomedical research, a communication channel to publish frequent updates and latest developments on them would be an advantage. Here, we introduce the JBMS thematic series on Biomedical Ontologies. The aim of the series is to disseminate the latest developments in research on biomedical ontologies and provide a venue for publishing newly developed ontologies, updates to existing ontologies as well as methodological advances, and selected contributions from conferences and workshops. We aim to give this thematic series a central role in the exploration of ongoing research in biomedical ontologies and intend to work closely together with the research community towards this aim. Researchers and working groups are encouraged to provide feedback on novel developments and special topics to be integrated into the existing publication cycles.
    Full-text · Article · Mar 2014 · Journal of Biomedical Semantics
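The ensemble system quoted in the second citation above ranks recognized compounds with a term confidence score described as the normalized ratio of term frequencies in chemical and non-chemical journals. Below is a minimal sketch of one way such a score could be computed; the function, smoothing choice, and counts are hypothetical illustrations, not the cited system's implementation.

```python
def term_confidence(chem_count, nonchem_count, chem_total, nonchem_total, smoothing=1.0):
    """Score a term by the normalized ratio of its relative frequency in chemical
    versus non-chemical journals (hypothetical sketch, not the published method)."""
    rel_chem = (chem_count + smoothing) / (chem_total + smoothing)
    rel_nonchem = (nonchem_count + smoothing) / (nonchem_total + smoothing)
    # Normalized to [0, 1]: values near 1 mean the term appears almost exclusively
    # in chemical journals; 0.5 means equal relative frequency in both corpora.
    return rel_chem / (rel_chem + rel_nonchem)

# Hypothetical counts for illustration only.
print(round(term_confidence(250, 3, 1_000_000, 5_000_000), 3))
```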