Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining.
ABSTRACT: Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text, based on a number of publicly available databases, and tested it on an annotated corpus. To achieve acceptable recall and precision, we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.
ABSTRACT: DrugBank is a richly annotated resource that combines detailed drug data with comprehensive drug target and drug action information. Since its first release in 2006, DrugBank has been widely used to facilitate in silico drug target discovery, drug design, drug docking or screening, drug metabolism prediction, drug interaction prediction and general pharmaceutical education. The latest version of DrugBank (release 2.0) has been expanded significantly over the previous release. With approximately 4900 drug entries, it now contains 60% more FDA-approved small molecule and biotech drugs including 10% more 'experimental' drugs. Significantly, more protein target data has also been added to the database, with the latest version of DrugBank containing three times as many non-redundant protein or drug target sequences as before (1565 versus 524). Each DrugCard entry now contains more than 100 data fields with half of the information being devoted to drug/chemical data and the other half devoted to pharmacological, pharmacogenomic and molecular biological data. A number of new data fields, including food-drug interactions, drug-drug interactions and experimental ADME data have been added in response to numerous user requests. DrugBank has also significantly improved the power and simplicity of its structure query and text query searches. DrugBank is available at http://www.drugbank.ca.
Nucleic Acids Research 02/2008; 36(Database issue):D901-6. · 8.28 Impact Factor
ABSTRACT: To achieve high speed with minimal effort, we created a system dubbed Peregrine that performs gene name normalization by simple dictionary lookup followed by several post-processing steps.
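The idea of dictionary lookup followed by post-processing can be illustrated with a minimal Python sketch. This is not Peregrine's implementation: the toy dictionary, the identifiers `G1` and `G2`, the list of ambiguous terms, and the function names are all assumptions made for illustration only.

```python
import re

# Hypothetical toy dictionary mapping surface forms to concept identifiers.
DICTIONARY = {
    "brca1": "G1",
    "breast cancer 1": "G1",
    "cat": "G2",
}

# Hypothetical post-processing rule: drop terms that are usually ordinary words.
AMBIGUOUS_TERMS = {"cat"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lookup(text, dictionary=DICTIONARY, max_len=4):
    """Greedy longest-match lookup of dictionary terms in the token sequence."""
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                matches.append((candidate, dictionary[candidate]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return matches


def post_process(matches, ambiguous=AMBIGUOUS_TERMS):
    """Remove matches whose surface form is on the ambiguous-term list."""
    return [m for m in matches if m[0] not in ambiguous]
```

For example, `post_process(lookup("Breast cancer 1 and BRCA1 were studied in the cat."))` keeps both synonyms of the hypothetical concept `G1` and discards the ambiguous match on "cat".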
ABSTRACT: Knowledge about biological effects of small molecules helps in the understanding of biological processes and supports the development of new therapeutic agents. DrugBank is a high quality database providing such information about drugs that contains annotation of drug effects and classification of therapeutic effects. However, to broaden the scope of such a database in classifying and annotating drugs, systems for automatic extraction of classification terms and the corresponding annotation of drugs are needed. We have developed an approach for the identification of new terms used in unstructured text that provide information about drug properties. It is based on the identification and extraction of phrases corresponding to lexico-syntactic patterns (so-called Hearst patterns) that contain drug names and directly related drug annotation terms. Such phrases could be identified with high performance in DrugBank text (0.89 F-score) and in Medline abstracts (0.83 F-score). In comparison to DrugBank annotation terminology, a large number of new drug annotation terms could be found. The evaluation of terms extracted from Medline showed that 29-53% of them are new valid drug property terms. They could be assigned to existing and new drug property classes not provided by the DrugBank drug annotation. We come to the conclusion that our system can support database content updates by additionally providing drug descriptions of pharmacological effects not yet found in databases like DrugBank. Moreover, we propose that automatic normalization of terms improves the annotation and the retrieval of relevant database entries. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Bioinformatics 08/2007; 23(13):i264-72. · 5.47 Impact Factor
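As a rough illustration of what a Hearst pattern looks like, the Python sketch below matches the single pattern "X such as A, B and C" and keeps only pairs whose item is a known drug name. It is a simplification under stated assumptions, not the approach evaluated in the abstract: the regular expressions, the one-word restriction on the class term, and the function name `extract_pairs` are hypothetical.

```python
import re

# One Hearst pattern: "<class term> such as <item>, <item> and <item>".
# Assumption: the class term and each item are single lowercase words.
SUCH_AS = re.compile(
    r"(?P<hypernym>[a-z][a-z-]*) such as "
    r"(?P<items>[a-z][a-z-]*(?:(?:, (?:and |or )?| and | or )[a-z][a-z-]*)*)"
)

# Separators between enumerated items.
SEPARATOR = re.compile(r", (?:and |or )?| and | or ")


def extract_pairs(sentence, drugs):
    """Return (class term, drug) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence.lower()):
        for item in SEPARATOR.split(match.group("items")):
            if item in drugs:  # keep only items that are known drug names
                pairs.append((match.group("hypernym"), item))
    return pairs
```

Calling `extract_pairs("Anticoagulants such as warfarin and heparin require monitoring.", {"warfarin", "heparin"})` pairs the class term "anticoagulants" with each of the two drug names. A realistic system would need multi-word terms, more patterns, and syntactic analysis rather than plain regular expressions.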
Hettne et al. Journal of Cheminformatics 2010, 2:4
Automatic vs. manual curation of a multi-source
chemical dictionary: the impact on text mining
© 2010 Hettne et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Kristina M Hettne*1,2, Antony J Williams3, Erik M van Mulligen1, Jos Kleinjans2, Valery Tkachenko3 and Jan A Kors1
In 'Automatic vs. manual curation of a multi-source
chemical dictionary: the impact on text mining' (Hettne
et al. Journal of Cheminformatics 2010, 2:3), the name
of the automatically curated dictionary is identified as
'Chemlist'. CHEMLIST is a trademark that the American
Chemical Society has used for many years to identify its
Regulated Chemicals Listing (CAS) database. To avoid
future confusion, the 'Chemlist' dictionary mentioned in
this article has been renamed to 'Jochem.'
1Department of Medical Informatics, Erasmus University Medical Center,
Rotterdam, The Netherlands, 2Department of Health Risk Analysis and
Toxicology, Maastricht University, Maastricht, The Netherlands and 3Royal
Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC-27587, USA
1. Hettne KM, Williams AJ, van Mulligen EM, Kleinjans J, Tkachenko V, Kors
JA: Automatic vs. manual curation of a multi-source chemical
dictionary: the impact on text mining. J Cheminform 2010, 2:3.
Cite this article as: Hettne et al., Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining. Journal of Cheminformatics 2010, 2:4
Received: 1 June 2010 Accepted: 3 June 2010
Published: 3 June 2010
This article is available from: http://www.jcheminf.com/content/2/1/4
* Correspondence: email@example.com
1 Department of Medical Informatics, Erasmus University Medical Center,
Rotterdam, The Netherlands
Full list of author information is available at the end of the article