Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining.
ABSTRACT: Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text, based on a number of publicly available databases, and tested it on an annotated corpus. To achieve acceptable recall and precision, we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.
Hettne et al. Journal of Cheminformatics 2010, 2:4
© 2010 Hettne et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Kristina M Hettne*1,2, Antony J Williams3, Erik M van Mulligen1, Jos Kleinjans2, Valery Tkachenko3 and Jan A Kors1
In 'Automatic vs. manual curation of a multi-source
chemical dictionary: the impact on text mining' (Hettne
et al. Journal of Cheminformatics 2010, 2:3), the name
of the automatically curated dictionary is identified as
'Chemlist'. CHEMLIST is a trademark that the American
Chemical Society has used for many years to identify its
Regulated Chemicals Listing (CAS) database. To avoid
future confusion, the 'Chemlist' dictionary mentioned in
this article has been renamed to 'Jochem.'
1Department of Medical Informatics, Erasmus University Medical Center,
Rotterdam, The Netherlands, 2Department of Health Risk Analysis and
Toxicology, Maastricht University, Maastricht, The Netherlands and 3Royal
Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC-27587, USA
1. Hettne KM, Williams AJ, van Mulligen EM, Kleinjans J, Tkachenko V, Kors
JA: Automatic vs. manual curation of a multi-source chemical
dictionary: the impact on text mining. J Cheminform 2010, 2:3.
Cite this article as: Hettne et al., Automatic vs. manual curation of a multi-source
chemical dictionary: the impact on text mining. Journal of Cheminformatics
2010, 2:4
Received: 1 June 2010 Accepted: 3 June 2010
Published: 3 June 2010
This article is available from: http://www.jcheminf.com/content/2/1/4
* Correspondence: email@example.com
1 Department of Medical Informatics, Erasmus University Medical Center,
Rotterdam, The Netherlands
Full list of author information is available at the end of the article