Automatic vs. manual curation of a multi-source chemical dictionary: The impact on text mining

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.
Journal of Cheminformatics (Impact Factor: 4.55). 03/2010; 2(1):4. DOI: 10.1186/1758-2946-2-4
Source: PubMed


Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.

We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only one-quarter to one-third the size of Chemlist, at around 300 k names. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation.
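As an illustration of how precision and recall of this kind can be obtained from an annotated corpus, the following minimal sketch compares dictionary hits with gold-standard annotations using exact span matching. The span representation, the function name `precision_recall`, and the toy data are assumptions made for illustration; the abstract does not specify the matching criterion that was actually used.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over sets of (doc_id, start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: fraction of dictionary hits that are annotated in the corpus.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of annotated terms that the dictionary found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the corpus used in the study.
gold_spans = {("doc1", 0, 7), ("doc1", 15, 22), ("doc2", 3, 12), ("doc2", 30, 41)}
dictionary_hits = {("doc1", 0, 7), ("doc2", 3, 12), ("doc2", 50, 54)}

p, r = precision_recall(gold_spans, dictionary_hits)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case two of the three hits are correct and two of the four annotated terms are found. Filtering ambiguous terms from a dictionary typically removes false hits, which raises precision, while recall can only stay the same or drop.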

We conclude the following: (1) The ChemSpider dictionary achieved the best precision but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation are necessary to achieve a high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at
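The F-score comparison in conclusion (1) can be checked against the reported precision and recall values. The sketch below assumes the balanced F-score (the harmonic mean of precision and recall); the abstract does not state which F-score variant was used, so this is an assumption, and the function name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as reported in the abstract.
reported = {
    ("ChemSpider", "before filtering"): (0.43, 0.19),
    ("ChemSpider", "after filtering"): (0.87, 0.19),
    ("Chemlist", "before filtering"): (0.20, 0.47),
    ("Chemlist", "after filtering"): (0.67, 0.40),
}

for (dictionary, stage), (p, r) in reported.items():
    print(f"{dictionary:10s} {stage:16s} P={p:.2f} R={r:.2f} F={f_score(p, r):.2f}")
```

Under this assumption the filtered Chemlist dictionary comes out at roughly 0.50 and the filtered ChemSpider dictionary at roughly 0.31, which is consistent with the statement that Chemlist had the best F-score despite its lower precision.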



Available from: Antony John Williams
    • "Further domains covered comprise chemical entities to annotate drugs and their biological activities [23], structures, and pharmaceutical applications [23,24] for data interoperability [25], and ontologies for experimental settings, e.g., the BioAssay Ontology [26], the Experimental Factor Ontology [27], the eagle-i ontology [28] and the Ontology of Biomedical Investigations [29], capture the biomedical metadata to characterize experiments. Similarly, ontologies for environmental conditions denote data samples and their surroundings upon their encounter [27,30]. "
    ABSTRACT: Over the past 15 years, the biomedical research community has increased its efforts to produce ontologies encoding biomedical knowledge, and to provide the corresponding infrastructure to maintain them. As ontologies are becoming a central part of biological and biomedical research, a communication channel to publish frequent updates and latest developments on them would be an advantage. Here, we introduce the JBMS thematic series on Biomedical Ontologies. The aim of the series is to disseminate the latest developments in research on biomedical ontologies and provide a venue for publishing newly developed ontologies, updates to existing ontologies as well as methodological advances, and selected contributions from conferences and workshops. We aim to give this thematic series a central role in the exploration of ongoing research in biomedical ontologies and intend to work closely together with the research community towards this aim. Researchers and working groups are encouraged to provide feedback on novel developments and special topics to be integrated into the existing publication cycles.
    Full-text · Article · Mar 2014 · Journal of Biomedical Semantics
    • "We use a text-mining and inference system based on concept profiles to expose novel and relevant associations between concepts from biomedical literature. This information retrieval system has been shown in retrospective studies to rediscover gene-chemical, protein-protein, and gene-disease associations in some cases years before they were explicitly stated in the literature [5], [6]. Concept profiles have also been shown to predict protein-protein interactions that were subsequently validated experimentally [7]. "
    ABSTRACT: Weighted semantic networks built from text-mined literature can be used to retrieve known protein-protein or gene-disease associations, and have been shown to anticipate associations years before they are explicitly stated in the literature. Our text-mining system recognizes over 640,000 biomedical concepts: some are specific (i.e., names of genes or proteins) others generic (e.g., ‘Homo sapiens’). Generic concepts may play important roles in automated information retrieval, extraction, and inference but may also result in concept overload and confound retrieval and reasoning with low-relevance or even spurious links. Here, we attempted to optimize the retrieval performance for protein-protein interactions (PPI) by filtering generic concepts (node filtering) or links to generic concepts (edge filtering) from a weighted semantic network. First, we defined metrics based on network properties that quantify the specificity of concepts. Then using these metrics, we systematically filtered generic information from the network while monitoring retrieval performance of known protein-protein interactions. We also systematically filtered specific information from the network (inverse filtering), and assessed the retrieval performance of networks composed of generic information alone.
    Full-text · Article · Nov 2013 · PLoS ONE
    • "The selected phytoligands are shown in supplementary materials, Table 1. The three-dimensional structures of these phytoligands were retrieved from ChemSpider (Hettne et al., 2010) database. The drug likeness properties of the phytoligands were computationally predicted by Lipinski's rule of five (Giménez, Santos, Ferrarini, & Fernandes, 2010), Comprehensive Medicinal Chemistry (CMC)-like rule (Ajay, Walters, & Murcko, 1998), MDDR (MDL Drug Data Report)-like rule (Frimurer, Bywater, Naerum, Lauritsen, & Brunak, 2000), Lead-like Rule (Oprea et al., 2007) and World drug index (WDI)-like rule (Wagener & van Geerestein, 2000) available in PreADMET package. "
    ABSTRACT: In our recent studies on prevalence of multidrug resistant pathogens in Byramangala reservoir, Karnataka, India, we identified Salmonella typhi, Staphylococcus aureus, and Vibrio cholerae which had acquired multiple drug resistance (MDR) and emerged as superbugs. Hence, there is a pressing demand to identify alternative therapeutic remedies. Our study focused on the screening of herbal leads by structure-based virtual screening. The virulent gene products of these pathogens towards Kanamycin (aph), Trimethoprim (dfrA1), Methicillin (mecI), and Vancomycin (vanH) were identified as the probable drug targets and their 3D structures were predicted by homology modeling. The predicted models showed good stereochemical validity. By extensive literature survey, we selected 58 phytoligands and their drug likeliness and pharmacokinetic properties were computationally predicted. The inhibitory properties of these ligands against drug targets were studied by molecular docking. Our studies revealed that Baicalein from S. baicalensis (baikal skullcap) and Luteolin from Taraxacum officinale (dandelion) were identified as potential inhibitors against aph of S. typhi. Resveratrol from Vitis vinifera (grape vine) and Wogonin from S. baicalensis were identified as potential inhibitors against dfrA1 of S. typhi. Herniarin from Herniaria glabra (rupture worts) and Pyrocide from Daucus carota (carrot) were identified as the best leads against dfrA1 of V. cholerae. Taraxacin of T. officinale (weber) and Luteolin were identified as potential inhibitors against Mec1. Apigenin from Coffea arabica (coffee) and Luteolin were identified as the best leads against vanH of S. aureus. Our findings pave crucial insights for exploring alternative therapeutics against MDR pathogens.
    Full-text · Article · Jul 2013 · Journal of Biomolecular Structure & Dynamics