Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining.

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands. .
Journal of Cheminformatics (Impact Factor: 4.54). 01/2010; 2(1):4. DOI: 10.1186/1758-2946-2-4
Source: PubMed

ABSTRACT :Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated which impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Over the past 15 years, the biomedical research community has increased its efforts to produce ontologies encoding biomedical knowledge, and to provide the corresponding infrastructure to maintain them. As ontologies are becoming a central part of biological and biomedical research, a communication channel to publish frequent updates and latest developments on them would be an advantage. Here, we introduce the JBMS thematic series on Biomedical Ontologies. The aim of the series is to disseminate the latest developments in research on biomedical ontologies and provide a venue for publishing newly developed ontologies, updates to existing ontologies as well as methodological advances, and selected contributions from conferences and workshops. We aim to give this thematic series a central role in the exploration of ongoing research in biomedical ontologies and intend to work closely together with the research community towards this aim. Researchers and working groups are encouraged to provide feedback on novel developments and special topics to be integrated into the existing publication cycles.
    Journal of Biomedical Semantics 03/2014; 5(1):15.
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Weighted semantic networks built from text-mined literature can be used to retrieve known protein-protein or gene-disease associations, and have been shown to anticipate associations years before they are explicitly stated in the literature. Our text-mining system recognizes over 640,000 biomedical concepts: some are specific (i.e., names of genes or proteins) others generic (e.g., 'Homo sapiens'). Generic concepts may play important roles in automated information retrieval, extraction, and inference but may also result in concept overload and confound retrieval and reasoning with low-relevance or even spurious links. Here, we attempted to optimize the retrieval performance for protein-protein interactions (PPI) by filtering generic concepts (node filtering) or links to generic concepts (edge filtering) from a weighted semantic network. First, we defined metrics based on network properties that quantify the specificity of concepts. Then using these metrics, we systematically filtered generic information from the network while monitoring retrieval performance of known protein-protein interactions. We also systematically filtered specific information from the network (inverse filtering), and assessed the retrieval performance of networks composed of generic information alone. Filtering generic or specific information induced a two-phase response in retrieval performance: initially the effects of filtering were minimal but beyond a critical threshold network performance suddenly drops. Contrary to expectations, networks composed exclusively of generic information demonstrated retrieval performance comparable to unfiltered networks that also contain specific concepts. Furthermore, an analysis using individual generic concepts demonstrated that they can effectively support the retrieval of known protein-protein interactions. For instance the concept "binding" is indicative for PPI retrieval and the concept "mutation abnormality" is indicative for gene-disease associations. Generic concepts are important for information retrieval and cannot be removed from semantic networks without negative impact on retrieval performance.
    PLoS ONE 01/2013; 8(11):e78665. · 3.53 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: In the Semantic Enrichment of the Scientific Literature (SESL) project, researchers from academia and from life science and publishing companies collaborated in a pre-competitive way to integrate and share information for type 2 diabetes mellitus (T2DM) in adults. This case study exposes benefits from semantic interoperability after integrating the scientific literature with biomedical data resources, such as UniProt Knowledgebase (UniProtKB) and the Gene Expression Atlas. We annotated scientific documents in a standardized way, by applying public terminological resources for diseases and proteins, and other text-mining approaches. Eventually, we compared the genetic causes of T2DM across the data resources to demonstrate the benefits from the SESL triple store. Our solution enables publishers to distribute their content with little overhead into remote data infrastructures, such as into any Virtual Knowledge Broker.
    Drug discovery today 11/2013; · 6.63 Impact Factor

Full-text (4 Sources)

Available from
May 22, 2014