MEDIC: a practical disease vocabulary used at the Comparative Toxicogenomics Database

Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, ME 04672, USA.
Database The Journal of Biological Databases and Curation (Impact Factor: 4.46). 01/2012; 2012:bar065. DOI: 10.1093/database/bar065
Source: PubMed

ABSTRACT The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the effects of environmental chemicals on human health. CTD biocurators manually curate a triad of chemical–gene, chemical–disease and gene–disease relationships from the scientific literature. The CTD curation paradigm uses controlled vocabularies for chemicals, genes and diseases. To curate disease information, CTD first had to identify a source of controlled terms. Two resources seemed to be good candidates: the Online Mendelian Inheritance in Man (OMIM) and the ‘Diseases’ branch of the National Library of Medicine's Medical Subject Headers (MeSH). To maximize the advantages of both, CTD biocurators undertook a novel initiative to map the flat list of OMIM disease terms into the hierarchical nature of the MeSH vocabulary. The result is CTD’s ‘merged disease vocabulary’ (MEDIC), a unique resource that integrates OMIM terms, synonyms and identifiers with MeSH terms, synonyms, definitions, identifiers and hierarchical relationships. MEDIC is both a deep and broad vocabulary, composed of 9700 unique diseases described by more than 67 000 terms (including synonyms). It is freely available to download in various formats from CTD. While neither a true ontology nor a perfect solution, this vocabulary has nonetheless proved to be extremely successful and practical for our biocurators in generating over 2.5 million disease-associated toxicogenomic relationships in CTD. Other external databases have also begun to adopt MEDIC for their disease vocabulary. Here, we describe the construction, implementation, maintenance and use of MEDIC to raise awareness of this resource and to offer it as a putative scaffold in the formal construction of an official disease ontology.
Database URL:

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: This contribution demonstrates how to apply concepts of social network analysis on educational data. The main aim of this approach is to provide a deeper insight into the structure of courses and/or other learning units that belong to a given curriculum in order to improve the learning process. Presented work can help us to discover communities of similar study disciplines (based on the similarity measures of textual descriptions of their contents), as well as identify important courses strongly linked to others, and also find more independent and less important parts of the curriculum using centrality measures arising from the graph theory and social network analysis.
    5th International Workshop on Artificial Intelligence in Medical Applications, Lodz, Poland; 09/2015
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information, however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH(®)) or Online Mendelian Inheritance in Man (OMIM(®)). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency. The public release of the NCBI disease corpus contains 6,892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment where we compared three different knowledge-based disease normalization methods with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at:
    Journal of Biomedical Informatics 01/2014; 47. DOI:10.1016/j.jbi.2013.12.006 · 2.48 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Despite the central role of diseases in biomedical research, there have been much fewer attempts to automatically determine which diseases are mentioned in a text - the task of disease name normalization - compared with other normalization tasks in biomedical text mining research. In this article we introduce the first machine learning approach for disease name normalization (DNorm), using the NCBI Disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in very large optimization problems for information retrieval. We compare our method to several techniques based on lexical normalization and matching, MetaMap, and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively. The source code for DNorm is available at, along with a web-based demonstration and links to the NCBI Disease Corpus. Results on PubMed abstracts are available in PubTator:
    Bioinformatics 08/2013; 29(22). DOI:10.1093/bioinformatics/btt474 · 4.62 Impact Factor


Available from