Article · PDF available

Abstract

The most accurate approaches to Word Sense Disambiguation (WSD) for biomedical documents are based on supervised learning. However, these require manually labeled training examples, which are expensive to create, and consequently supervised WSD systems are normally limited to disambiguating a small set of ambiguous terms. An alternative approach is to create labeled training examples automatically and use them as a substitute for manually labeled ones. This paper describes a large-scale WSD system based on automatically labeled examples generated using information from the UMLS Metathesaurus. The labeled examples are generated without any use of labeled training data whatsoever, so the approach is completely unsupervised (unlike some previous approaches). The system is evaluated on two widely used data sets and found to outperform a state-of-the-art unsupervised approach which also uses information from the UMLS Metathesaurus.
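To make the idea of automatically generated training data concrete, the following is a minimal sketch, not the paper's actual pipeline: contexts that contain a synonym which unambiguously denotes one candidate concept are treated as labeled examples for that concept and used to train an ordinary classifier. The corpus, the synonym lists and the CUIs are illustrative, and scikit-learn is used purely for convenience.

```python
# Minimal sketch of automatic example generation for WSD (illustrative only).
# Assumptions: `corpus` is an in-memory list of abstract strings and
# `unambiguous_synonyms` maps each candidate concept (CUI) of the ambiguous
# term to synonyms that have only that meaning.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def build_pseudo_labeled_examples(corpus, unambiguous_synonyms):
    """Label a context with a CUI whenever it contains a synonym that can
    only refer to that CUI; no manual annotation is involved."""
    texts, labels = [], []
    for cui, synonyms in unambiguous_synonyms.items():
        for doc in corpus:
            if any(syn.lower() in doc.lower() for syn in synonyms):
                texts.append(doc)
                labels.append(cui)
    return texts, labels

def train_wsd_classifier(texts, labels):
    """Train a simple bag-of-words classifier on the pseudo-labeled contexts."""
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    features = vectorizer.fit_transform(texts)
    model = MultinomialNB().fit(features, labels)
    return vectorizer, model

# Usage: disambiguate a new occurrence of the ambiguous term "cold".
corpus = [
    "The patient presented with a common cold and mild fever.",
    "Chronic obstructive lung disease was confirmed by spirometry.",
    "Low temperature storage prevented degradation of the sample.",
]
unambiguous_synonyms = {  # illustrative CUIs and synonym lists
    "C0009443": ["common cold"],
    "C0024117": ["chronic obstructive lung disease"],
    "C0009264": ["low temperature"],
}
texts, labels = build_pseudo_labeled_examples(corpus, unambiguous_synonyms)
vec, clf = train_wsd_classifier(texts, labels)
print(clf.predict(vec.transform(["The cold resolved without antibiotics."])))
```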
... We explore three different WSD systems for the biomedical domain: a general personalized PageRank (PPR) based system [22], which we apply to the biomedical domain; a vector space model (VSM) based WSD system [23], which is applicable to any domain but tuned to biomedical texts; and MetaMap [12], which is designed to associate terms in biomedical documents with UMLS CUIs. We also present results based on a random sense baseline. ...
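As a rough illustration of what a VSM-based WSD system does, the sketch below represents each candidate sense by a vector built from its definition text and picks the sense whose vector is closest to the context vector. The definitions, the target sentence and the use of TF-IDF are assumptions for illustration, not details of the system cited above.

```python
# Minimal vector space model (VSM) WSD sketch: closest sense-definition wins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sense_definitions = {  # illustrative CUIs and glosses
    "C0009443": "common cold acute viral infection of the upper respiratory tract",
    "C0009264": "cold temperature absence of heat low degree of warmth",
}
context = "the patient recovered from the cold after a week of rest and fluids"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(sense_definitions.values()) + [context])
similarities = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best_sense = list(sense_definitions)[similarities.argmax()]
print(best_sense, similarities)
```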
Article
Full-text available
Background The volume of research published in the biomedical domain has increasingly led researchers to focus on specific areas of interest, with the result that connections between findings are missed. Literature based discovery (LBD) attempts to address this problem by searching for previously unnoticed connections between published information (also known as “hidden knowledge”). A common approach is to identify hidden knowledge via shared linking terms. However, biomedical documents are highly ambiguous, which can lead LBD systems to over-generate hidden knowledge by hypothesising connections through different meanings of linking terms. Word Sense Disambiguation (WSD) aims to resolve ambiguities in text by identifying the meaning of ambiguous terms. This study explores the effect of WSD accuracy on LBD performance. Methods An existing LBD system is employed and four approaches to WSD of biomedical documents are integrated with it. The accuracy of each WSD approach is determined by comparing its output against a standard benchmark. Evaluation of the LBD output is carried out using a time-slicing approach, where hidden knowledge is generated from articles published prior to a certain cutoff date and a gold standard is extracted from publications after the cutoff date. Results WSD accuracy varies depending on the approach used. The connection between the performance of the LBD and WSD systems is analysed, revealing a correlation between WSD accuracy and LBD performance. Conclusion This study reveals that LBD performance is sensitive to WSD accuracy. It is therefore concluded that WSD has the potential to improve the output of LBD systems by reducing the amount of spurious hidden knowledge that is generated. It is also suggested that further improvements in WSD accuracy have the potential to improve LBD accuracy.
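A minimal sketch of the time-slicing evaluation described above, assuming the classic ABC model of linking terms: hypotheses are generated from term pairs that share a linking term before the cutoff date and scored against pairs that first co-occur after it. The toy documents are illustrative; the real LBD system and gold-standard construction are considerably more involved.

```python
# Time-slicing evaluation sketch for literature based discovery (illustrative).
from itertools import combinations

def cooccurring_pairs(documents):
    """Collect unordered term pairs that co-occur within a document."""
    pairs = set()
    for terms in documents:
        pairs.update(frozenset(p) for p in combinations(sorted(set(terms)), 2))
    return pairs

def hidden_knowledge(pre_cutoff_docs):
    """Hypothesise A-C links that share a linking term B but never co-occur
    directly before the cutoff date (the classic ABC model)."""
    known = cooccurring_pairs(pre_cutoff_docs)
    neighbours = {}
    for pair in known:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    candidates = set()
    for linked in neighbours.values():
        for a, c in combinations(sorted(linked), 2):
            if frozenset((a, c)) not in known:
                candidates.add(frozenset((a, c)))
    return candidates

def timeslice_precision(pre_cutoff_docs, post_cutoff_docs):
    """Score hypotheses against pairs that first co-occur after the cutoff."""
    candidates = hidden_knowledge(pre_cutoff_docs)
    gold = cooccurring_pairs(post_cutoff_docs) - cooccurring_pairs(pre_cutoff_docs)
    return len(candidates & gold) / len(candidates) if candidates else 0.0

# Toy usage: each document is just a list of (already disambiguated) terms.
pre = [["raynaud disease", "blood viscosity"], ["blood viscosity", "fish oil"]]
post = [["raynaud disease", "fish oil"]]
print(timeslice_precision(pre, post))  # 1.0 for this toy example
```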
... For the biomedical domain, the majority of previous works center around two WSD datasets (Weeber et al., 2001; Jimeno-Yepes et al., 2011) that together contain 253 ambiguous words, multi-word terms, and abbreviations. In addition, Stevenson et al. (2008), Fan et al. (2009), and Cheng et al. (2012) propose methods to generate labeled data. As for methodologies, vector space models (McInnes, 2008; Savova et al., 2008) are a common choice. ...
Conference Paper
Complex noun phrases are pervasive in biomedical texts, but are largely underexplored in entity discovery and information extraction. Such expressions often contain a mix of highly specific names (diseases, drugs, etc.) and common words such as “condition”, “degree”, “process”, etc. These words can have different semantic types depending on their context in noun phrases. In this paper, we address the task of classifying these common words onto fine-grained semantic types: for instance, “condition” can be typed as “symptom and finding” or “configuration and setting”. For information extraction tasks, it is crucial to consider common nouns only when they really carry biomedical meaning; hence the classifier must also detect the negative case when nouns are merely used in a generic, uninformative sense. Our solution harnesses a small number of labeled seeds and employs label propagation, a semi-supervised learning method on graphs. Experiments on 50 frequent nouns show that our method computes semantic labels with a micro-averaged accuracy of 91.34%.
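A minimal sketch of the semi-supervised idea: a few seed-labeled occurrences of a common noun propagate their fine-grained types to unlabeled occurrences. The construction here is a stand-in (scikit-learn's LabelPropagation over bag-of-words vectors), not the paper's own graph, and the contexts and type labels are illustrative.

```python
# Label propagation sketch: seed labels spread to unlabeled occurrences.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelPropagation

# Occurrences of the common noun "condition" in context (illustrative data).
contexts = [
    "the condition worsened and the patient developed a fever",
    "a rare condition diagnosed after repeated blood tests",
    "the experiment was run under a controlled condition",
    "each trial condition used a different buffer concentration",
    "the condition improved after treatment with antibiotics",
    "samples were stored under the same condition as before",
]
# 0 = "symptom and finding", 1 = "configuration and setting", -1 = unlabeled.
seed_labels = np.array([0, 0, 1, 1, -1, -1])

vectors = TfidfVectorizer().fit_transform(contexts).toarray()
model = LabelPropagation(kernel="knn", n_neighbors=3).fit(vectors, seed_labels)
print(model.transduction_[-2:])  # inferred types for the two unlabeled occurrences
```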
... Structural Semantic Integration (SSI) and SSI+Information Content (SSI+IC) [38] use a model from the Metathesaurus that is enriched by co-occurrence information available from the UMLS distribution. PageRank [2] uses a graph-based approach to perform the selection (we use the results presented in [16]). MRD+KMeans and AEC+KMeans combine MRD and AEC predictions with k-means [22]. ...
Article
Text mining of scientific literature has been essential for setting up large public biomedical databases, which are being widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KBs) has enabled a myriad of machine learning methods for different text mining related tasks. Unfortunately, KBs have been devised for human interpretation rather than for text mining tasks, so the performance of KB-based methods is usually lower than that of supervised machine learning methods. The disadvantage of supervised methods, though, is that they require labelled training data and are therefore not useful for large-scale biomedical text mining systems. KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method takes into account not only the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabelled corpora such as MEDLINE. The parameters of the model have been estimated without training data. Patterns from MEDLINE have been built using MetaMap for entity recognition and related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD). The results showed that our method obtained a higher degree of accuracy than other state-of-the-art approaches when evaluated on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations. These results also showed an increase in performance over existing baseline retrieval approaches.
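The following is a minimal sketch of how word-concept probabilities can be used for WSD, assuming each candidate concept already has a word-count profile derived from its KB description (the paper additionally exploits MEDLINE co-occurrence patterns, which are not modelled here): the concept maximising a smoothed log probability of the context is selected. The profiles and CUIs are illustrative.

```python
# Word-concept probability WSD sketch (illustrative profiles and smoothing).
import math
from collections import Counter

concept_profiles = {
    "C0009443": Counter({"virus": 8, "rhinitis": 5, "fever": 4, "nasal": 3}),
    "C0009264": Counter({"temperature": 9, "storage": 4, "degrees": 3, "ice": 2}),
}

def disambiguate(context_words, profiles, alpha=1.0):
    """Pick the concept maximising log P(concept) + sum of log P(word | concept),
    with add-alpha smoothing over the combined vocabulary."""
    total_mass = sum(sum(p.values()) for p in profiles.values())
    vocab = {w for p in profiles.values() for w in p}
    best, best_score = None, float("-inf")
    for concept, profile in profiles.items():
        size = sum(profile.values())
        score = math.log(size / total_mass)  # prior from profile mass
        for w in context_words:
            score += math.log((profile[w] + alpha) / (size + alpha * len(vocab)))
        if score > best_score:
            best, best_score = concept, score
    return best

print(disambiguate(["patient", "fever", "nasal"], concept_profiles))
```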
... PPR [3] is also unsupervised and relies on a graph-based algorithm similar to PageRank that converts UMLS® into a graph where the possible meanings of ambiguous words are nodes and the relations between them are edges. AEC [23] and UB [11] are supervised learning algorithms that alleviate the problem of requiring manually annotated training data by querying Medline documents. Our methods achieve very good scores compared to unsupervised approaches in the literature and come close to semi-supervised ones. ...
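A minimal sketch of personalized PageRank WSD in the spirit of the PPR approach mentioned above: candidate meanings and related concepts form a graph, the random walk is biased towards the unambiguous concepts found in the context, and the candidate with the highest score is chosen. The toy graph and seed weights are illustrative; the real system derives the graph from the UMLS Metathesaurus.

```python
# Personalized PageRank WSD sketch over a toy concept graph.
import networkx as nx

# Concept graph: nodes are candidate meanings and related concepts,
# edges stand in for UMLS-style relations.
graph = nx.Graph()
graph.add_edges_from([
    ("cold_illness", "fever"), ("cold_illness", "virus"),
    ("cold_temperature", "ice"), ("cold_temperature", "storage"),
    ("fever", "virus"),
])

def ppr_disambiguate(graph, candidate_senses, context_concepts):
    """Rank candidate senses by personalized PageRank mass seeded on the
    unambiguous concepts found in the surrounding context."""
    personalization = {node: 0.0 for node in graph}
    for concept in context_concepts:
        if concept in personalization:
            personalization[concept] = 1.0 / len(context_concepts)
    scores = nx.pagerank(graph, alpha=0.85, personalization=personalization)
    return max(candidate_senses, key=lambda sense: scores.get(sense, 0.0))

print(ppr_disambiguate(graph, ["cold_illness", "cold_temperature"], ["fever", "virus"]))
```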
Article
This paper presents a novel method for semantic annotation and search of a target corpus using several knowledge resources (KRs). This method relies on a formal statistical framework in which KR concepts and corpus documents are homogeneously represented using statistical language models. Under this framework, we can perform all the necessary operations for an efficient and effective semantic annotation of the corpus. Firstly, we propose a coarse tailoring of the KRs w.r.t. the target corpus with the main goal of reducing the ambiguity of the annotations and their computational overhead. Then, we propose the generation of concept profiles, which allow measuring the semantic overlap of the KRs as well as performing a finer tailoring of them. Finally, we propose how to semantically represent documents and queries in terms of the KR concepts and how to use the statistical framework to perform semantic search. Experiments have been carried out with a corpus about web resources which includes several Life Sciences catalogues and Wikipedia pages related to web resources in general (e.g., databases, tools, services, etc.). Results demonstrate that the proposed method is more effective and efficient than state-of-the-art methods relying on either context-free annotation or keyword-based search.
... In unsupervised approaches, disambiguated training examples are not provided and clustering techniques are used to group instances that belong to the same sense of the target word [14]. More recent approaches to unsupervised learning include [15] and, subsequently, [16]. ...
Conference Paper
Full-text available
Word Sense Disambiguation (WSD) is a fundamental task in many Computational Linguistics applications. It consists of automatically identifying the sense of ambiguous words in context using computational methods. This work evaluates the automatic disambiguation performance of five machine learning classifiers: Naive Bayes, Support Vector Machines, Decision Trees, KStar and Maximum Entropy. For the classification we compare the performance of these algorithms using knowledge-rich and knowledge-poor features applied to Portuguese data.
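A minimal sketch of this kind of classifier comparison, using scikit-learn equivalents where they exist (KStar has no direct counterpart and is omitted, and logistic regression stands in for Maximum Entropy); the bag-of-words features and toy data are illustrative rather than the knowledge-rich and knowledge-poor features used in the paper.

```python
# Classifier comparison sketch with cross-validated accuracy on toy WSD data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression  # maximum-entropy stand-in

contexts = [
    "the bank approved the loan application", "interest rates at the bank rose",
    "we walked along the river bank at dusk", "the bank of the river flooded",
] * 5  # repeat so 5-fold cross-validation has enough instances per class
labels = ["finance", "finance", "river", "river"] * 5

X = CountVectorizer().fit_transform(contexts)
classifiers = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "MaxEnt": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, labels, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```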
Article
Objective The use of electronic health records (EHR) systems has grown over the past decade, and with it, the need to extract information from unstructured clinical narratives. Clinical notes, however, frequently contain acronyms with several potential senses (meanings), and traditional natural language processing (NLP) techniques cannot differentiate between these senses. In this study we introduce a semi-supervised method for binary acronym disambiguation, the task of classifying a target sense for acronyms in clinical EHR notes. Methods We developed a semi-supervised ensemble machine learning (CASEml) algorithm to automatically identify when an acronym means a target sense by leveraging semantic embeddings, visit-level text and billing information. The algorithm was validated using note data from the Veterans Affairs hospital system to classify the meaning of three acronyms: RA, MS, and MI. We compared the performance of CASEml against another standard semi-supervised method and a baseline metric selecting the most frequent acronym sense. Along with evaluating the performance of these methods for specific instances of acronyms, we evaluated the impact of acronym disambiguation on NLP-driven phenotyping of rheumatoid arthritis. Results CASEml achieved accuracies of 0.947, 0.911, and 0.706 for RA, MS, and MI, respectively, higher than a standard baseline metric and (on average) higher than a state-of-the-art semi-supervised method. We also demonstrated that applying CASEml to medical notes improves the AUC of a phenotype algorithm for rheumatoid arthritis. Conclusion CASEml is a novel method that accurately disambiguates acronyms in clinical notes and has advantages over commonly used supervised and semi-supervised machine learning approaches. In addition, CASEml improves the performance of NLP tasks that rely on ambiguous acronyms, such as phenotyping.
Article
Full-text available
This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff's alpha as well as Scott's pi and Cohen's kappa; discusses the use of coefficients in several annotation tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks -- but that their use makes the interpretation of the value of the coefficient even harder.
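As a concrete example of a chance-corrected agreement coefficient of the kind surveyed above, the sketch below computes Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), from two annotators' labels; the annotator labels are illustrative.

```python
# Cohen's kappa sketch: observed vs. chance-expected agreement.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's marginals."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

annotator_1 = ["sense1", "sense1", "sense2", "sense2", "sense1"]
annotator_2 = ["sense1", "sense2", "sense2", "sense2", "sense1"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # about 0.615
```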
Conference Paper
Full-text available
We have applied five supervised learning approaches to word sense disambiguation in the medical domain. Our objective is to evaluate Support Vector Machines (SVMs) in comparison with other well known supervised learning algorithms including the naïve Bayes classifier, C4.5 decision trees, decision lists and boosting approaches. Based on these results we introduce further refinements of these approaches. We have made use of unigram and bigram features selected using different frequency cut-off values and window sizes along with the statistical significance test of the log likelihood measure for bigrams. Our results show that overall, the best SVM model was most accurate in 27 of 60 cases, compared to 22, 14, 10 and 14 for the naïve Bayes, C4.5 decision trees, decision list and boosting methods respectively.
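A minimal sketch of selecting bigram features with a frequency cut-off and the log-likelihood measure, in the spirit of the feature selection described above; it uses NLTK's collocation utilities on a toy token sequence rather than the original data.

```python
# Bigram selection sketch: frequency cut-off plus log-likelihood ranking.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("the patient was discharged after the blood pressure returned to "
          "normal and the blood pressure was checked again at follow up").split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # frequency cut-off: keep bigrams seen at least twice
top_bigrams = finder.nbest(measures.likelihood_ratio, 5)
print(top_bigrams)  # e.g. [('blood', 'pressure'), ...]
```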
Article
Full-text available
Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of biomedical texts. Previous approaches to resolving this problem have made use of various sources of information including linguistic features of the context in which the ambiguous term is used and domain-specific resources, such as UMLS. We compare various sources of information including ones which have been previously used and a novel one: MeSH terms. Evaluation is carried out using a standard test set (the NLM-WSD corpus). The best performance is obtained using a combination of linguistic features and MeSH terms. Performance of our system exceeds previously published results for systems evaluated using the same data set. Disambiguation of biomedical terms benefits from the use of information from a variety of sources. In particular, MeSH terms have proved to be useful and should be used if available.
Conference Paper
Word sense disambiguation (WSD) is an intermediate task within information retrieval and information extraction, attempting to select the proper sense of ambiguous words. Due to the scarcity of training data, semi-supervised learning, which profits from seed annotated examples and a large set of unlabeled data, is worth researching. We present preliminary results of two semi-supervised learning algorithms on biomedical word sense disambiguation. Both methods add relevant unlabeled examples to the training set, and optimal parameters are similar for each ambiguous word.
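A minimal sketch of one common semi-supervised scheme consistent with the description above (self-training): the classifier trained on seed examples repeatedly adds the unlabeled examples it is most confident about to its own training set. The data, the confidence threshold and the choice of logistic regression are illustrative assumptions, not details of the paper's algorithms.

```python
# Self-training sketch: grow the training set with confidently labeled examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

seed_texts = ["cold virus caused nasal congestion", "samples frozen at cold temperature"]
seed_labels = ["illness", "temperature"]
unlabeled_texts = [
    "the cold was accompanied by fever and cough",
    "cold storage of the reagent at four degrees",
]

vectorizer = TfidfVectorizer().fit(seed_texts + unlabeled_texts)
texts, labels = list(seed_texts), list(seed_labels)
pool = list(unlabeled_texts)

for _ in range(3):  # a few self-training rounds
    model = LogisticRegression().fit(vectorizer.transform(texts), labels)
    if not pool:
        break
    probs = model.predict_proba(vectorizer.transform(pool))
    confident = [i for i, p in enumerate(probs) if p.max() >= 0.6]
    if not confident:
        break
    for i in sorted(confident, reverse=True):  # move confident examples out of the pool
        texts.append(pool.pop(i))
        labels.append(model.classes_[probs[i].argmax()])

print(len(texts), "training examples after self-training")
```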