Article

Abstract

Ambiguity in the biomedical domain is a major issue when performing Natural Language Processing tasks over the huge amount of information available in the field. For this reason, Word Sense Disambiguation is critical for building accurate systems able to tackle complex tasks such as information extraction, summarization or document classification. In this work we explore whether multilinguality can help to solve the problem of ambiguity, and the conditions required for a system to improve on the results obtained by monolingual approaches. We also analyse the best ways to generate those useful multilingual resources, and study different languages and sources of knowledge. The proposed system, based on co-occurrence graphs containing biomedical concepts and textual information, is evaluated on a test dataset frequently used in biomedicine. We conclude that multilingual resources provide a clear improvement of more than 7% over monolingual approaches when the graphs are built from a small number of documents. Empirical results also show that automatically translated resources are a useful source of information for this particular task.
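The system described above builds co-occurrence graphs of biomedical concepts from (possibly multilingual) document collections. As a rough illustration of the idea only, and assuming documents have already been mapped to language-independent concept identifiers such as UMLS CUIs, a graph of this kind might be assembled and merged across languages as in the following sketch (not the authors' implementation):

```python
# Illustrative sketch only: building a co-occurrence graph from documents that
# have already been mapped to concept identifiers (placeholder CUI-like IDs).
from itertools import combinations
import networkx as nx

def build_cooccurrence_graph(documents):
    """documents: iterable of lists of concept identifiers (one list per abstract)."""
    graph = nx.Graph()
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)
    return graph

# Multilingual setting: graphs built from corpora in different languages can be
# merged on the shared (language-independent) concept identifiers.
def merge_graphs(graphs):
    merged = nx.Graph()
    for g in graphs:
        for a, b, data in g.edges(data=True):
            w = data.get("weight", 1)
            if merged.has_edge(a, b):
                merged[a][b]["weight"] += w
            else:
                merged.add_edge(a, b, weight=w)
    return merged

english = build_cooccurrence_graph([["C0002645", "C0003232", "C0011849"]])
spanish = build_cooccurrence_graph([["C0002645", "C0011849", "C0020538"]])
print(merge_graphs([english, spanish]).number_of_edges())
```

Because the nodes are concept identifiers rather than surface words, graphs derived from corpora in different languages can be merged directly, which is what makes a multilingual knowledge base of this kind possible.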


... Today many organizations consider the user's feelings, emotions, opinions and sentiments, shared on public social media networks, as feedback on their products, services or events. Sentiment analysis is part of Natural Language Processing (NLP) [1,2,4,9,11,19,20]. Social media networks such as Twitter, Facebook, Google+, Myspace, Tumblr, LinkedIn, Xing, Pinterest, Disqus, Renren, Snapchat, Twoo, YouTube, Instagram, Quora and forums serve as the major public platforms where common people, customers, governments, employees, businessmen, industry people, professionals, students and politicians share their views and opinions on products, services, movies, educational institutions, professional experiences, government policies and politics. ...
... 90% of customers' decisions depend on social media reviews [5]. These discussions generate an enormous amount of textual data, and a great deal of knowledge is hidden behind these views and opinions; sentiment analysis is used to extract that knowledge from them [3,9,10]. ...
... Social network data contain noise: irrelevant content, symbols, numbers, slang words, stop words and ambiguous words that in no way help to analyze the data [1,4,5,6,9,11,12,13,15,18,20,21]. ...
Article
Full-text available
Resolving the ambiguities of homographs and auto-antonyms in sentiment analysis is a challenging intellectual problem that arises when predicting the sentiments of online textual messages such as tweets and reviews. Word-of-mouth (WOM) data sets on social media networks, such as online tweets, contain ambiguous terms that mislead the analytical process. Existing bipolar and Natural Language Processing approaches cannot disambiguate these terms, lead to high overheads and are unable to give accurate results. Homograph and auto-antonym ambiguities affect solution quality in addition to other problems such as pre-processing analytical data and text normalization tasks, including tokenization, stop-word removal, stemming, number transformation, and removal of re-tweets, URLs, symbols, tags, etc. In this study, in order to resolve homograph and auto-antonym ambiguities, the one-sense-per-collocation property of Yarowsky's semi-supervised algorithm is exploited; when applied to a large sample dataset, it improves sentiment classification accuracy compared with traditional algorithms.
... Making use of multilingual resources for analysing a specific language seems to be a more fruitful approach [152,153,164]. It also yielded improved performance for word sense disambiguation in English [165]. ...
... Interestingly, segmentation methods developed for Japanese, which is written without spacing [33], could be successfully applied to English text where spacing between words was removed, such as in Optical Character Recognition (OCR) output where word spacing is often not captured properly. Duque et al. [165] show that multilingual resources can be useful for processing English text: for a word sense disambiguation task, multilingual resources yield a 7% improvement in performance compared to monolingual resources. ...
Article
Full-text available
Background: Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area. Main body: We envision three groups of intended readers: (1) NLP researchers leveraging experience gained in other languages, (2) NLP researchers faced with establishing clinical text processing in a language other than English, and (3) clinical informatics researchers and practitioners looking for resources in their languages in order to apply NLP techniques and tools to clinical practice and/or investigation. We review work in clinical NLP in languages other than English. We classify these studies into three groups: (i) studies describing the development of new NLP systems or components de novo, (ii) studies describing the adaptation of NLP architectures developed for English to another language, and (iii) studies focusing on a particular clinical application. Conclusion: We show the advantages and drawbacks of each method, and highlight the appropriate application context. Finally, we identify major challenges and opportunities that will affect the impact of NLP on clinical practice and public health studies in a context that encompasses English as well as other languages.
... Sabbir exploits recent advances in neural word/concept embeddings to improve the performance of biomedical WSD on the MSH dataset [26]. Duque presents a biomedical WSD system based on co-occurrence graphs that contain biomedical concepts and textual information [27]. Rais exploits semantic similarity and relatedness measures from biomedical resources to evaluate the influence of context window size on WSD [28]. ...
Article
Full-text available
Biomedical words often have multiple senses, and biomedical word sense disambiguation (WSD) is therefore an important research issue in the biomedical field. Biomedical WSD refers to the process of determining the meaning of an ambiguous word according to its context, and it is now widely applied in processing, translating and retrieving biomedical texts. In order to improve WSD accuracy in biomedicine, this paper proposes a new WSD method based on a graph attention network (GAT). Words, parts of speech and semantic categories in the context of the ambiguous word are used as disambiguation features. The disambiguation features and the sentence are used as nodes to construct the WSD graph. The GAT is used to extract discriminative features, and a softmax function is applied to determine the semantic category of the ambiguous biomedical word. The MSH dataset is used to optimize the GAT-based WSD classifier and test its accuracy. Experiments show that the average accuracy of the proposed method is improved. In addition, a majority voting strategy is adopted to further optimize the GAT-based WSD classifier.
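As an illustration of the architecture sketched in this abstract (the sentence and the disambiguation features as nodes of a small graph processed by graph attention layers), the following is a minimal, hypothetical sketch using PyTorch Geometric; node features, graph layout and layer sizes are invented, and this is not the published model:

```python
# Illustrative sketch (not the published model): a small graph attention network
# over a WSD graph whose nodes are the sentence and its disambiguation features,
# classifying the sense from the sentence node. Node features are random placeholders.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class GATWSD(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_senses, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.gat2 = GATConv(hidden_dim * heads, hidden_dim, heads=1)
        self.classifier = nn.Linear(hidden_dim, num_senses)

    def forward(self, x, edge_index, sentence_node=0):
        h = torch.relu(self.gat1(x, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        return self.classifier(h[sentence_node])   # softmax is applied in the loss

# Toy graph: node 0 is the sentence, nodes 1-4 are disambiguation features.
x = torch.randn(5, 16)
edge_index = torch.tensor([[0, 0, 0, 0, 1, 2, 3, 4],
                           [1, 2, 3, 4, 0, 0, 0, 0]], dtype=torch.long)
model = GATWSD(in_dim=16, hidden_dim=32, num_senses=2)
print(model(x, edge_index).shape)   # torch.Size([2])
```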
... They adopted hidden Markov models and validated this approach using a labeled corpus consisting of about 500,000 words. Duque et al. (2016) explored whether multilingualism can help solve problems of ambiguity in WSD, and the conditions required for a system to improve on the results obtained using a monolingual approach. They determined the optimal means of generating those useful multilingual resources, and they studied different languages and sources of knowledge. ...
Conference Paper
Full-text available
Word Sense Disambiguation (WSD) is the process of automatically identifying the appropriate meaning of a word given its sentence. WSD is a promising research area in computational linguistics, especially for a wide range of advanced applications, such as the medical and social sciences. This research employs WSD to determine the inherent meaning of voter intentions regarding possible political candidates, so that candidates can be examined and their true assets and competencies in the three major input areas of eligibility, education and experience can be deciphered. Data envelopment analysis (DEA) is used to determine the underlying word instances for the elected and successful outputs. The results demonstrate the validity of using DEA as a tool for WSD. The results also indicate that the survey administered by the website developed for the purpose of this research, and used in this study, is a promising tool for predicting successful presidential candidates.
... This technique for building the co-occurrence graph has been previously used for general WSD tasks, such as Cross-Lingual WSD [24], with successful results, which suggests that a similar approach could also lead to competitive results in domain-specific WSD. The proposed technique can also be used for analysing the implications of adding new, potentially useful aspects to the WSD task in the biomedical domain, such as multilinguality [25]. In that work, information from multilingual corpora is added to the co-occurrence graphs used in the disambiguation process, to test whether the use of smaller multilingual corpora can achieve results similar to those obtained through the use of large monolingual corpora. ...
Article
Word Sense Disambiguation is a key step for many Natural Language Processing tasks (e.g. summarization, text classification, relation extraction) and presents a challenge to any system that aims to process documents from the biomedical domain. In this paper, we present a new graph-based unsupervised technique to address this problem. The knowledge base used in this work is a graph built with co-occurrence information from medical concepts found in scientific abstracts, and hence adapted to the specific domain. Unlike other unsupervised approaches based on static graphs such as UMLS, in this work the knowledge base takes the context of the ambiguous terms into account. Abstracts downloaded from PubMed are used for building the graph, and disambiguation is performed using the Personalized PageRank algorithm. Evaluation is carried out over two test datasets widely explored in the literature. Different parameters of the system are also evaluated to test robustness and scalability. Results show that the system is able to outperform state-of-the-art knowledge-based systems, obtaining more than 10% accuracy improvement in some cases, while only requiring minimal external resources.
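A minimal sketch of the disambiguation step described above, Personalized PageRank over a co-occurrence graph with the restart distribution concentrated on the context concepts, might look as follows (the toy graph and node names are invented; this is not the published system):

```python
# Illustrative sketch of Personalized PageRank disambiguation over a
# co-occurrence graph (hypothetical toy data only).
import networkx as nx

def disambiguate(graph, context_nodes, candidate_senses, alpha=0.85):
    # Restart probability is concentrated on the context nodes.
    context = [n for n in context_nodes if n in graph]
    if not context:
        return None
    personalization = {n: 1.0 / len(context) for n in context}
    ranks = nx.pagerank(graph, alpha=alpha,
                        personalization=personalization, weight="weight")
    return max(candidate_senses, key=lambda s: ranks.get(s, 0.0))

g = nx.Graph()
g.add_weighted_edges_from([
    ("cold_temperature", "weather", 3), ("cold_infection", "virus", 4),
    ("virus", "fever", 2), ("weather", "winter", 2),
])
print(disambiguate(g, ["virus", "fever"], ["cold_temperature", "cold_infection"]))
```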
Article
Full-text available
We employ the concept of word sense disambiguation to determine the inherent meaning of voter intentions regarding possible political candidates in the 2016 Presidential election. We present our findings based on a website (www.presidentselect.com) that we developed, where candidates can be examined and their true assets and competencies in the three major input areas of eligibility, education and experience can be deciphered. Data envelopment analysis is used to determine the underlying word instances for the elected and successful outputs. We also use our website results to longitudinally extend these findings to decision making for potential election fraud detection in the 2020 Presidential election, utilizing Benford's Law. Our results shed light on these phenomena and provide new insights into the word sense disambiguation literature.
Chapter
Due to the ever-evolving nature of human languages, their inherent ambiguity needs to be dealt with by researchers. Word sense disambiguation (WSD) is a classical problem of natural language processing that refers to identifying the most appropriate sense of a given word in the concerned context. WordNet-graph-based approaches are used by several state-of-the-art methods for performing WSD. This paper presents a novel genetic-algorithm-based approach for performing WSD using a fuzzy WordNet graph, in which the fitness function is calculated using fuzzy global measures of graph connectivity. To propose this fitness function, a comparative study is performed for the global measures of edge density, entropy and compactness. An analytical insight is also provided by presenting a visualization of the control terms for word sense disambiguation in the research papers indexed in Web of Science from 2013 to 2018.
Conference Paper
Full-text available
Recent years have seen a dramatic growth in the popularity of word embeddings mainly owing to their ability to capture semantic information from massive amounts of textual content. As a result, many tasks in Natural Language Processing have tried to take advantage of the potential of these distributional models. In this work, we study how word embeddings can be used in Word Sense Disambiguation, one of the oldest tasks in Natural Language Processing and Artificial Intelligence. We propose different methods through which word embeddings can be leveraged in a state-of-the-art supervised WSD system architecture, and perform a deep analysis of how different parameters affect performance. We show how a WSD system that makes use of word embeddings alone, if designed properly, can provide significant performance improvement over a state-of-the-art WSD system that incorporates several standard WSD features.
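One common way to leverage embeddings in a supervised WSD architecture is to average the vectors of the context words and feed the result to a classifier. The sketch below illustrates that general idea only, with a random embedding table standing in for pre-trained vectors; it is not the specific system evaluated in the paper:

```python
# Illustrative sketch: averaged word embeddings of the context window as
# features for a supervised WSD classifier. The embedding table is random here;
# in practice pre-trained vectors (e.g. word2vec/GloVe) would be loaded.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
DIM = 50
embeddings = {}  # hypothetical lookup: word -> vector

def embed(word):
    if word not in embeddings:
        embeddings[word] = rng.normal(size=DIM)
    return embeddings[word]

def context_vector(tokens):
    return np.mean([embed(t) for t in tokens], axis=0)

# Tiny labelled set for the ambiguous word "cold": sense 0 = temperature, 1 = illness.
train = [
    (["harsh", "winter", "wind"], 0),
    (["freezing", "snow", "night"], 0),
    (["caught", "virus", "fever"], 1),
    (["sore", "throat", "cough"], 1),
]
X = np.vstack([context_vector(toks) for toks, _ in train])
y = [label for _, label in train]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([context_vector(["runny", "nose", "fever"])]))
```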
Conference Paper
Full-text available
In this paper, we introduce a knowledge-based method to disambiguate biomedical acronyms using second-order co-occurrence vectors. We create these vectors using information about a long-form obtained from the Unified Medical Language System and Medline. We evaluate this method on a dataset of 18 acronyms found in biomedical text. Our method achieves an overall accuracy of 89%. The results show that using second-order features provide a distinct representation of the long-form and potentially enhances automated disambiguation.
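The following toy sketch illustrates the general idea of second-order co-occurrence vectors only (not the published method or its UMLS/Medline features): each word receives a first-order co-occurrence vector, contexts and long forms are represented by summing those vectors, and candidates are ranked by cosine similarity:

```python
# Illustrative sketch of second-order co-occurrence vectors on a toy corpus.
from collections import Counter, defaultdict
import math

corpus = [
    "magnetic resonance imaging scan of the brain",
    "the brain scan showed a lesion",
    "myocardial infarction damages heart muscle",
    "chest pain caused by myocardial infarction",
]
vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}

cooc = defaultdict(Counter)          # first-order co-occurrence counts
for line in corpus:
    words = line.split()
    for w in words:
        for other in words:
            if other != w:
                cooc[w][other] += 1

def second_order(words):
    vec = [0.0] * len(vocab)
    for w in words:
        for other, count in cooc[w].items():
            vec[index[other]] += count
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

senses = {"magnetic resonance imaging": second_order("magnetic resonance imaging".split()),
          "myocardial infarction": second_order("myocardial infarction".split())}
context = second_order("the scan of the brain".split())
print(max(senses, key=lambda s: cosine(context, senses[s])))
```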
Article
Full-text available
Automated Word Sense Disambiguation in clinical documents is a prerequisite to accurate extraction of medical information. Emerging methods utilizing hyperdimensional computing present new approaches to this problem. In this paper, we evaluate one such approach, the Binary Spatter Code Word Sense Disambiguation algorithm, on 50 ambiguous abbreviation sets derived from clinical notes. This algorithm uses reversible vector transformations to encode ambiguous terms and their context-specific senses into vectors representing surrounding terms. The sense for a new context is then inferred from vectors representing the terms it contains. One-to-one BSC-WSD achieves average accuracy of 94.55% when considering the orientation and distance of neighboring terms relative to the target abbreviation, outperforming Support Vector Machine and Naïve Bayes classifiers. Furthermore, it is practical to deal with all 50 abbreviations in an identical manner using a single one-to-many BSC-WSD model with average accuracy of 93.91%, which is not possible with common machine learning algorithms.
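The core Binary Spatter Code operations (XOR binding, majority-vote bundling, Hamming-distance comparison) can be illustrated with a toy example like the one below; it shows the representational mechanics only, not the published BSC-WSD algorithm:

```python
# Illustrative sketch of Binary Spatter Code operations used in hyperdimensional
# computing: binding by XOR, bundling by bit-wise majority, comparison by Hamming distance.
import numpy as np

rng = np.random.default_rng(42)
DIM = 10_000

def random_hv():
    return rng.integers(0, 2, size=DIM, dtype=np.uint8)

def bind(a, b):                     # XOR binding is its own inverse
    return np.bitwise_xor(a, b)

def bundle(vectors):                # majority vote per bit
    return (np.sum(vectors, axis=0) * 2 > len(vectors)).astype(np.uint8)

def hamming(a, b):
    return np.count_nonzero(a != b) / DIM

words = {w: random_hv() for w in ["fever", "virus", "winter", "snow"]}
senses = {s: random_hv() for s in ["cold_illness", "cold_temperature"]}

# Encode training contexts by binding each neighbouring word with the observed sense.
memory = bundle([bind(words["fever"], senses["cold_illness"]),
                 bind(words["virus"], senses["cold_illness"]),
                 bind(words["winter"], senses["cold_temperature"]),
                 bind(words["snow"], senses["cold_temperature"])])

# Decode: unbind a new context word and find the closest sense vector.
probe = bind(memory, words["virus"])
print(min(senses, key=lambda s: hamming(probe, senses[s])))
```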
Article
Full-text available
To evaluate state-of-the-art unsupervised methods on the word sense disambiguation (WSD) task in the clinical domain. In particular, to compare graph-based approaches relying on a clinical knowledge base with bottom-up topic-modeling-based approaches. We investigate several enhancements to the topic-modeling techniques that use domain-specific knowledge sources. The graph-based methods use variations of PageRank and distance-based similarity metrics, operating over the Unified Medical Language System (UMLS). Topic-modeling methods use unlabeled data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC II) database to derive models for each ambiguous word. We investigate the impact of using different linguistic features for topic models, including UMLS-based and syntactic features. We use a sense-tagged clinical dataset from the Mayo Clinic for evaluation. The topic-modeling methods achieve 66.9% accuracy on a subset of the Mayo Clinic's data, while the graph-based methods only reach the 40-50% range, with a most-frequent-sense baseline of 56.5%. Features derived from the UMLS semantic type and concept hierarchies do not produce a gain over bag-of-words features in the topic models, but identifying phrases from UMLS and using syntax does help. Although topic models outperform graph-based methods, semantic features derived from the UMLS prove too noisy to improve performance beyond bag-of-words. Topic modeling for WSD provides superior results in the clinical domain; however, integration of knowledge remains to be effectively exploited.
Article
Full-text available
This paper presents a new open text word sense disambiguation method that combines the use of logical inferences with PageRank-style algorithms applied on graphs extracted from natural language documents. We evaluate the accuracy of the proposed algorithm on several sense-annotated texts, and show that it consistently outperforms the accuracy of other previously proposed knowledge-based word sense disambiguation methods. We also explore and evaluate methods that combine several open-text word sense disambiguation algorithms.
Conference Paper
Full-text available
In this paper we propose a new graph-based method that uses the knowledge in a LKB (based on WordNet) in order to perform unsupervised Word Sense Disambiguation. Our algorithm uses the full graph of the LKB efficiently, performing better than previous approaches in English all-words datasets. We also show that the algorithm can be easily ported to other languages with good results, with the only requirement of having a wordnet. In addition, we make an analysis of the performance of the algorithm, showing that it is efficient and that it could be tuned to be faster.
Conference Paper
Full-text available
This paper introduces an unsupervised vector approach to disambiguate words in biomedical text that can be applied to all-word disambiguation. We explore using contextual information from the Unified Medical Language System (UMLS) to describe the possible senses of a word. We experiment with automatically creating individualized stoplists to help reduce the noise in our dataset. We compare our results to SenseClusters and Humphrey et al. (2006) using the NLM-WSD dataset, and with SenseClusters using conflated data from the 2005 Medline Baseline.
Article
Full-text available
Word sense disambiguation (WSD) is the ability to identify the meaning of words in context in a computational manner. WSD is considered an AI-complete problem, that is, a task whose solution is at least as hard as the most difficult problems in artificial intelligence. We introduce the reader to the motivations for solving the ambiguity of words and provide a description of the task. We overview supervised, unsupervised, and knowledge-based approaches. The assessment of WSD systems is discussed in the context of the Senseval/Semeval campaigns, aiming at the objective evaluation of systems participating in several different disambiguation tasks. Finally, applications, open problems, and future directions are discussed.
Article
Full-text available
The mesoscopic structure of complex networks has proven a powerful level of description for understanding the linchpins of the system represented by the network. Nevertheless, the mapping of a series of relationships between elements, in terms of a graph, is sometimes not straightforward. Given that all the information we extract using complex network tools depends on this initial graph, it is mandatory to preprocess the data to build it in the most accurate manner. Here we propose a procedure to build a network attending only to statistically significant relations between constituents. We use a paradigmatic example of word associations to show the development of our approach. Analyzing the modular structure of the obtained network, we are able to disentangle categorical relations, disambiguating words with success comparable to the best algorithms designed for the same end.
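The general recipe, keeping only co-occurrence edges that are over-represented relative to chance and then reading word communities off the resulting graph, can be illustrated roughly as follows (a crude observed/expected ratio stands in for the statistical significance test used in the paper, and the data are invented):

```python
# Illustrative sketch: significance-style filtering of co-occurrence edges
# followed by community detection on the resulting word graph (toy data only).
from itertools import combinations
from collections import Counter
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

contexts = [
    ["doctor", "nurse", "hospital"], ["nurse", "hospital", "patient"],
    ["guitar", "drums", "band"], ["band", "concert", "guitar"],
    ["doctor", "patient", "hospital"], ["drums", "concert", "band"],
]
word_freq = Counter(w for c in contexts for w in c)
pair_freq = Counter(frozenset(p) for c in contexts for p in combinations(set(c), 2))
n_contexts = len(contexts)

graph = nx.Graph()
for pair, observed in pair_freq.items():
    a, b = tuple(pair)
    expected = word_freq[a] * word_freq[b] / n_contexts
    if observed / expected > 1.0:        # crude filter standing in for a significance test
        graph.add_edge(a, b, weight=observed)

for community in greedy_modularity_communities(graph, weight="weight"):
    print(sorted(community))
```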
Article
Full-text available
Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD. In our method, the Metathesaurus is first screened to identify ambiguous terms whose possible senses consist of two or more MeSH headings. We then use each ambiguous term and its corresponding MeSH heading to extract MEDLINE citations where the term and only one of the MeSH headings co-occur. The term found in the MEDLINE citation is automatically assigned the UMLS CUI linked to the MeSH heading, so each instance has been assigned a UMLS Concept Unique Identifier (CUI). We compare the characteristics of the MSH WSD data set to the previously existing NLM WSD data set. The resulting MSH WSD data set consists of 106 ambiguous abbreviations, 88 ambiguous terms and 9 that are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE. We evaluated the reliability of the MSH WSD data set using existing knowledge-based methods and compared their performance to the results previously obtained by these algorithms on the pre-existing data set, NLM WSD. We show that the knowledge-based methods achieve different results but keep their relative performance, except for the Journal Descriptor Indexing (JDI) method, whose performance is below the other methods. The MSH WSD data set allows the evaluation of WSD algorithms in the biomedical domain. Compared to previously existing data sets, MSH WSD contains a larger number of biomedical terms/abbreviations and covers the largest set of UMLS Semantic Types. Furthermore, the MSH WSD data set has been generated automatically, reusing already existing annotations, and can therefore be regenerated from subsequent UMLS versions.
Article
Full-text available
Word sense disambiguation (WSD) algorithms attempt to select the proper sense of ambiguous terms in text. Resources like the UMLS provide a reference thesaurus to be used to annotate the biomedical literature. Statistical learning approaches have produced good results, but the size of the UMLS makes it infeasible to produce training data covering the whole domain. We present research on existing WSD approaches based on knowledge bases, which complements the studies performed on statistical learning. We compare four approaches which rely on the UMLS Metathesaurus as the source of knowledge. The first approach compares the overlap of the context of the ambiguous word to the candidate senses, based on a representation built out of the definitions, synonyms and related terms. The second approach collects training data for each of the candidate senses to perform WSD, based on queries built using monosemous synonyms and related terms. These queries are used to retrieve MEDLINE citations, and a machine learning approach is then trained on this corpus. The third approach is a graph-based method which exploits the structure of the Metathesaurus network of relations to perform unsupervised WSD. This approach ranks nodes in the graph according to their relative structural importance. The last approach uses the semantic types assigned to the concepts in the Metathesaurus to perform WSD. The context of the ambiguous word and the semantic types of the candidate concepts are mapped to Journal Descriptors, and these mappings are compared to decide among the candidate concepts. Results are provided estimating the accuracy of the different methods on the WSD test collection available from the NLM. We have found that the last approach achieves better results compared to the other methods. The graph-based approach, which uses the structure of the Metathesaurus network to estimate the relevance of the Metathesaurus concepts, does not perform well compared to the first two methods. In addition, the combination of methods improves the performance over the individual approaches. On the other hand, the performance is still below that of statistical learning trained on manually produced data and below the maximum frequency sense baseline. Finally, we propose several directions to improve the existing methods and to make the Metathesaurus more effective for WSD.
Article
Full-text available
Word Sense Disambiguation (WSD), automatically identifying the meaning of ambiguous words in context, is an important stage of text processing. This article presents a graph-based approach to WSD in the biomedical domain. The method is unsupervised and does not require any labeled training data. It makes use of knowledge from the Unified Medical Language System (UMLS) Metathesaurus which is represented as a graph. A state-of-the-art algorithm, Personalized PageRank, is used to perform WSD. When evaluated on the NLM-WSD dataset, the algorithm outperforms other methods that rely on the UMLS Metathesaurus alone. The WSD system is open source licensed and available from http://ixa2.si.ehu.es/ukb/. The UMLS, MetaMap program and NLM-WSD corpus are available from the National Library of Medicine http://www.nlm.nih.gov/research/umls/, http://mmtx.nlm.nih.gov and http://wsd.nlm.nih.gov. Software to convert the NLM-WSD corpus into a format that can be used by our WSD system is available from http://www.dcs.shef.ac.uk/∼marks/biomedical_wsd under open source license.
Book
Full-text available
This book describes the state of the art in Word Sense Disambiguation. Current algorithms and applications are presented
Article
Full-text available
Ambiguity, the phenomenon that a word has more than one sense, poses difficulties for many current Natural Language Processing (NLP) systems. Algorithms that assist in the resolution of these ambiguities, i.e. which make a word or, more generally, a text string unambiguous, will boost the performance of these systems. To test such techniques in the biomedical language domain, we have developed a Word Sense Disambiguation (WSD) test collection that comprises 5,000 unambiguous instances for 50 ambiguous UMLS Metathesaurus strings.
Article
Full-text available
Electronic medical records (EMR) constitute a valuable resource of patient specific information and are increasingly used for clinical practice and research. Acronyms present a challenge to retrieving information from the EMR because many acronyms are ambiguous with respect to their full form. In this paper we perform a comparative study of supervised acronym disambiguation in a corpus of clinical notes, using three machine learning algorithms: the naïve Bayes classifier, decision trees and Support Vector Machines (SVMs). Our training features include part-of-speech tags, unigrams and bigrams in the context of the ambiguous acronym. We find that the combination of these feature types results in consistently better accuracy than when they are used individually, regardless of the learning algorithm employed. The accuracy of all three methods when using all features consistently approaches or exceeds 90%, even when the baseline majority classifier is below 50%.
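An illustrative sketch of this kind of supervised acronym disambiguation, unigram and bigram context features fed to the three classifiers mentioned, might look as follows (tiny invented examples; real systems use annotated clinical notes and part-of-speech features as well):

```python
# Illustrative sketch of supervised acronym disambiguation with unigram+bigram
# context features and three classifiers (toy, invented examples only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

contexts = [
    "patient denies chest pain RA oxygen saturation normal",   # RA = room air
    "on RA with stable vitals no supplemental oxygen",
    "history of RA treated with methotrexate joint swelling",  # RA = rheumatoid arthritis
    "RA flare with morning stiffness in both hands",
]
labels = ["room air", "room air", "rheumatoid arthritis", "rheumatoid arthritis"]

for model in (MultinomialNB(), DecisionTreeClassifier(), LinearSVC()):
    pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), model)
    pipeline.fit(contexts, labels)
    print(type(model).__name__, pipeline.predict(["RA on exam joint swelling and stiffness"]))
```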
Article
In this paper, we present a new method based on co-occurrence graphs for performing Cross-Lingual Word Sense Disambiguation (CLWSD). The proposed approach comprises the automatic generation of bilingual dictionaries, and a new technique for the construction of a co-occurrence graph used to select the most suitable translations from the dictionary. Different algorithms that combine both the dictionary and the co-occurrence graph are then used for performing this selection of the final translations: techniques based on sub-graphs (communities) containing clusters of words with related meanings, on distances between nodes representing words, and on the relative importance of each node in the whole graph. The initial output of the system is enhanced with translation probabilities, provided by a statistical bilingual dictionary. The system is evaluated using datasets from two competitions: task 3 of SemEval 2010, and task 10 of SemEval 2013. Results obtained by the different disambiguation techniques are analysed and compared to those obtained by the systems participating in the competitions. Our system offers the best results in comparison with other unsupervised systems in most of the experiments, and even outperforms supervised systems in some cases.
Article
In this paper we investigate the role of multilingual features in improving word sense disambiguation. In particular, we explore the use of semantic clues derived from context translation to enrich the intended sense and therefore reduce ambiguity. Our experiments demonstrate up to 26% increase in disambiguation accuracy by utilizing multilingual features as compared to the monolingual baseline.
Article
This paper describes explorations in word sense disambiguation using Wikipedia as a source of sense annotations. Through experiments on four different languages, we show that the Wikipedia-based sense annotations are reliable and can be used to construct accurate sense classifiers.
Article
Named Entities (NEs) are often written with no orthographic changes across different languages that share a common alphabet. We show that this can be leveraged so as to improve named entity recognition (NER) by using unsupervised word clusters from secondary languages as features in state-of-the-art discriminative NER systems. We observe significant increases in performance, finding that person and location identification is particularly improved, and that phylogenetically close languages provide more valuable features than more distant languages.
Conference Paper
In the deep neural network (DNN), the hidden layers can be considered as increasingly complex feature transformations and the final softmax layer as a log-linear classifier making use of the most abstract features computed in the hidden layers. While the log-linear classifier should be different for different languages, the feature transformations can be shared across languages. In this paper we propose a shared-hidden-layer multilingual DNN (SHL-MDNN), in which the hidden layers are made common across many languages while the softmax layers are made language dependent. We demonstrate that the SHL-MDNN can reduce errors by 3-5%, relatively, for all the languages decodable with the SHL-MDNN, over the monolingual DNNs trained using only the language-specific data. Further, we show that the hidden layers learned by sharing across languages can be transferred to improve recognition accuracy of new languages, with relative error reductions ranging from 6% to 28% against DNNs trained without exploiting the transferred hidden layers. It is particularly interesting that the error reduction can be achieved for a target language that is in a different family from the languages used to learn the hidden layers.
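The shared-hidden-layer idea, a language-independent stack of hidden layers with one output layer per language, can be sketched as below; the layer sizes and the use of PyTorch are assumptions for illustration, not the published SHL-MDNN configuration:

```python
# Illustrative sketch of a shared-hidden-layer multilingual network: hidden
# layers are shared across languages, with one softmax head per language.
import torch
import torch.nn as nn

class SharedHiddenLayerMDNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, outputs_per_language):
        super().__init__()
        self.shared = nn.Sequential(            # language-independent feature transform
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleDict({            # language-dependent classifiers
            lang: nn.Linear(hidden_dim, n_out)
            for lang, n_out in outputs_per_language.items()
        })

    def forward(self, features, language):
        return self.heads[language](self.shared(features))

model = SharedHiddenLayerMDNN(input_dim=40, hidden_dim=256,
                              outputs_per_language={"en": 1500, "fr": 1200})
logits = model(torch.randn(8, 40), language="fr")
print(logits.shape)   # torch.Size([8, 1200])
```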
Article
Resnik and Yarowsky (1997) made a set of observations about the state-of-the-art in automatic word sense disambiguation and, motivated by those observations, offered several specific proposals regarding improved evaluation criteria, common training and testing resources, and the definition of sense inventories. Subsequent discussion of those proposals resulted in SENSEVAL, the first evaluation exercise for word sense disambiguation (Kilgarriff and Palmer 2000). This article is a revised and extended version of our 1997 workshop paper, reviewing its observations and proposals and discussing them in light of the SENSEVAL exercise. It also includes a new in-depth empirical study of translingually-based sense inventories and distance measures, using statistics collected from native-speaker annotations of 222 polysemous contexts across 12 languages. These data show that monolingual sense distinctions at most levels of granularity can be effectively captured by translations into some set of second languages, especially as language family distance increases. In addition, the probability that a given sense pair will tend to lexicalize differently across languages is shown to correlate with semantic salience and sense granularity; sense hierarchies automatically generated from such distance matrices yield results remarkably similar to those created by professional monolingual lexicographers.
Article
Access to the vast body of research literature that is now available on biomedicine and related fields can be improved with automatic summarization. This paper describes a summarization system for the biomedical domain that represents documents as graphs formed from concepts and relations in the UMLS Metathesaurus. This system has to deal with the ambiguities that occur in biomedical documents. We describe a variety of strategies that make use of MetaMap and Word Sense Disambiguation (WSD) to accurately map biomedical documents onto UMLS Metathesaurus concepts. Evaluation is carried out using a collection of 150 biomedical scientific articles from the BioMed Central corpus. We find that using WSD improves the quality of the summaries generated.
Article
The comparison of two treatments generally falls into one of the following two categories: (a) we may have a number of replications for each of the two treatments, which are unpaired, or (b) we may have a number of paired comparisons leading to a series of differences, some of which may be positive and some negative. The appropriate methods for testing the significance of the differences of the means in these two cases are described in most of the textbooks on statistical methods.
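The two settings described, unpaired replications of two treatments and paired comparisons yielding signed differences, can be illustrated with standard significance tests from scipy (invented values; these are generic tests, not necessarily the specific procedure this paper develops):

```python
# Generic example of the two settings described above: unpaired replications of
# two treatments, and paired comparisons yielding signed differences.
from scipy import stats

# (a) unpaired replications for treatments A and B
a = [12.1, 11.4, 13.0, 12.7, 11.9]
b = [10.8, 11.1, 10.5, 11.6, 10.9]
print(stats.ttest_ind(a, b))          # parametric
print(stats.mannwhitneyu(a, b))       # rank-based alternative

# (b) paired comparisons (same unit measured under both treatments)
before = [12.1, 11.4, 13.0, 12.7, 11.9]
after = [11.2, 11.0, 12.1, 12.5, 11.0]
print(stats.ttest_rel(before, after)) # parametric
print(stats.wilcoxon(before, after))  # rank-based alternative
```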
Article
The need to align investments in health research and development (R&D) with public health demands is one of the most pressing global public health challenges. We aim to provide a comprehensive description of available data sources, propose a set of indicators for monitoring the global landscape of health R&D, and present a sample of country indicators on research inputs (investments), processes (clinical trials), and outputs (publications), based on data from international databases. Total global investments in health R&D (both public and private sector) in 2009 reached US$240 billion. Of the US$214 billion invested in high-income countries, 60% of health R&D investments came from the business sector, 30% from the public sector, and about 10% from other sources (including private non-profit organisations). Only about 1% of all health R&D investments were allocated to neglected diseases in 2010. Diseases of relevance to high-income countries were investigated in clinical trials seven-to-eight-times more often than were diseases whose burden lies mainly in low-income and middle-income countries. This report confirms that substantial gaps in the global landscape of health R&D remain, especially for and in low-income and middle-income countries. Too few investments are targeted towards the health needs of these countries. Better data are needed to improve priority setting and coordination for health R&D, ultimately to ensure that resources are allocated to diseases and regions where they are needed the most. The establishment of a global observatory on health R&D, which is being discussed at WHO, could address the absence of a comprehensive and sustainable mechanism for regular global monitoring of health R&D.
Article
In this paper, we present a word sense disambiguation (WSD) based system for multilingual lexical substitution. Our method depends on having a WSD system for English and an automatic word alignment method. Crucially the approach relies on having parallel corpora. For Task 2 (Sinha et al., 2009) we apply a supervised WSD system to derive the English word senses. For Task 3 (Lefever & Hoste, 2009), we apply an unsupervised approach to the training and test data. Both of our systems that participated in Task 2 achieve a decent ranking among the participating systems. For Task 3 we achieve the highest ranking on several of the language pairs: French, German and Italian.
Article
This paper describes a new task to extract and align information networks from comparable corpora. As a case study we demonstrate the effectiveness of this task on automatically mining name translation pairs. Starting from a small set of seeds, we design a novel approach to acquire name translation pairs in a bootstrapping framework. The experimental results show this approach can generate highly accurate name translation pairs for persons, geopolitical and organization entities.
Article
This paper explores the role played by a multilingual feature representation for the task of word sense disambiguation. We translate the context of an ambiguous word in multiple languages, and show through experiments on standard datasets that by using a multilingual vector space we can obtain error rate reductions of up to 25%, as compared to a monolingual classifier.
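The basic idea, enriching the context of an ambiguous word with its translations so that the classifier operates in a multilingual feature space, can be sketched as follows (hard-coded toy translations and a simple bag-of-words classifier; not the system evaluated in the paper):

```python
# Illustrative sketch of a multilingual feature space for WSD: the context of
# the ambiguous word is concatenated with its (hypothetical) translations.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Each training instance: (English context, Spanish translation, sense label).
train = [
    ("deposit money in the bank", "depositar dinero en el banco", "bank/finance"),
    ("the bank approved the loan", "el banco aprobo el prestamo", "bank/finance"),
    ("fishing on the river bank", "pescando en la orilla del rio", "bank/river"),
    ("the bank of the stream was muddy", "la orilla del arroyo estaba embarrada", "bank/river"),
]

def multilingual_text(english, spanish):
    # Prefix tokens with a language tag so identical strings do not collide.
    return " ".join(f"en_{t}" for t in english.split()) + " " + \
           " ".join(f"es_{t}" for t in spanish.split())

X = [multilingual_text(en, es) for en, es, _ in train]
y = [label for _, _, label in train]
clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(X, y)

test = multilingual_text("sat on the bank of the river", "sentado en la orilla del rio")
print(clf.predict([test]))
```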
Article
Seventy-five years ago, Yates (1934) presented an article introducing his continuity correction to the χ² test for independence in contingency tables. The paper was also one of the first introductions to Fisher's exact test. We discuss the historical importance of Yates and his 1934 paper. The development of the exact test and continuity correction are studied in some detail. Subsequent disputes about the exact test and continuity correction are recounted. We examine the current relevance of these issues and the 1934 paper itself and attempt to ascertain its place in history.
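Both procedures discussed here are available in standard statistical libraries; for a 2x2 table, the chi-squared test with Yates's continuity correction and Fisher's exact test can be computed as follows (invented counts, shown purely for illustration):

```python
# Example of the two procedures discussed: the chi-squared test with Yates's
# continuity correction and Fisher's exact test, on an invented 2x2 table.
from scipy.stats import chi2_contingency, fisher_exact

table = [[8, 2],
         [1, 5]]

chi2, p_corrected, dof, expected = chi2_contingency(table, correction=True)
print("chi-squared with Yates correction: p =", round(p_corrected, 4))

odds_ratio, p_exact = fisher_exact(table)
print("Fisher's exact test: p =", round(p_exact, 4))
```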
Chapter
This chapter provides an overview of research to date in knowledge-based word sense disambiguation. It outlines the main knowledge-intensive methods devised so far for automatic sense tagging: 1) methods using contextual overlap with respect to dictionary definitions, 2) methods based on similarity measures computed on semantic networks, 3) selectional preferences as a means of constraining the possible meanings of words in a given context, and 4) heuristic-based methods that rely on properties of human language including the most frequent sense, one sense per discourse, and one sense per collocation.
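Method 1 in this taxonomy, contextual overlap with dictionary definitions, is the family that includes the simplified Lesk algorithm; a minimal sketch over WordNet glosses (via NLTK, assuming the WordNet data has been downloaded) is shown below:

```python
# A minimal simplified-Lesk sketch: choose the sense whose gloss and examples
# share the most words with the context. Requires nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_tokens):
    context = set(t.lower() for t in context_tokens)
    best_sense, best_overlap = None, -1
    for synset in wn.synsets(word):
        gloss = set(synset.definition().lower().split())
        for example in synset.examples():
            gloss |= set(example.lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = synset, overlap
    return best_sense

sense = simplified_lesk("bank", "he deposited money into his savings account".split())
print(sense, "-", sense.definition() if sense else "no sense found")
```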
Chapter
In this chapter, the supervised approach to word sense disambiguation is presented, which consists of automatically inducing classification models or rules from annotated examples. We start by introducing the machine learning framework for classification and some important related concepts. Then, a review of the main approaches in the literature is presented, focusing on the following issues: learning paradigms, corpora used, sense repositories, and feature representation. We also include a more detailed description of five statistical and machine learning algorithms, which are experimentally evaluated and compared on the DSO corpus. In the final part of the chapter, the current challenges of the supervised learning approach to WSD are briefly discussed.
Conference Paper
We present an unsupervised method for word sense disambiguation that exploits translation correspondences in parallel corpora. The technique takes advantage of the fact that cross-language lexicalizations of the same concept tend to be consistent, preserving some core element of its semantics, and yet also variable, reflecting differing translator preferences and the influence of context. Working with parallel corpora introduces an extra complication for evaluation, since it is difficult to find a corpus that is both sense tagged and parallel with another language; therefore we use pseudotranslations, created by machine translation systems, in order to make possible the evaluation of the approach against a standard test set. The results demonstrate that word-level translation correspondences are a valuable source of information for sense disambiguation.
Conference Paper
Previous algorithms to compute lexical chains suffer either from a lack of accuracy in word sense disambiguation (WSD) or from computational inefficiency. In this paper, we present a new linear-time algorithm for lexical chaining that adopts the assumption of one sense per discourse. Our results show an improvement over previous algorithms when evaluated on a WSD task. In this paper, we further investigate the automatic identification of lexical chains for subsequent use as an intermediate representation of text. In the next section, we propose a new algorithm that runs in linear time and adopts the assumption of one sense per discourse (Gale et al., 1992). We suggest that separating WSD from the actual chaining of words can increase the quality of chains. In the last section, we present an evaluation of the lexical chaining algorithm proposed in this paper, and compare it against (Barzilay and Elhadad, 1997; Silber and McCoy, 2003) for the task of WSD. This evaluation shows that our algorithm performs significantly better than the other two.
Article
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. We also look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want. Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google.
Conference Paper
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from 3 years ago. This paper provides an in-depth description of our large-scale web search engine - the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections, where anyone can publish anything they want.
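The ranking idea underlying PageRank can be illustrated with a short power-iteration example on a toy link graph (this is only the basic algorithm, not Google's production implementation):

```python
# Minimal power-iteration PageRank on a toy link graph.
import numpy as np

links = {                      # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = sorted(links)
n = len(pages)
idx = {p: i for i, p in enumerate(pages)}

# Column-stochastic transition matrix.
M = np.zeros((n, n))
for page, outlinks in links.items():
    for target in outlinks:
        M[idx[target], idx[page]] = 1.0 / len(outlinks)

damping = 0.85
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - damping) / n + damping * (M @ rank)

print(dict(zip(pages, np.round(rank / rank.sum(), 3))))
```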
Article
Word Sense Disambiguation (WSD), the automatic identification of the meanings of ambiguous terms in a document, is an important stage in text processing. We describe a WSD system that has been developed specifically for the types of ambiguities found in biomedical documents. This system uses a range of knowledge sources. It employs both linguistic features, such as local collocations, and features derived from domain-specific knowledge sources, the Unified Medical Language System (UMLS) and Medical Subject Headings (MeSH). This system is applied to three types of ambiguities found in Medline abstracts: ambiguous terms, abbreviations with multiple expansions and names that are ambiguous between genes. The WSD system is applied to the standard NLM-WSD data set, which consists of ambiguous terms from Medline abstracts, and was found to perform well in comparison with previously reported results. The system's performance and the contribution of each knowledge source depends upon the type of lexical ambiguity. 87.9% of the ambiguous terms are correctly disambiguated using a combination of linguistic features and MeSH terms, 99% of abbreviations are disambiguated by combining all knowledge sources, while 97.2% of ambiguous gene names are disambiguated using the MeSH terms alone. Analysis reveals that these differences are caused by the nature of each ambiguity type. These results should be taken into account when deciding which information to use for WSD and the level of performance that can be expected.
Article
Researchers have access to a vast amount of information stored in textual documents and there is a pressing need for the development of automated methods to enable and improve access to this resource. Lexical ambiguity, the phenomena in which a word or phrase has more than one possible meaning, presents a significant obstacle to automated text processing. Word Sense Disambiguation (WSD) is a technology that resolves these ambiguities automatically and is an important stage in text understanding. The most accurate approaches to WSD rely on manually labeled examples but this is usually not available and is prohibitively expensive to create. This paper offers a solution to that problem by using information in the UMLS Metathesaurus to automatically generate labeled examples. Two approaches are presented. The first is an extension of existing work (Liu et al., 2002 [1]) and the second a novel approach that exploits information in the UMLS that has not been used for this purpose. The automatically generated examples are evaluated by comparing them against the manually labeled ones in the NLM-WSD data set and are found to outperform the baseline. The examples generated using the novel approach produce an improvement in WSD performance when combined with manually labeled examples.
Article
An experiment was performed at the National Library of Medicine® (NLM®) in word sense disambiguation (WSD) using the Journal Descriptor Indexing (JDI) methodology. The motivation is the need to solve the ambiguity problem confronting NLM's MetaMap system, which maps free text to terms corresponding to concepts in NLM's Unified Medical Language System® (UMLS®) Metathesaurus®. If the text maps to more than one Metathesaurus concept at the same high confidence score, MetaMap has no way of knowing which concept is the correct mapping. We describe the JDI methodology, which is ultimately based on statistical associations between words in a training set of MEDLINE® citations and a small set of journal descriptors (assigned by humans to journals per se) assumed to be inherited by the citations. JDI is the basis for selecting the best meaning that is correlated to UMLS semantic types (STs) assigned to ambiguous concepts in the Metathesaurus. For example, the ambiguity transport has two meanings: "Biological Transport" assigned the ST Cell Function and "Patient transport" assigned the ST Health Care Activity. A JDI-based methodology can analyze text containing transport and determine which ST receives a higher score for that text, which then returns the associated meaning, presumed to apply to the ambiguity itself. We then present an experiment in which a baseline disambiguation method was compared to four versions of JDI in disambiguating 45 ambiguous strings from NLM's WSD Test Collection. Overall average precision for the highest-scoring JDI version was 0.7873 compared to 0.2492 for the baseline method, and average precision for individual ambiguities was greater than 0.90 for 23 of them (51%), greater than 0.85 for 24 (53%), and greater than 0.65 for 35 (79%). On the basis of these results, we hope to improve performance of JDI and test its use in applications.
In 1986, the National Library of Medicine began a long-term research and development project to build the Unified Medical Language System® (UMLS®). The purpose of the UMLS is to improve the ability of computer programs to “understand” the biomedical meaning in user inquiries and to use this understanding to retrieve and integrate relevant machine-readable information for users. Underlying the UMLS effort is the assumption that timely access to accurate and up-to-date information will improve decision making and ultimately the quality of patient care and research. The development of the UMLS is a distributed national experiment with a strong element of international collaboration. The general strategy is to develop UMLS components through a series of successive approximations of the capabilities ultimately desired. Three experimental Knowledge Sources, the Metathesaurus®, the Semantic Network, and the Information Sources Map have been developed and are distributed annually to interested researchers, many of whom have tested and evaluated them in a range of applications. The UMLS project and current developments in high-speed, high-capacity international networks are converging in ways that have great potential for enhancing access to biomedical information.
Article
In 1986, the National Library of Medicine (NLM) assembled a large multidisciplinary, multisite team to work on the Unified Medical Language System (UMLS), a collaborative research project aimed at reducing fundamental barriers to the application of computers to medicine. Beyond its tangible products, the UMLS Knowledge Sources, and its influence on the field of informatics, the UMLS project is an interesting case study in collaborative research and development. It illustrates the strengths and challenges of substantive collaboration among widely distributed research groups. Over the past decade, advances in computing and communications have minimized the technical difficulties associated with UMLS collaboration and also facilitated the development, dissemination, and use of the UMLS Knowledge Sources. The spread of the World Wide Web has increased the visibility of the information access problems caused by multiple vocabularies and many information sources which are the focus of UMLS work. The time is propitious for building on UMLS accomplishments and making more progress on the informatics research issues first highlighted by the UMLS project more than 10 years ago.
Article
The UMLS Metathesaurus, the largest thesaurus in the biomedical domain, provides a representation of biomedical knowledge consisting of concepts classified by semantic type and both hierarchical and non-hierarchical relationships among the concepts. This knowledge has proved useful for many applications including decision support systems, management of patient records, information retrieval (IR) and data mining. Gaining effective access to the knowledge is critical to the success of these applications. This paper describes MetaMap, a program developed at the National Library of Medicine (NLM) to map biomedical text to the Metathesaurus or, equivalently, to discover Metathesaurus concepts referred to in text. MetaMap uses a knowledge intensive approach based on symbolic, natural language processing (NLP) and computational linguistic techniques. Besides being applied for both IR and data mining applications, MetaMap is one of the foundations of NLM's Indexing Initiative System which is being applied to both semi-automatic and fully automatic indexing of the biomedical literature at the library.
Article
There is a trend towards automatic analysis of large amounts of literature in the biomedical domain. However, this can be effective only if the ambiguity in natural language is resolved. In this paper, the current state of research in word sense disambiguation (WSD) is reviewed. Several methods for WSD have already been proposed, but many systems have been tested only on evaluation sets of limited size. There are currently only very few applications of WSD in the biomedical domain. The current direction of research points towards statistically based algorithms that use existing curated data and can be applied to large sets of biomedical literature. There is a need for manually tagged evaluation sets to test WSD algorithms in the biomedical domain. WSD algorithms should preferably be able to take into account both known and unknown senses of a word. Without WSD, automatic meta-analysis of large corpora of text will be error prone.
Article
Biological literature contains many abbreviations with one particular sense in each document. However, most abbreviations do not have a unique sense across the literature. Furthermore, many documents do not contain the long forms of the abbreviations. Resolving an abbreviation in a document consists of retrieving its sense in use. Abbreviation resolution improves accuracy of document retrieval engines and of information extraction systems. We combine an automatic analysis of Medline abstracts and linguistic methods to build a dictionary of abbreviation/sense pairs. The dictionary is used for the resolution of abbreviations occurring with their long forms. Ambiguous global abbreviations are resolved using support vector machines that have been trained on the context of each instance of the abbreviation/sense pairs, previously extracted for the dictionary set-up. The system disambiguates abbreviations with a precision of 98.9% for a recall of 98.2% (98.5% accuracy). This performance is superior in comparison with previously reported research work. The abbreviation resolution module is available at http://www.ebi.ac.uk/Rebholz/software.html.