ABSTRACT: BACKGROUND: When new concepts are inserted into the UMLS, they are assigned one or several semantic types from the UMLS Semantic Network by the UMLS editors. However, not every combination of semantic types is permissible. It was observed that many concepts with rare combinations of semantic types have erroneous semantic type assignments or prohibited combinations of semantic types. The correction of such errors is resource-intensive. OBJECTIVE: We design a computational system to inform UMLS editors whether a specific combination of two, three, four, or five semantic types is permissible, prohibited, or questionable. METHODS: We identify a set of inclusion and exclusion instructions in the UMLS Semantic Network documentation and derive corresponding rule-categories as well as rule-categories from the UMLS concept content. We then design an algorithm adviseEditor based on these rule-categories. The algorithm specifies rules that tell an editor how to proceed when considering a tuple (pair, triple, quadruple, quintuple) of semantic types to be assigned to a concept. RESULTS: Eight rule-categories were identified. A Web-based system was developed to implement the adviseEditor algorithm, which returns for an input combination of semantic types whether it is permitted, prohibited or (in a few cases) requires more research. The numbers of semantic type pairs assigned to each rule-category are reported. Interesting examples for each rule-category are illustrated. Cases of semantic type assignments that contradict rules are listed, including recently introduced ones. CONCLUSION: The adviseEditor system implements explicit and implicit knowledge available in the UMLS in a system that informs UMLS editors about the permissibility of a desired combination of semantic types. Using adviseEditor might help accelerate the work of the UMLS editors and prevent erroneous semantic type assignments.
Journal of Biomedical Informatics 10/2012; · 2.13 Impact Factor
ABSTRACT: Auditing healthcare terminologies for errors requires human experts. In this paper, we present a study of the performance of auditors looking for errors in the semantic type assignments of complex UMLS concepts. In this study, concepts are considered complex whenever they are assigned combinations of semantic types. Past research has shown that complex concepts have a higher likelihood of errors. The results of this study indicate that individual auditors are not reliable when auditing such concepts and their performance is low, according to various metrics. These results confirm the outcomes of an earlier pilot study. They imply that to achieve an acceptable level of reliability and performance, when auditing such concepts of the UMLS, several auditors need to be assigned the same task. A mechanism is then needed to combine the possibly differing opinions of the different auditors into a final determination. In the current study, in contrast to our previous work, we used a majority mechanism for this purpose. For a sample of 232 complex UMLS concepts, the majority opinion was found reliable and its performance for accuracy, recall, precision and the F-measure was found statistically significantly higher than the average performance of individual auditors.
Journal of Biomedical Informatics 06/2012; · 2.13 Impact Factor
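The majority mechanism and the four metrics named in the abstract above can be illustrated with a minimal sketch. The function names, the strict-majority tie rule, and the vote data are assumptions made for illustration; they are not taken from the study.

```python
def majority_opinion(votes):
    """Return True if a strict majority of auditors flagged an error.

    With an even number of auditors, a tie counts as "no error"
    (an assumption of this sketch, not a rule from the study).
    """
    return 2 * sum(votes) > len(votes)


def evaluate(predicted, gold):
    """Compute accuracy, precision, recall and F-measure for boolean labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up votes: one row per concept, one boolean per auditor
# (True means "this semantic type assignment is erroneous").
auditor_votes = [
    [True, True, False],
    [False, False, True],
    [True, False, True],
    [False, False, False],
]
# Made-up reference standard for the same four concepts.
gold_standard = [True, False, False, False]

combined = [majority_opinion(v) for v in auditor_votes]
scores = evaluate(combined, gold_standard)
```

The same `evaluate` function can be applied to each auditor's individual column of votes, which allows the majority opinion to be compared with the average individual performance.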
ABSTRACT: An abstraction network is an auxiliary network of nodes and links that provides a compact, high-level view of an ontology. Such a view lends support to ontology orientation, comprehension, and quality-assurance efforts. A methodology is presented for deriving a kind of abstraction network, called a partial-area taxonomy, for the Ontology of Clinical Research (OCRe). OCRe was selected as a representative of ontologies implemented using the Web Ontology Language (OWL) based on shared domains. The derivation of the partial-area taxonomy for the Entity hierarchy of OCRe is described. Utilizing the visualization of the content and structure of the hierarchy provided by the taxonomy, the Entity hierarchy is audited, and several errors and inconsistencies in OCRe's modeling of its domain are exposed. After appropriate corrections are made to OCRe, a new partial-area taxonomy is derived. The generalizability of the paradigm of the derivation methodology to various families of biomedical ontologies is discussed.
ABSTRACT: Medical terminologies are large and complex. Frequently, errors are hidden in this complexity. Our objective is to find such errors, which can be aided by deriving abstraction networks from a large terminology. Abstraction networks preserve important features but eliminate many minor details, which are often not useful for identifying errors. Providing visualizations for such abstraction networks aids auditors by allowing them to quickly focus on elements of interest within a terminology. Previously we introduced area taxonomies and partial area taxonomies for SNOMED CT. In this paper, two advanced, novel kinds of abstraction networks, the relationship-constrained partial area subtaxonomy and the root-constrained partial area subtaxonomy are defined and their benefits are demonstrated. We also describe BLUSNO, an innovative software tool for quickly generating and visualizing these SNOMED CT abstraction networks. BLUSNO is a dynamic, interactive system that provides quick access to well organized information about SNOMED CT.
ABSTRACT: This paper strives to overcome a major problem encountered by a previous expansion methodology for discovering concepts highly likely to be missing a specific semantic type assignment in the UMLS. This methodology is the basis for an algorithm that presents the discovered concepts to a human auditor for review and possible correction. We analyzed the problem of the previous expansion methodology and discovered that it was due to an obstacle constituted by one or more concepts assigned the UMLS Semantic Network semantic type Classification. A new methodology was designed that bypasses such an obstacle without a combinatorial explosion in the number of concepts presented to the human auditor for review. The new expansion methodology with obstacle avoidance was tested with the semantic type Experimental Model of Disease and found over 500 concepts missed by the previous methodology that are in need of this semantic type assignment. Furthermore, other semantic types suffering from the same major problem were discovered, indicating that the methodology is of more general applicability. The algorithmic discovery of concepts that are likely missing a semantic type assignment is possible even in the face of obstacles, without an explosion in the number of processed concepts.
Journal of Biomedical Informatics 09/2011; 45(1):61-70. · 2.13 Impact Factor
ABSTRACT: An algorithmically-derived abstraction network, called the partial-area taxonomy, for a SNOMED hierarchy has led to the identification of concepts considered complex. The designation "complex" is arrived at automatically on the basis of structural analyses of overlap among the constituent concept groups of the partial-area taxonomy. Such complex concepts, called overlapping concepts, constitute a tangled portion of a hierarchy and can be obstacles to users trying to gain an understanding of the hierarchy's content. A new methodology for partitioning the entire collection of overlapping concepts into singly-rooted groups, that are more manageable to work with and comprehend, is presented. Different kinds of overlapping concepts with varying degrees of complexity are identified. This leads to an abstract model of the overlapping concepts called the disjoint partial-area taxonomy, which serves as a vehicle for enhanced, high-level display. The methodology is demonstrated with an application to SNOMED's Specimen hierarchy. Overall, the resulting disjoint partial-area taxonomy offers a refined view of the hierarchy's structural organization and conceptual content that can aid users, such as maintenance personnel, working with SNOMED. The utility of the disjoint partial-area taxonomy as the basis for a SNOMED auditing regimen is presented in a companion paper.
Journal of Biomedical Informatics 08/2011; 45(1):15-29. · 2.13 Impact Factor
ABSTRACT: Little information exists concerning SNOMED CT (systematized nomenclature of medicine-clinical terms) users. This report describes current impressions and preferences of direct SNOMED CT users regarding coverage, quality, and concept details, and the change request mechanism.
A 43-question anonymous survey distributed electronically to relevant online communities.
Data on user demographic characteristics, modes and purposes of use, means and frequencies of access, satisfaction with SNOMED CT content coverage and quality and with the change request mechanism were recorded.
The survey was conducted in January 2010 and elicited 215 responses. Details regarding users' profiles, modes of use and access were reported elsewhere. The coverage of SNOMED CT was perceived to be at least 85% complete by 42% of responders, and 60% were at least satisfied with its quality. Various deficiencies were encountered at least 'somewhat often' by 28-61% of responders. Incorrect data were more bothersome than missing data. Users indicated that significant resources should be allocated to more consistent and complete conceptual representations and to further enhance content coverage. Enhanced synonym coverage and the introduction of textual definitions were important to users (54% and 63%, respectively).
A survey format with limited control over recruitment and selection bias. Lack of information regarding the SNOMED CT version used by responders.
Despite overall satisfaction, direct users indicated a strong desire to improve consistency, quality, and completeness of conceptual representations and concept details, as well as a continued desire to expand coverage. The survey provides much needed data for informed decisions regarding the use and development goals of SNOMED CT. Focused periodical surveys are warranted.
Journal of the American Medical Informatics Association 08/2011; 18 Suppl 1:i36-44. · 3.57 Impact Factor
ABSTRACT: A search engine user with a well-defined information need is not interested in getting thousands of hits, but a few hits that are all highly relevant to their search. Often search words need to be refined and augmented to narrow results to more relevant pages. However, an overly specific query may lead to no hits at all, while most typical queries lead to thousands or even millions of them, both undesirable outcomes. This paper suggests a query rewriting method for generating alternative query strings and proposes a hit count prediction model for predicting the number of search engine hits for each alternative query string, based on the English language frequencies of the words in the search terms. Using the hit count prediction model, different types of search strategies, such as a lowest hit count query preference, can be utilized to improve users’ search experience. We present an evaluation experiment of the hit count prediction model for three major search engines. We also discuss and quantify how far the Google, Yahoo! and Bing search engines diverge from monotonic behaviour, considering negative and positive search terms separately.
Journal of Information Science 01/2011; 37:462-475. · 1.24 Impact Factor
ABSTRACT: The terms that users pass to search engines are often ambiguous, by referring to homonyms. The search results in these cases are a mixture of links to documents that refer to different meanings of the search terms. Current search engines provide suggested query completion terms as a dropdown list. However, such lists are not well organized, mixing completions for different meanings. In addition, the suggested search phrases are not discriminating enough. We propose an approach to suggesting well organized and visually separated search completions based on ontologies. In addition, our approach supports the use of negative terms to disambiguate the suggested completions in the list. We present an algorithm to generate the suggested search completion terms. We have developed musician and basketball player ontologies and an Ontology-Supported Web Search (OSWS) System for “famous people” that generates the suggested search term completions, based on these ontologies.
Current Trends in Web Engineering - 10th International Conference on Web Engineering, ICWE 2010 Workshops, Vienna, Austria, July 2010, Revised Selected Papers; 01/2010
ABSTRACT: The UMLS contains terms from many sources. Every update of a source requires reintegration. Each new term needs to be assigned to a preexisting UMLS concept, or a new concept must be created. Whenever the integration process unnecessarily creates a new concept, this is undesirable. We report on a method to detect such undesirable duplicate concepts. Terms are removed from the UMLS and reintegrated using "piecewise synonym generation." The concept of the reintegrated term is programmatically compared to the initial concept of the term (before removal). If they are different, this indicates an error, either in the integration process or in the initial concept. Thus, such a term-concept pair is deemed suspicious. A study of five hierarchies of the SNOMED found 7.7% suspicious matches. A human expert needs to evaluate the correctness of suspicious concepts. In a sample of 149 of those, 19% of concepts were found to be duplicates.
ABSTRACT: Keyword-based search engines often return an unexpected number of results. Zero hits are naturally undesirable, while too many hits are likely to be overwhelming and of low precision. We present an approach for predicting the number of hits for a given set of query terms. Using word frequencies derived from a large corpus, we construct random samples of combinations of these words as search terms. Then we derive a correlation function between the computed probabilities of search terms and the observed hit counts for them. This regression function is used to predict the hit counts for a user's new searches, with the intention of avoiding information overload. We report the results of experiments with Google, Yahoo! and Bing to validate our methodology. We further investigate the monotonicity of search results for negative search terms by those three search engines.
2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010, Toronto, Canada, August 31 - September 3, 2010, Main Conference Proceedings; 01/2010
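The hit-count prediction idea in the abstract above (word frequencies, computed query probabilities, a regression against observed hit counts) can be sketched as follows. The abstract does not state the form of the regression, so the log-log linear fit, the independence assumption used to multiply word frequencies, the function names, and all numbers below are assumptions chosen only for illustration.

```python
import math


def query_probability(words, freq):
    """Probability of a query, assuming its words occur independently."""
    p = 1.0
    for w in words:
        p *= freq[w]
    return p


def fit_log_linear(probs, hits):
    """Least-squares fit of log(hits) = intercept + slope * log(prob)."""
    xs = [math.log(p) for p in probs]
    ys = [math.log(h) for h in hits]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_hits(words, freq, slope, intercept):
    """Predict the hit count for a new query from the fitted line."""
    return math.exp(intercept + slope * math.log(query_probability(words, freq)))


# Made-up relative word frequencies (not from any real corpus).
freq = {"jazz": 1e-4, "piano": 2e-4, "trio": 5e-5}

# Made-up training sample: hit counts are synthetic (1e10 * probability),
# not observations from any search engine.
training_queries = [["jazz"], ["piano"], ["jazz", "piano"]]
probs = [query_probability(q, freq) for q in training_queries]
hits = [1e10 * p for p in probs]

slope, intercept = fit_log_linear(probs, hits)
predicted = predict_hits(["jazz", "trio"], freq, slope, intercept)
```

With a fitted model of this kind, alternative query strings could be ranked by predicted hit count before any of them is sent to a search engine.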
ABSTRACT: SNOMED CT is gaining momentum and endorsements as an international clinical terminology. However, many vendors await a clearer business case and clients' demand. We conducted a survey of direct users of SNOMED CT to determine the current profile of users, modes of use, and attitudes towards different aspects of the terminology. A web-based survey consisting of 43 questions was distributed in January 2010, and 215 responses were elicited. This paper summarizes findings regarding profiles of users and their SNOMED CT use. The results indicate significant use by non-researchers and by industry and government sectors. Many users are relative newcomers with less than 3 years of experience with SNOMED CT, and production-related use was reported by 39% of respondents. Most users are satisfied with the level of content coverage. The results indicate that SNOMED CT has a solid footing in production systems, and that SCT is mostly used for concept searches and clinical coding.
ABSTRACT: The UMLS's integration of more than 100 source vocabularies, not necessarily consistent with one another, causes some inconsistencies. The purpose of auditing the UMLS is to detect such inconsistencies and to suggest how to resolve them while observing the requirement of fully representing the content of each source in the UMLS. A software tool, called the Neighborhood Auditing Tool (NAT), that facilitates UMLS auditing is presented. The NAT supports "neighborhood-based" auditing, where, at any given time, an auditor concentrates on a single focus concept and one of a variety of neighborhoods of its closely related concepts. Typical diagrammatic displays of concept networks have a number of shortcomings, so the NAT utilizes a hybrid diagram/text interface that features stylized neighborhood views which retain some of the best features of both the diagrammatic layouts and text windows while avoiding the shortcomings. The NAT allows an auditor to display knowledge from both the Metathesaurus (concept) level and the Semantic Network (semantic type) level. Various additional features of the NAT that support the auditing process are described. The usefulness of the NAT is demonstrated through a group of case studies. Its impact is tested with a study involving a select group of auditors.
Journal of Biomedical Informatics 03/2009; · 2.13 Impact Factor
ABSTRACT: Synonym-substitution algorithms have been developed for the purpose of matching source vocabulary terms with existing Unified Medical Language System (UMLS) terms during the integration process. A drawback is the possible explosion in the number of newly generated (potential) synonyms, which can tax computational and expert review resources. Experiments are run using a synonym-substitution approach based on WordNet to see how constraining two methodological parameters, namely, "maximum number of substitutions per term" and "maximum term length," affects performance. Our hypothesis is that these values can be constrained rather tightly, thus greatly speeding up the methodology, without a marked decline in the additional matches produced. Furthermore, we investigate whether a limitation on only the first of the two parameters is sufficient to achieve the same results.
A four-stage synonym-substitution methodology using WordNet is presented. A group of experiments is carried out in which the two methodological parameters "maximum number of substitutions per term" and "maximum term length" are varied. The purpose is to examine their effect on the growth in the number of potential synonyms generated and the associated loss of results. The experiments are based on the re-integration of the "Minimal Standard Terminology" (MST) into the UMLS. Synonym-substitution matches found to be inconsistent with the current content of the UMLS and thus deemed to be incorrect are further manually scrutinized as an audit of the original integration of the MST.
An increase of 11% in the number of "MST term/UMLS term" matches was achieved using the synonym-substitution methodology. Importantly, this result prevailed when tight threshold values (such as a maximum of two synonym substitutions per term) were imposed on the parameters. Furthermore, it was found that limiting only the "maximum number of substitutions per term" parameter was sufficient to obtain the performance enhancement. During the additional audit phase, a number of the reported mismatches were actually seen to be correct, representing an additional 10% increase in the number of matches obtained.
A synonym-substitution methodology that utilizes WordNet is a useful automated aid in UMLS source integration. Experiments showed that there was a significant speed-up but no degradation in match results when the methodology's "maximum number of substitutions per term" parameter was relatively tightly constrained. The methodology also helped to discover errors in the MST's original integration, and improve the quality of the UMLS's conceptual content.
Artificial intelligence in medicine 01/2009; 46(2):97-109. · 1.65 Impact Factor
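The effect of the two parameters discussed in the abstract above, "maximum number of substitutions per term" and "maximum term length," can be illustrated with a small sketch. It is not the four-stage methodology of the paper: a hand-made synonym dictionary stands in for WordNet, and the function name, the parameter defaults, and the example terms are all assumptions.

```python
from itertools import combinations, product


def generate_variants(term, synonyms, max_substitutions=2, max_term_length=5):
    """Generate potential synonyms of `term` by replacing words with synonyms.

    At most `max_substitutions` words are replaced in any one variant, and
    terms longer than `max_term_length` words are skipped entirely.
    """
    words = term.lower().split()
    if len(words) > max_term_length:
        return set()
    # Positions of words for which the dictionary offers a synonym.
    positions = [i for i, w in enumerate(words) if synonyms.get(w)]
    variants = set()
    for k in range(1, min(max_substitutions, len(positions)) + 1):
        for chosen in combinations(positions, k):
            # Every combination of replacements for the chosen positions.
            for replacement in product(*(synonyms[words[i]] for i in chosen)):
                new_words = list(words)
                for i, r in zip(chosen, replacement):
                    new_words[i] = r
                variants.add(" ".join(new_words))
    return variants


# Hypothetical synonym dictionary standing in for WordNet lookups.
toy_synonyms = {
    "stomach": ["gastric"],
    "bleeding": ["hemorrhage", "haemorrhage"],
}

one_sub = generate_variants("stomach bleeding", toy_synonyms, max_substitutions=1)
two_subs = generate_variants("stomach bleeding", toy_synonyms, max_substitutions=2)
```

Even in this two-word example, raising the substitution limit from one to two increases the number of generated variants from three to five, which hints at why tight limits keep the number of potential synonyms manageable for longer terms.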
ABSTRACT: The goal of this paper is to audit null-annotated parent-child pairs in the UMLS Metathesaurus. We have developed techniques for identifying suspicious pairs with high likelihood of errors by using inconsistencies between the hierarchical relationships of the Metathesaurus and the Semantic Network. Two formal conditions, called semantic inversion and lack of ancestry, are investigated. Analyzing two corresponding samples shows that semantic inversion is significantly more likely to indicate an error than lack of ancestry, which in turn is more likely to indicate errors than a consistent configuration. We also discuss cases of parent-child pairs with semantic inversion that may be corrected by disambiguating the child.
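The abstract above does not define the two conditions, so the following sketch rests on an assumed reading: a parent-child pair is taken to be consistent when the parent's semantic type equals or is an ancestor of the child's semantic type, a semantic inversion when the child's semantic type is instead an ancestor of the parent's, and a lack of ancestry when neither holds. Each concept is given a single semantic type for simplicity, and the function names and the small hierarchy excerpt are illustrative only.

```python
def ancestors(sem_type, parent_of):
    """Collect all ancestors of a semantic type in a single-parent hierarchy."""
    result = set()
    while sem_type in parent_of:
        sem_type = parent_of[sem_type]
        result.add(sem_type)
    return result


def classify_pair(parent_type, child_type, parent_of):
    """Classify a Metathesaurus parent-child pair by its semantic types.

    The three outcomes follow the assumed definitions stated above,
    not definitions quoted from the paper.
    """
    if parent_type == child_type or parent_type in ancestors(child_type, parent_of):
        return "consistent"
    if child_type in ancestors(parent_type, parent_of):
        return "semantic inversion"
    return "lack of ancestry"


# Small illustrative excerpt of an is-a hierarchy of semantic types.
parent_of = {
    "Neoplastic Process": "Disease or Syndrome",
    "Disease or Syndrome": "Pathologic Function",
    "Pathologic Function": "Biologic Function",
    "Finding": "Conceptual Entity",
}

ok = classify_pair("Disease or Syndrome", "Neoplastic Process", parent_of)
inverted = classify_pair("Neoplastic Process", "Disease or Syndrome", parent_of)
unrelated = classify_pair("Finding", "Disease or Syndrome", parent_of)
```

Pairs classified as anything other than "consistent" would be the suspicious ones handed to a human auditor for review.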