ABSTRACT: Abstraction networks are compact summarizations of terminologies used to support orientation and terminology quality assurance (TQA). Area taxonomies and partial-area taxonomies are abstraction networks that have been successfully employed in support of TQA of small SNOMED CT hierarchies. However, nearly half of SNOMED CT's concepts are in the large Procedure and Clinical Finding hierarchies. Abstraction network derivation methodologies applied to those hierarchies resulted in taxonomies that were too large to effectively support TQA. A methodology for deriving sub-taxonomies from large taxonomies is presented, and the resultant smaller abstraction networks are shown to facilitate TQA, allowing for the scaling of our taxonomy-based TQA regimen to large hierarchies. Specifically, sub-taxonomies are derived for the Procedure hierarchy and a review for errors and inconsistencies is performed. Concepts are divided into groups within the sub-taxonomy framework, and it is shown that small groups are statistically more likely to harbor erroneous and inconsistent concepts than large groups.
ABSTRACT: BioPortal contains over 300 ontologies, for which quality assurance (QA) is critical. Abstraction networks (ANs), compact summarizations of ontology structure and content, have been used in such QA efforts, typically in a "one-off" manner for a single ontology. Ontologies can be characterized, independently of knowledge-content focus, from a structural standpoint, leading to the formulation of ontology families. A family is defined as a set of ontologies satisfying some overarching condition regarding their structural features. Seven such families, comprising 186 ontologies, are identified. To increase efficiency, a new family-based QA framework is introduced in which an automated, uniform AN derivation technique and accompanying semi-automated, uniform QA regimen are applicable to the ontologies of a given family. Specifically, across an entire family, the QA efforts exploit family-wide AN features in the characterization of sets of classes that are more likely to harbor errors. The approach is demonstrated on the Cancer Chemoprevention BioPortal ontology.
ABSTRACT: BACKGROUND: When new concepts are inserted into the UMLS, they are assigned one or several semantic types from the UMLS Semantic Network by the UMLS editors. However, not every combination of semantic types is permissible. It was observed that many concepts with rare combinations of semantic types have erroneous semantic type assignments or prohibited combinations of semantic types. The correction of such errors is resource-intensive. OBJECTIVE: We design a computational system to inform UMLS editors as to whether a specific combination of two, three, four, or five semantic types is permissible, prohibited, or questionable. METHODS: We identify a set of inclusion and exclusion instructions in the UMLS Semantic Network documentation and derive corresponding rule-categories as well as rule-categories from the UMLS concept content. We then design an algorithm adviseEditor based on these rule-categories. The algorithm specifies rules that tell an editor how to proceed when considering a tuple (pair, triple, quadruple, quintuple) of semantic types to be assigned to a concept. RESULTS: Eight rule-categories were identified. A Web-based system was developed to implement the adviseEditor algorithm, which returns for an input combination of semantic types whether it is permitted, prohibited or (in a few cases) requires more research. The numbers of semantic type pairs assigned to each rule-category are reported. Interesting examples for each rule-category are presented. Cases of semantic type assignments that contradict rules are listed, including recently introduced ones. CONCLUSION: The adviseEditor system implements explicit and implicit knowledge available in the UMLS in a system that informs UMLS editors about the permissibility of a desired combination of semantic types. Using adviseEditor might help accelerate the work of the UMLS editors and prevent erroneous semantic type assignments.
Journal of Biomedical Informatics 10/2012; · 2.13 Impact Factor
ABSTRACT: Auditing healthcare terminologies for errors requires human experts. In this paper, we present a study of the performance of auditors looking for errors in the semantic type assignments of complex UMLS concepts. In this study, concepts are considered complex whenever they are assigned combinations of semantic types. Past research has shown that complex concepts have a higher likelihood of errors. The results of this study indicate that individual auditors are not reliable when auditing such concepts and their performance is low, according to various metrics. These results confirm the outcomes of an earlier pilot study. They imply that to achieve an acceptable level of reliability and performance, when auditing such concepts of the UMLS, several auditors need to be assigned the same task. A mechanism is then needed to combine the possibly differing opinions of the different auditors into a final determination. In the current study, in contrast to our previous work, we used a majority mechanism for this purpose. For a sample of 232 complex UMLS concepts, the majority opinion was found reliable and its performance for accuracy, recall, precision and the F-measure was found statistically significantly higher than the average performance of individual auditors.
Journal of Biomedical Informatics 06/2012; · 2.13 Impact Factor
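The majority mechanism mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each auditor gives a binary judgment per concept (True meaning "the semantic type assignment is erroneous") and that a reference standard is available; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter

def majority_vote(judgments):
    """Return the label chosen by more than half of the auditors, else None."""
    label, count = Counter(judgments).most_common(1)[0]
    return label if count > len(judgments) / 2 else None

def precision_recall_f(predicted, reference):
    """Precision, recall and F-measure for binary 'error' flags."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)        # flagged and truly erroneous
    fp = sum(1 for p, r in pairs if p and not r)    # flagged but correct
    fn = sum(1 for p, r in pairs if r and not p)    # erroneous but not flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Toy data (hypothetical): three auditors, four concepts.
auditors = [
    [True, False, True, False],
    [True, True, False, False],
    [False, True, True, False],
]
reference = [True, True, False, False]

# Combine the auditors' opinions concept by concept.
consensus = [majority_vote(column) for column in zip(*auditors)]
precision, recall, f_measure = precision_recall_f(consensus, reference)
```

With an odd number of auditors and binary labels a strict majority always exists; for an even panel the sketch returns None on ties, which a real study would need to resolve by some other rule.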
ABSTRACT: An abstraction network is an auxiliary network of nodes and links that provides a compact, high-level view of an ontology. Such a view lends support to ontology orientation, comprehension, and quality-assurance efforts. A methodology is presented for deriving a kind of abstraction network, called a partial-area taxonomy, for the Ontology of Clinical Research (OCRe). OCRe was selected as a representative of ontologies implemented using the Web Ontology Language (OWL) based on shared domains. The derivation of the partial-area taxonomy for the Entity hierarchy of OCRe is described. Utilizing the visualization of the content and structure of the hierarchy provided by the taxonomy, the Entity hierarchy is audited, and several errors and inconsistencies in OCRe's modeling of its domain are exposed. After appropriate corrections are made to OCRe, a new partial-area taxonomy is derived. The generalizability of the paradigm of the derivation methodology to various families of biomedical ontologies is discussed.
ABSTRACT: Medical terminologies are large and complex. Frequently, errors are hidden in this complexity. Our objective is to find such errors, which can be aided by deriving abstraction networks from a large terminology. Abstraction networks preserve important features but eliminate many minor details, which are often not useful for identifying errors. Providing visualizations for such abstraction networks aids auditors by allowing them to quickly focus on elements of interest within a terminology. Previously we introduced area taxonomies and partial area taxonomies for SNOMED CT. In this paper, two advanced, novel kinds of abstraction networks, the relationship-constrained partial area subtaxonomy and the root-constrained partial area subtaxonomy are defined and their benefits are demonstrated. We also describe BLUSNO, an innovative software tool for quickly generating and visualizing these SNOMED CT abstraction networks. BLUSNO is a dynamic, interactive system that provides quick access to well organized information about SNOMED CT.
ABSTRACT: The Unified Medical Language System (UMLS) is a large and complex controlled medical terminology managed by the National Library of Medicine. The UMLS combines about 160 different terminologies into the Metathesaurus containing over 2.4 million concepts linked by about 50 million relationships. The process of merging 160 terminologies with often very different structures is a complex and error-prone task. Auditing terminologies for quality assurance is an important phase of any terminology's life cycle. Previously, we developed the Neighborhood Auditing Tool (NAT), a concept-centric browser and auditing tool for the UMLS. In this paper we discuss the "Relationship, Audit Set Builder and Concept Neighborhood Auditing Tool" (RAC-NAT), an expansion of the NAT into a multi-faceted browsing and auditing tool.
Journal of Integrated Design and Process Science 12/2011; 15(4):3-25.
ABSTRACT: This paper strives to overcome a major problem encountered by a previous expansion methodology for discovering concepts highly likely to be missing a specific semantic type assignment in the UMLS. This methodology is the basis for an algorithm that presents the discovered concepts to a human auditor for review and possible correction. We analyzed the problem of the previous expansion methodology and discovered that it was due to an obstacle constituted by one or more concepts assigned the UMLS Semantic Network semantic type Classification. A new methodology was designed that bypasses such an obstacle without a combinatorial explosion in the number of concepts presented to the human auditor for review. The new expansion methodology with obstacle avoidance was tested with the semantic type Experimental Model of Disease and found over 500 concepts missed by the previous methodology that are in need of this semantic type assignment. Furthermore, other semantic types suffering from the same major problem were discovered, indicating that the methodology is of more general applicability. The algorithmic discovery of concepts that are likely missing a semantic type assignment is possible even in the face of obstacles, without an explosion in the number of processed concepts.
Journal of Biomedical Informatics 09/2011; 45(1):61-70. · 2.13 Impact Factor
ABSTRACT: An algorithmically-derived abstraction network, called the partial-area taxonomy, for a SNOMED hierarchy has led to the identification of concepts considered complex. The designation "complex" is arrived at automatically on the basis of structural analyses of overlap among the constituent concept groups of the partial-area taxonomy. Such complex concepts, called overlapping concepts, constitute a tangled portion of a hierarchy and can be obstacles to users trying to gain an understanding of the hierarchy's content. A new methodology is presented for partitioning the entire collection of overlapping concepts into singly-rooted groups that are more manageable to work with and comprehend. Different kinds of overlapping concepts with varying degrees of complexity are identified. This leads to an abstract model of the overlapping concepts called the disjoint partial-area taxonomy, which serves as a vehicle for enhanced, high-level display. The methodology is demonstrated with an application to SNOMED's Specimen hierarchy. Overall, the resulting disjoint partial-area taxonomy offers a refined view of the hierarchy's structural organization and conceptual content that can aid users, such as maintenance personnel, working with SNOMED. The utility of the disjoint partial-area taxonomy as the basis for a SNOMED auditing regimen is presented in a companion paper.
Journal of Biomedical Informatics 08/2011; 45(1):15-29. · 2.13 Impact Factor
ABSTRACT: Little information exists concerning SNOMED CT (systematized nomenclature of medicine-clinical terms) users. This report describes current impressions and preferences of direct SNOMED CT users regarding coverage, quality, and concept details, and the change request mechanism.
A 43-question anonymous survey distributed electronically to relevant online communities.
Data on user demographic characteristics, modes and purposes of use, means and frequencies of access, satisfaction with SNOMED CT content coverage and quality and with the change request mechanism were recorded.
The survey was conducted in January 2010 and elicited 215 responses. Details regarding users' profiles, modes of use and access were reported elsewhere. The coverage of SNOMED CT was perceived to be at least 85% complete by 42% of responders, and 60% were at least satisfied with its quality. Various deficiencies were encountered at least 'somewhat often' by 28-61% of responders. Incorrect data were more bothersome than missing data. Users indicated that significant resources should be allocated to more consistent and complete conceptual representations and to further enhance content coverage. Enhanced synonym coverage and the introduction of textual definitions were important to users (54% and 63%, respectively).
A survey format with limited control over recruitment and selection bias. Lack of information regarding the SNOMED CT version used by responders.
Despite overall satisfaction, direct users indicated a strong desire to improve consistency, quality, and completeness of conceptual representations and concept details, as well as a continued desire to expand coverage. The survey provides much needed data for informed decisions regarding the use and development goals of SNOMED CT. Focused periodical surveys are warranted.
Journal of the American Medical Informatics Association 08/2011; 18 Suppl 1:i36-44. · 3.57 Impact Factor
ABSTRACT: A search engine user with a well-defined information need is not interested in getting thousands of hits, but a few hits that are all highly relevant to their search. Often search words need to be refined and augmented to narrow results to more relevant pages. However, an overly specific query may lead to no hits at all, while most typical queries lead to thousands or even millions of them, both undesirable outcomes. This paper suggests a query rewriting method for generating alternative query strings and proposes a hit count prediction model for predicting the number of search engine hits for each alternative query string, based on the English language frequencies of the words in the search terms. Using the hit count prediction model, different types of search strategies, such as a lowest hit count query preference, can be utilized to improve users’ search experience. We present an evaluation experiment of the hit count prediction model for three major search engines. We also discuss and quantify how far the Google, Yahoo! and Bing search engines diverge from monotonic behaviour, considering negative and positive search terms separately.
Journal of Information Science 01/2011; 37:462-475. · 1.24 Impact Factor
ABSTRACT: The terms that users pass to search engines are often ambiguous, by referring to homonyms. The search results in these cases are a mixture of links to documents that refer to different meanings of the search terms. Current search engines provide suggested query completion terms as a dropdown list. However, such lists are not well organized, mixing completions for different meanings. In addition, the suggested search phrases are not discriminating enough. We propose an approach to suggesting well organized and visually separated search completions based on ontologies. In addition, our approach supports the use of negative terms to disambiguate the suggested completions in the list. We present an algorithm to generate the suggested search completion terms. We have developed musician and basketball player ontologies and an Ontology-Supported Web Search (OSWS) System for “famous people” that generates the suggested search term completions, based on these ontologies.
Current Trends in Web Engineering - 10th International Conference on Web Engineering, ICWE 2010 Workshops, Vienna, Austria, July 2010, Revised Selected Papers; 01/2010
ABSTRACT: The UMLS contains terms from many sources. Every update of a source requires reintegration. Each new term needs to be assigned to a preexisting UMLS concept, or a new concept must be created. Whenever the integration process unnecessarily creates a new concept, this is undesirable. We report on a method to detect such undesirable duplicate concepts. Terms are removed from the UMLS and reintegrated using "piecewise synonym generation." The concept of the reintegrated term is programmatically compared to the initial concept of the term (before removal). If they are different, this indicates an error, either in the integration process or in the initial concept. Thus, such a term-concept pair is deemed suspicious. A study of five SNOMED hierarchies found 7.7% suspicious matches. A human expert needs to evaluate the correctness of suspicious concepts. In a sample of 149 of those, 19% of concepts were found to be duplicates.
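The remove-and-reintegrate check described in the abstract above can be sketched as a simple loop. This is an illustration only: the abstract does not specify how "piecewise synonym generation" works, so the matcher is passed in as a hypothetical `reintegrate` callable, and the toy matcher below (case-insensitive string equality) is merely a stand-in for it. The term and concept identifiers are invented.

```python
def find_suspicious_terms(term_to_concept, reintegrate):
    """Remove each term, reintegrate it, and flag concept mismatches.

    term_to_concept: dict mapping term string -> concept id.
    reintegrate: hypothetical callable (term, remaining_terms) -> concept id,
                 or None when it would create a new concept.
    """
    suspicious = []
    for term, original in term_to_concept.items():
        # Simulate removal of the term from the terminology.
        remaining = {t: c for t, c in term_to_concept.items() if t != term}
        reassigned = reintegrate(term, remaining)
        if reassigned != original:
            suspicious.append((term, original, reassigned))
    return suspicious

def toy_reintegrate(term, remaining):
    """Stand-in matcher: case-insensitive equality, NOT the paper's method."""
    for other, concept in remaining.items():
        if other.lower() == term.lower():
            return concept
    return None  # no match: a new concept would be created

# Toy data (hypothetical concept ids).
terms = {
    "Heart attack": "C1",
    "heart attack": "C1",
    "Myocardial infarction": "C1",
    "Cold": "C2",
    "cold": "C3",
}
flagged = find_suspicious_terms(terms, toy_reintegrate)
```

As in the abstract, a flagged pair is only suspicious: "Cold"/"cold" may point to a genuine duplicate or to two distinct meanings, while "Myocardial infarction" is flagged here only because the toy matcher cannot recognize synonyms. A human expert still has to review each case.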
ABSTRACT: Keyword-based search engines often return an unexpected number of results. Zero hits are naturally undesirable, while too many hits are likely to be overwhelming and of low precision. We present an approach for predicting the number of hits for a given set of query terms. Using word frequencies derived from a large corpus, we construct random samples of combinations of these words as search terms. Then we derive a correlation function between the computed probabilities of search terms and the observed hit counts for them. This regression function is used to predict the hit counts for a user's new searches, with the intention of avoiding information overload. We report the results of experiments with Google, Yahoo! and Bing to validate our methodology. We further investigate the monotonicity of search results for negative search terms by those three search engines.
2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010, Toronto, Canada, August 31 - September 3, 2010, Main Conference Proceedings; 01/2010
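The regression idea in the abstract above can be sketched roughly as follows. The abstract does not give the form of the regression or how query probabilities are computed, so this sketch makes two assumptions: words are treated as independent (query probability is the product of relative word frequencies), and a straight line is fitted by least squares in log-log space. All names and numbers are hypothetical.

```python
import math

def query_probability(words, freq):
    """Probability of a conjunctive query, assuming independent words."""
    p = 1.0
    for word in words:
        p *= freq[word]
    return p

def fit_log_linear(probs, hits):
    """Least-squares fit of log10(hits) = a * log10(prob) + b."""
    xs = [math.log10(p) for p in probs]
    ys = [math.log10(h) for h in hits]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def predict_hits(words, freq, a, b):
    """Predict the hit count of a new query from the fitted line."""
    return 10 ** (a * math.log10(query_probability(words, freq)) + b)

# Synthetic example: relative frequencies and "observed" hit counts are made up.
freq = {"alpha": 1e-2, "beta": 1e-3, "gamma": 1e-4}
sample_queries = [["alpha"], ["beta"], ["gamma"], ["alpha", "beta"]]
probs = [query_probability(q, freq) for q in sample_queries]
observed_hits = [1e10 * p for p in probs]  # stand-in for real search engine counts

a, b = fit_log_linear(probs, observed_hits)
predicted = predict_hits(["beta", "gamma"], freq, a, b)
```

Working in log space keeps the fit from being dominated by the most frequent queries, since hit counts span many orders of magnitude. Real hit counts are noisy, so actual fits would not be exact as in this synthetic example.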
ABSTRACT: SNOMED CT is gaining momentum and endorsements as an international clinical terminology. However, many vendors await a clearer business case and clients' demand. We conducted a survey of direct users of SNOMED CT to determine the current profile of users, modes of use, and attitudes towards different aspects of the terminology. A web-based survey, consisting of 43 questions, was distributed in January 2010, and 215 responses were elicited. This paper summarizes findings regarding profiles of users and their SNOMED CT use. The results indicate significant use by non-researchers and by industry and government sectors. Many users are relative newcomers with less than 3 years' experience with SNOMED CT, and production-related use was reported by 39% of respondents. Most users are satisfied with the level of content coverage. The results indicate that SNOMED CT has a solid footing in production systems, and that SCT is mostly used for concept searches and clinical coding.