Essie: A Concept Based Search Engine for Structured Biomedical Text

Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD 20894, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.5). 02/2007; 14(3):253-63. DOI: 10.1197/jamia.M2233
Source: PubMed


This article describes the algorithms implemented in the Essie search engine, which currently serves several Web sites at the National Library of Medicine. Essie is a phrase-based search engine with term and concept query expansion and probabilistic relevancy ranking. Essie's design is motivated by the observation that query terms are often conceptually related to terms in a document without actually occurring in the document text. Essie's performance was evaluated using data and standard evaluation methods from the 2003 and 2006 Text REtrieval Conference (TREC) Genomics track. Essie was the best-performing search engine in the 2003 TREC Genomics track and achieved results comparable to those of the highest-ranking systems on the 2006 TREC Genomics track task. Essie shows that a judicious combination of exploiting document structure, phrase searching, and concept-based query expansion is a useful approach for information retrieval in the biomedical domain.
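
The combination named in the abstract (phrase matching, concept-based query expansion, and use of document structure) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the concept table, the field weights, the sample documents, and the additive scoring are hypothetical and are not taken from Essie, whose actual ranking is probabilistic and is described in the full article.

```python
# Toy illustration only: the concept table, field weights, and scoring
# below are assumptions, not Essie's actual data or algorithm.

# Hypothetical concept table: each concept groups phrases with the same meaning.
CONCEPTS = {
    "C_HEART_ATTACK": ["heart attack", "myocardial infarction"],
    "C_ASPIRIN": ["aspirin", "acetylsalicylic acid"],
}

# Hypothetical weights: a hit in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 2.0, "body": 1.0}


def expand_query(query):
    """Return every phrase that shares a concept with the query phrase."""
    q = query.lower().strip()
    for phrases in CONCEPTS.values():
        if q in phrases:
            return list(phrases)
    return [q]  # unknown phrase: search for it as-is


def score(doc, query):
    """Sum the weights of the fields that contain any expanded phrase."""
    phrases = expand_query(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = doc.get(field, "").lower()
        if any(p in text for p in phrases):
            total += weight
    return total


def search(docs, query):
    """Return (score, doc id) pairs for matching documents, best first."""
    scored = [(score(d, query), d["id"]) for d in docs]
    return sorted([s for s in scored if s[0] > 0], key=lambda s: (-s[0], s[1]))


DOCS = [
    {"id": "d1", "title": "Myocardial infarction outcomes", "body": "A cohort study."},
    {"id": "d2", "title": "Diet and exercise", "body": "Risk of myocardial infarction fell."},
    {"id": "d3", "title": "Sleep patterns", "body": "No cardiac findings."},
]

print(search(DOCS, "heart attack"))
```

None of the sample documents contains the literal phrase "heart attack", yet two of them are retrieved through the shared concept, and the title match ranks above the body match. This mirrors the motivation stated in the abstract, under the simplifications noted above.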

  • Source
    • "UMLSonMedline, created by NLM, consists of concepts from the 2009AB UMLS and the number of times they occurred in a snapshot of MEDLINE taken on 12/01/2009. The frequency counts were obtained by using the Essie Search Engine [16]"
    ABSTRACT: Automatic processing of biomedical documents is made difficult by the fact that many of the terms they contain are ambiguous. Word Sense Disambiguation (WSD) systems attempt to resolve these ambiguities and identify the correct meaning. However, the published literature on WSD systems for biomedical documents reports considerable differences in performance for different terms. The development of WSD systems is often expensive with respect to acquiring the necessary training data. It would therefore be useful to be able to predict in advance which terms WSD systems are likely to perform well or badly on. This paper explores various methods for estimating the performance of WSD systems on a wide range of ambiguous biomedical terms (including ambiguous words/phrases and abbreviations). The methods include both supervised and unsupervised approaches. The supervised approaches make use of information from labeled training data, while the unsupervised ones rely on the UMLS Metathesaurus. The approaches are evaluated by comparing their predictions about how difficult disambiguation will be for ambiguous terms against the output of two WSD systems. We find the supervised methods are the best predictors of WSD difficulty, but are limited by their dependence on labeled training data. The unsupervised methods all perform well in some situations and can be applied more widely.
    Full-text · Article · Sep 2013 · Journal of Biomedical Informatics
  • Source
    • "In the context of ImageCLEF evaluation, each ad hoc topic contained a short sentence or phrase describing the search request in a few words with one to several relevant sample images. For our multi-modal search approach, the descriptions of the topics were used as the search terms to search NLM's Essie search engine [20], and sample images were utilized as 'Query By Example (QBE)' for the CBIR search. For the CBIR search of several sample query images of a topic, we obtained separate ranked result lists."
    ABSTRACT: Images are frequently used in articles to convey essential information in context with correlated text. However, searching images in a task-specific way poses significant challenges. To minimize limitations of low-level feature representations in content-based image retrieval (CBIR), and to complement text-based search, we propose a multi-modal image search approach that exploits hierarchical organization of modalities and employs both intra- and inter-modality fusion techniques. For the CBIR search, several visual features were extracted to represent the images. Modality-specific information was used for similarity fusion and selection of a relevant image subset. Intra-modality fusion of retrieval results was performed by searching images for specific informational elements. Our methods use text extracted from relevant components in a document to create structured representations as “enriched citations” for the text-based search approach. Finally, the multi-modal search consists of a weighted linear combination of similarity scores of independent output results from textual and visual search approaches (inter-modality). Search results were evaluated using a standard ImageCLEFmed 2012 evaluation dataset of 300,000 images with associated annotations. We achieved a mean average precision (MAP) score of 0.2533, which is statistically significant, and better in performance (7% improvement) over comparable results in ImageCLEFmed 2012.
    Full-text · Article · Sep 2013
  • Source
    • "For example, searching ''diabetes mellitus, type II'' on returns a list of more than 5,000 trials (as of April 2013), which are sorted just by their probabilistic relevance to the search terms, with those containing the query in the title ranked highest [7]. Supplying additional parameters, such as location or study type, can only modestly improve search specificity, especially for searches of eligibility criteria. "
    ABSTRACT: Objective: Information overload is a significant problem facing online clinical trial searchers. We present eTACTS, a novel interactive retrieval framework using common eligibility tags to dynamically filter clinical trial search results. Materials and Methods: eTACTS mines frequent eligibility tags from free-text clinical trial eligibility criteria and uses these tags for trial indexing. After an initial search, eTACTS presents to the user a tag cloud representing the current results. When the user selects a tag, eTACTS retains only those trials containing that tag in their eligibility criteria and generates a new cloud based on tag frequency and co-occurrences in the remaining trials. The user can then select a new tag or unselect a previous tag. The process iterates until a manageable number of trials is returned. We evaluated eTACTS in terms of filtering efficiency, diversity of the search results, and user eligibility to the filtered trials using both qualitative and quantitative methods. Results: eTACTS (1) rapidly reduced search results from over a thousand trials to ten; (2) highlighted trials that are generally not top-ranked by conventional search engines; and (3) retrieved a greater number of suitable trials than existing search engines. Discussion: eTACTS enables intuitive clinical trial searches by indexing eligibility criteria with effective tags. User evaluation was limited to one case study and a small group of evaluators due to the long duration of the experiment. Although a larger-scale evaluation could be conducted, this feasibility study demonstrated significant advantages of eTACTS over existing clinical trial search engines. Conclusion: A dynamic eligibility tag cloud can potentially enhance state-of-the-art clinical trial search engines by allowing intuitive and efficient filtering of the search result space.
    Full-text · Article · Aug 2013 · Journal of Biomedical Informatics