... In the most basic mode of exploration, users can retrieve figures using keyword queries. To this end, several systems for figure retrieval were developed [4,6,7]. Using retrieval only, however, may not be sufficient for the effective exploration of research figure collections. ...
... Several figure search engines were previously developed for the bio-medical domain [4,6,7,10]. The BioText engine [4] enables the search of figures using keyword queries. ...
... In another system [10], figure search was also implemented with some basic exploration capabilities. The SLIF system [7] also performed figure retrieval but proposed a topic model approach to browse the result list. Finally, the FigSearch retrieval system [6] was tailored for the use-case of gene-related figures. ...
... The biologist can then click the "view associated topics" link below the displayed panel. The system will display only the topics addressed in this panel, and if one of these focused topics interests the biologist, they can then browse for more panels that show the pattern(s) captured by this topic by clicking on the browse button (see [11,12] for more details). ...
... We conducted a user study to validate the usability and usefulness of our technology. A detailed description of the study is given in [12]. Here, we only highlight the main aspects of the study. ...
The SLIF project combines text-mining and image processing to extract structured information from biomedical literature. SLIF extracts images and their captions from published papers. The captions are automatically parsed for relevant biological entities (protein and cell type names), while the images are classified according to their type (e.g., micrograph or gel). Fluorescence microscopy images are further processed and classified according to the depicted subcellular localization. The results of this process can be queried online using either a user-friendly web-interface or an XML-based web-service. As an alternative to the targeted query paradigm, SLIF also supports browsing the collection based on latent topic models which are derived from both the annotated text and the image data. The SLIF web application, as well as labeled datasets used for training system components, is publicly available at http://slif.cbi.cmu.edu.
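As a sketch of how these stages compose, the toy pipeline below mimics the caption-parse / type-classify / localization-classify flow. Every function, class, and regular expression here is illustrative, not SLIF's actual code or API; the stubs merely stand in for trained components.

```python
# Illustrative SLIF-style pipeline sketch; all names are hypothetical.
import re
from dataclasses import dataclass, field

@dataclass
class PanelRecord:
    caption: str
    proteins: list = field(default_factory=list)
    panel_type: str = "unknown"       # e.g., "fluorescence_micrograph", "gel"
    localization: str = ""            # set for fluorescence micrographs only

def parse_caption(caption):
    # Toy stand-in for SLIF's entity extractor: naively treat short
    # capitalized tokens as candidate protein/cell-type names.
    return re.findall(r"\b[A-Z][A-Za-z0-9-]{1,7}\b", caption)

def classify_panel(image):
    # Stand-in for the trained image-type classifier.
    return "fluorescence_micrograph"

def classify_localization(image):
    # Stand-in for the subcellular-location pattern classifier.
    return "nuclear"

def process_panel(image, caption):
    rec = PanelRecord(caption=caption, proteins=parse_caption(caption))
    rec.panel_type = classify_panel(image)
    if rec.panel_type == "fluorescence_micrograph":
        rec.localization = classify_localization(image)
    return rec

print(process_panel(None, "GFP-tagged Rab5 in HeLa cells."))
```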
... As an effective way to communicate research results, figures are especially useful in domains such as the biomedical domain. As a result, how to support biologists in searching for figures has attracted a significant amount of attention, and multiple systems have been developed [10,13,24]. These previous works focused on developing figure search engines from the application perspective, but none of these systems, nor the algorithms they use, has been evaluated in terms of retrieval accuracy. ...
In this paper, we introduce and study a new task of figure retrieval in which the retrieval units are figures of research articles and the task is to rank figures in response to a query. As a first step toward addressing this task, we focus on textual queries and represent a figure using text extracted from its article. We propose several retrieval methods for the task and study their effectiveness. We build a test collection using research articles from the ACL Anthology corpus, treating figure captions as queries. Although this data set has some limitations, it allowed us to obtain interesting preliminary results on the relative effectiveness of different figure representations and retrieval methods, which also shed some light on possible types of information need and potential challenges in figure retrieval.
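For concreteness, here is a minimal self-contained BM25 ranker over textual representations of figures. The toy "figures" are invented, and BM25 is shown as a representative text retrieval method, not necessarily the exact scoring studied in the paper.

```python
# Minimal BM25 scoring over figure text representations (toy example).
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (here: a figure's textual representation)."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                               # document frequencies
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

figures = ["attention weights heat map for the encoder",
           "parse tree of an example sentence",
           "BLEU scores versus training epochs"]
ranked = sorted(zip(bm25_scores("attention heat map", figures), figures),
                reverse=True)
print(ranked[0])   # best-matching figure text and its score
```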
Previous applications of information extraction methods to articles in biomedical journals have predominantly been based on interpreting article text. This often leads to uncertainty about whether statements that are found are attempts at reviews or summaries of data in other papers, conjectures or opinions, or conclusions from evidence presented in the paper at hand. The ability to extract information from the primary data presented in an article, which is often in the form of images, would allow more accurate information to be extracted. Towards this end, we have built a system that extracts information on one particular aspect of biology from a combination of text and image in journal articles. The design and performance of this system are described here, along with conclusions about possible improvements in the scientific publishing process that we have drawn from our implementation process.
Fluorescence microscope images capture information from an entire field of view, which often comprises several cells scattered on the slide. We have previously trained classifiers to accurately predict subcellular location patterns by using numerical features calculated from manually cropped 2D single-cell images. We describe here results on directly classifying fields of fluorescence microscope images using a subset of our previous features that do not require segmentation into single cells. Feature selection was conducted by stepwise discriminant analysis (SDA) to select the most discriminative features from the feature set. Better classification performance was achieved on multicell images than on single-cell images, suggesting a promising future for classifying subcellular patterns in tissue images.
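As a rough illustration of the select-then-classify pipeline, the sketch below substitutes scikit-learn's greedy forward SequentialFeatureSelector wrapped around a linear discriminant classifier for the paper's stepwise discriminant analysis, and runs on synthetic data standing in for field-level image features; it is not the authors' implementation.

```python
# Greedy forward feature selection + LDA as a rough analogue of SDA.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 300 "fields", 40 texture/morphology features, 5 classes.
X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           n_classes=5, random_state=0)

lda = LinearDiscriminantAnalysis()
selector = SequentialFeatureSelector(lda, n_features_to_select=10,
                                     direction="forward", cv=5)
X_sel = selector.fit_transform(X, y)            # keep 10 most useful features

acc = cross_val_score(lda, X_sel, y, cv=5).mean()
print(f"selected {X_sel.shape[1]} features, CV accuracy = {acc:.2f}")
```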
A major source of information (often the most crucial and informative part) in scholarly articles from scientific journals, proceedings and books is the figures that directly provide images and other graphical illustrations of key experimental results and other scientific contents. In biological articles, a typical figure often comprises multiple panels, accompanied by either scoped or global caption text. Moreover, the text in the caption contains important semantic entities such as protein names, gene ontology terms, tissue labels, etc., relevant to the images in the figure. Due to the avalanche of biological literature in recent years, and the increasing popularity of various bio-imaging techniques, automatic retrieval and summarization of biological information from literature figures has emerged as a major unsolved challenge in computational knowledge extraction and management in the life sciences. We present a new structured probabilistic topic model built on a realistic figure generation scheme to model structurally annotated biological figures, and we derive an efficient inference algorithm based on collapsed Gibbs sampling for information retrieval and visualization. The resulting program constitutes one of the key IR engines in our SLIF system, which has recently entered the final round (4 out of 70 competing systems) of the Elsevier Grand Challenge on Knowledge Enhancement in the Life Sciences. Here we present various evaluations on a number of data mining tasks to illustrate our method.
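The structured figure model itself is not reproduced here, but the inference style it builds on can be illustrated with a minimal collapsed Gibbs sampler for plain LDA; the corpus, vocabulary size, and hyperparameters below are toy values, not those used in the paper.

```python
# Collapsed Gibbs sampling for vanilla LDA (toy corpus of word-id lists).
import numpy as np

docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 3]]   # word ids per document
V, K, alpha, beta = 5, 2, 0.1, 0.01                 # vocab, topics, priors
rng = np.random.default_rng(0)

ndk = np.zeros((len(docs), K))        # document-topic counts
nkw = np.zeros((K, V))                # topic-word counts
nk = np.zeros(K)                      # topic totals
z = [[rng.integers(K) for _ in d] for d in docs]    # random init
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for it in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                               # remove assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # conditional p(z = k | everything else), up to a constant
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())          # resample topic
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

print("topic-word counts:\n", nkw)
```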
There is extensive interest in automating the collection, organization and summarization of biological data. Data in the form of figures and accompanying captions in the literature present special challenges for such efforts. Building on our previously developed search engines for finding fluorescence microscope images depicting protein subcellular patterns, we introduced text mining and Optical Character Recognition (OCR) techniques for caption understanding and figure-text matching, so as to build a robust, comprehensive toolset for extracting information about protein subcellular localization from the text and images found in online journals. Our current system can generate assertions such as "Figure N depicts a localization of type L for protein P in cell type C".
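The final assertion is a simple template over the extracted fields; the sketch below shows only that assembly step, with hypothetical example values standing in for the outputs of OCR, caption parsing, and image classification.

```python
# Assembling the assertion template from (hypothetical) extracted fields.
from dataclasses import dataclass

@dataclass
class FigureFact:
    figure: str        # figure/panel label, e.g. from OCR'd panel text
    localization: str  # from the image classifier
    protein: str       # from caption entity extraction
    cell_type: str     # from caption entity extraction

    def assertion(self):
        return (f"Figure {self.figure} depicts a localization of type "
                f"{self.localization} for protein {self.protein} "
                f"in cell type {self.cell_type}")

print(FigureFact("3B", "nuclear", "Rad51", "HeLa").assertion())
```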
This paper presents the results of a pilot usability study of a novel approach to search user interfaces for bioscience journal articles. The main idea is to support search over figure captions explicitly, and show the corresponding figures directly within the search results. Participants in a pilot study expressed surprise at the idea, noting that they had never thought of search in this way. They also reported strong positive reactions to the idea: 7 out of 8 said they would use a search system with this kind of feature, suggesting that this is a promising idea for journal article search.
A review is presented of the image processing literature on the various approaches and models investigators have used for textures. These include statistical approaches based on the autocorrelation function, optical transforms, digital transforms, textural edgeness, structural elements, gray tone co-occurrence, run lengths, and autoregressive models. A discussion and generalization of some structural approaches to texture, based on more complex primitives than gray tone, is presented. Some structural-statistical generalizations which apply the statistical techniques to the structural primitives are given.
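To make the co-occurrence idea concrete, here is a minimal NumPy sketch that builds a gray tone co-occurrence matrix for a single displacement and derives two classic statistics (contrast and energy). The definitions are the textbook ones, offered as illustration rather than a reproduction of any specific surveyed method.

```python
# Gray tone co-occurrence matrix (GLCM) and two texture statistics.
import numpy as np

def glcm(img, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for displacement (dx, dy)."""
    P = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            P[img[y, x], img[y + dy, x + dx]] += 1
    return P / P.sum()

img = np.array([[0, 0, 1, 1],          # tiny 4-level test image
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
P = glcm(img, levels=4)
i, j = np.indices(P.shape)
contrast = ((i - j) ** 2 * P).sum()    # weights large gray-tone jumps
energy = (P ** 2).sum()                # high for uniform textures
print(f"contrast={contrast:.3f}, energy={energy:.3f}")
```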
An object may be extracted from its background in a picture by threshold selection. Ideally, if the object has a different average gray level from that of its surroundings, thresholding will produce a white object on a black background, or vice versa. In practice, however, it is often difficult to select an appropriate threshold, and a technique is described whereby an optimum threshold may be chosen automatically through an iterative process, with successive iterations providing increasingly cleaner extractions of the object region. An application to low-contrast images of handwritten text is discussed.
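The iterative scheme described here is commonly implemented as the Ridler-Calvard (ISODATA) update: set the threshold to the midpoint of the means of the two classes it induces, and repeat until it stabilizes. A minimal sketch on a toy bimodal intensity sample:

```python
# Iterative (Ridler-Calvard / ISODATA) threshold selection.
import numpy as np

def iterative_threshold(img, tol=0.5):
    t = img.mean()                        # initial guess: global mean
    while True:
        lo, hi = img[img <= t], img[img > t]
        if lo.size == 0 or hi.size == 0:  # degenerate split; keep current t
            return t
        t_new = (lo.mean() + hi.mean()) / 2.0
        if abs(t_new - t) < tol:
            return t_new
        t = t_new

# Toy bimodal "image": dark background near 40, bright object near 200.
rng = np.random.default_rng(1)
img = np.concatenate([rng.normal(40, 10, 900), rng.normal(200, 20, 100)])
t = iterative_threshold(img)
binary = img > t                          # white object on black background
print(f"threshold = {t:.1f}, object pixels = {binary.sum()}")
```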
A picture is worth a thousand words. Biomedical researchers tend to incorporate a significant number of images (i.e., figures or tables) in their publications to report experimental results, to present research models, and to display examples of biomedical objects. Unfortunately, this wealth of information remains virtually inaccessible without automatic systems to organize these images. We explored supervised machine-learning systems using Support Vector Machines to automatically classify images into six representative categories based on text, image, and the fusion of both. Our experiments show a significant improvement in the average F-score of the fusion classifier (73.66%) compared with classifiers based only on image features (50.74%) or text features (68.54%).
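A minimal sketch of the fusion idea with scikit-learn: train the same SVM on text features, image features, and their concatenation (early fusion). The data below is synthetic, and the paper's actual features and fusion scheme may differ.

```python
# SVM on text features, image features, and their early fusion.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 600
y = rng.integers(0, 6, n)                                # six image categories
X_text = rng.normal(size=(n, 50)) + y[:, None] * 0.25    # toy text features
X_img = rng.normal(size=(n, 30)) + y[:, None] * 0.15     # toy image features

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
for name, X in [("text", X_text), ("image", X_img),
                ("fusion", np.hstack([X_text, X_img]))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:6s} accuracy = {acc:.2f}")
```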