The goal of the TREC Genomics Track is to improve information retrieval in the area of genomics by creating test collections that will allow researchers to improve and better understand failures of their systems. The 2004 track included an ad hoc retrieval task, simulating use of a search engine to obtain documents about biomedical topics. This paper describes the Genomics Track of the Text Retrieval Conference (TREC) 2004, a forum for evaluation of IR research systems, where retrieval in the genomics domain has recently begun to be assessed.
A total of 27 research groups submitted 47 different runs. The most effective runs, as measured by the primary evaluation measure of mean average precision (MAP), used a combination of domain-specific and general techniques. The best MAP obtained by any run was 0.4075. Techniques that expanded queries with gene name lists as well as words from related articles had the best efficacy. However, many runs performed more poorly than a simple baseline run, indicating that careful selection of system features is essential.
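As a rough illustration of the primary evaluation measure named above, the following minimal sketch computes mean average precision (MAP) from ranked result lists and relevance judgments. The function names and the dictionary-based input format are assumptions made for this example; they are not the track's official evaluation software.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant document IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document was found.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-topic average precision over all judged topics."""
    if not qrels:
        return 0.0
    scores = [average_precision(runs.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores)
```

Here `runs` maps a (hypothetical) topic ID to the ranked document IDs a system returned, and `qrels` maps the same topic ID to the set of documents judged relevant. A topic missing from `runs` is scored as zero rather than skipped, so a system cannot raise its MAP by omitting hard topics.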
Approaches to ad hoc retrieval vary widely in efficacy. The TREC Genomics Track and its test collection resources provide tools that allow improvement in information retrieval systems.
The TREC 2004 Genomics Track focused on applying information retrieval and text mining techniques to improve the use of genomic information in biomedicine. The Genomics Track consisted of two main tasks, ad hoc retrieval and document categorization. In this paper, we describe the categorization task, which focused on the classification of full-text documents, simulating the task of curators of the Mouse Genome Informatics (MGI) system and consisting of three subtasks. One subtask of the categorization task required the triage of articles likely to have experimental evidence warranting the assignment of GO terms, while the other two subtasks were concerned with the assignment of the three top-level GO categories to each paper containing evidence for these categories.
The track had 33 participating groups. The mean utility measure for the triage subtask was 0.3303, with a top score of 0.6512. No system was able to substantially improve results over simply using the MeSH term Mice. Significant feature overlap between the training and test sets was found to be less than expected. Sample coverage of GO terms assigned to papers in the collection was very sparse. Determining papers containing GO term evidence will likely need to be treated as separate tasks for each concept represented in GO, and will therefore require much denser sampling than was available in the data sets. The annotation subtask had a mean F-measure of 0.3824, with a top score of 0.5611. The mean F-measure for the annotation plus evidence codes subtask was 0.3676, with a top score of 0.4224. Gene name recognition was found to be of benefit for this task.
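The F-measure reported for the annotation subtasks can be sketched as follows. This example assumes the balanced form (the harmonic mean of precision and recall) and represents each annotation as a hypothetical (paper ID, GO category) pair; the abstract does not spell out the exact variant used by the track, so both choices are assumptions.

```python
from typing import Set, Tuple

Annotation = Tuple[str, str]  # (paper ID, GO category), hypothetical format


def f_measure(predicted: Set[Annotation], gold: Set[Annotation]) -> float:
    """Balanced F-measure of predicted annotations against gold-standard ones."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Scoring over pairs rather than over papers means that a system is credited separately for each top-level category it assigns correctly to a paper.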
Automated classification of documents for GO annotation is a challenging task, as is the automated extraction of GO code hierarchies and evidence codes. However, automating these tasks would provide substantial benefit to biomedical curation, and therefore work in this area must continue. Additional experience will allow comparison and further analysis of which algorithmic features are most useful in biomedical document classification, and a better understanding of the task characteristics that make automated classification feasible and useful for biomedical document curation. The TREC Genomics Track will continue in 2005, focusing on a wider range of triage tasks and on improving results from 2004.
It is often said that the life sciences are transforming into an information science. As laboratory experiments are starting to yield ever increasing amounts of data and the capacity to deal with those data is catching up, an increasing share of scientific activity is seen to be taking place outside the laboratories, sifting through the data and modelling "in silico" the processes observed "in vitro." The transformation of the life sciences and similar developments in other disciplines have inspired a variety of initiatives around the world to create technical infrastructure to support the new scientific practices that are emerging. The e-Science programme in the United Kingdom and the NSF Office for Cyberinfrastructure are examples of these. In Switzerland there have been no such national initiatives. Yet, this has not prevented scientists from exploring the development of similar types of computing infrastructures. In 2004, a group of researchers in Switzerland established a project, SwissBioGrid, to explore whether Grid computing technologies could be successfully deployed within the life sciences. This paper presents their experiences as a case study of how the life sciences are currently operating as an information science and presents the lessons learned about how existing institutional and technical arrangements facilitate or impede this operation.
SwissBioGrid gave rise to two pilot projects: one for proteomics data analysis and the other for high-throughput molecular docking ("virtual screening") to find new drugs for neglected diseases (specifically, for dengue fever). The proteomics project was an example of a data management problem, applying many different analysis algorithms to Terabyte-sized datasets from mass spectrometry, involving comparisons with many different reference databases; the virtual screening project was more a purely computational problem, modelling the interactions of millions of small molecules with a limited number of protein targets on the coat of the dengue virus. Both present interesting lessons about how scientific practices are changing when they tackle the problems of large-scale data analysis and data management by means of creating a novel technical infrastructure.
In the experience of SwissBioGrid, data intensive discovery has a lot to gain from close collaboration with industry and harnessing distributed computing power. Yet the diversity in life science research implies only a limited role for generic infrastructure; and the transience of support means that researchers need to integrate their efforts with others if they want to sustain the benefits of their success, which are otherwise lost.
Two-dimensional (2D) videoconferencing has been explored widely in the past 15-20 years to support collaboration in healthcare. Two issues that arise in most evaluations of 2D videoconferencing in telemedicine are the difficulty obtaining optimal camera views and poor depth perception. To address these problems, we are exploring the use of a small array of cameras to reconstruct dynamic three-dimensional (3D) views of a remote environment and of events taking place within. The 3D views could be sent across wired or wireless networks to remote healthcare professionals equipped with fixed displays or with mobile devices such as personal digital assistants (PDAs). The remote professionals' viewpoints could be specified manually or automatically (continuously) via user head or PDA tracking, giving the remote viewers head-slaved or hand-slaved virtual cameras for monoscopic or stereoscopic viewing of the dynamic reconstructions. We call this idea remote 3D medical collaboration. In this article we motivate and explain the vision for 3D medical collaboration technology; we describe the relevant computer vision, computer graphics, display, and networking research; we present a proof-of-concept prototype system; and we present evaluation results supporting the general hypothesis that 3D remote medical collaboration technology could offer benefits over conventional 2D videoconferencing in emergency healthcare.
Knowledge bases that summarize the published literature provide useful online references for specific areas of systems-level biology that are not otherwise supported by large-scale databases. In the field of neuroanatomy, groups of small focused teams have constructed medium size knowledge bases to summarize the literature describing tract-tracing experiments in several species. Despite years of collation and curation, these databases only provide partial coverage of the available published literature. Given that the scientists reading these papers must all generate the interpretations that would normally be entered into such a system, we attempt here to provide general-purpose annotation tools to make it easy for members of the community to contribute to the task of data collation.
In this paper, we describe an open-source, freely available knowledge management system called 'NeuroScholar' that allows straightforward structured markup of PDF files according to a well-designed schema to capture the essential details of this class of experiment. Although the example worked through in this paper is quite specific to neuroanatomical connectivity, the design is freely extensible and could conceivably be used to construct local knowledge bases for other experiment types. Knowledge representations of the experiment are also directly linked to the contributing textual fragments from the original research article. Through the use of this system, not only could members of the community contribute to the collation task, but input data can be gathered for automated approaches to permit knowledge acquisition through the use of Natural Language Processing (NLP).
We present a functional, working tool to permit users to populate knowledge bases for neuroanatomical connectivity data from the literature through the use of structured questionnaires. This system is open-source, fully functional and available for download from .
Background. We are witnessing an exponential increase in biomedical research citations in PubMed. However, translating biomedical discoveries into practical treatments is estimated to take around 17 years, according to the 2000 Yearbook of Medical Informatics, and much information is lost during this transition. Pharmaceutical companies spend huge sums to identify opinion leaders and centers of excellence. Conventional methods such as literature search, survey, observation, self-identification, expert opinion, and sociometry not only need much human effort, but are also noncomprehensive. Such huge delays and costs can be reduced by “connecting those who produce the knowledge with those who apply it”. A humble step in this direction is large scale discovery of persons and organizations involved in specific areas of research. This can be achieved by automatically extracting and disambiguating author names and affiliation strings retrieved through Medical Subject Heading (MeSH) terms and other keywords associated with articles in PubMed. In this study, we propose NEMO (Normalization Engine for Matching Organizations), a system for extracting organization names from the affiliation strings provided in PubMed abstracts, building a thesaurus (list of synonyms) of organization names, and subsequently normalizing them to a canonical organization name using the thesaurus.
Results: We used a parsing process that involves multi-layered rule matching with multiple dictionaries. The normalization process involves clustering based on weighted local sequence alignment metrics to address synonymy at word level, and local learning based on finding connected components to address synonymy. The graphical user interface and Java client library of NEMO are available at http://lnxnemo.sourceforge.net
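The connected-components step mentioned above can be pictured with a small sketch. It assumes that some earlier stage has already decided which pairs of organization names match (in NEMO this would come from the alignment-based clustering, which is not reproduced here); the function name and input format are hypothetical and this is not NEMO's actual implementation.

```python
from typing import Dict, Iterable, List, Tuple


def group_synonyms(
    names: Iterable[str], matched_pairs: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Group names into synonym sets: the connected components of the match graph."""
    parent: Dict[str, str] = {name: name for name in names}

    def find(x: str) -> str:
        # Follow parent links to the component root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        # Names that appear only in a pair are added as new nodes.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[str, List[str]] = {}
    for name in list(parent):
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

Because components are transitive, two variants that were never directly matched still end up in the same synonym set when a chain of matches links them. This is the reason a graph-based grouping can recover synonyms that pairwise string similarity alone would miss.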
Conclusion: NEMO was developed to associate each biomedical paper and its authors with a unique organization name and the geopolitical location of that organization. This system provides more accurate information about organizations than the raw affiliation strings provided in PubMed abstracts. It can be used for: a) bimodal social network analysis that evaluates the research relationships between individual researchers and their institutions; b) improving author name disambiguation; c) augmenting National Library of Medicine (NLM)’s Medical Articles Record System (MARS) for correcting errors due to OCR on affiliation strings that are in small fonts; and d) improving PubMed citation indexing strategies (authority control) based on normalized organization name and country.
Understanding how humans differ from other animals, as well as how we are like them, requires comparative investigations. For the purpose of documenting the distinctive features of humans, the most informative research involves comparing humans to our closest relatives, the chimpanzees and other great apes. Psychology and anthropology have maintained a tradition of empirical comparative research on human specializations of cognition. The neurosciences, by contrast, have been dominated by the model-animal research paradigm, which presupposes the commonality of "basic" features of brain organization across species and discourages serious treatment of species differences. As a result, the neurosciences have made little progress in understanding human brain specializations. Recent developments in neuroimaging, genomics, and other non-invasive techniques make it possible to directly compare humans and nonhuman species at levels of organization that were previously inaccessible, offering the hope of gaining a better understanding of the species-specific features of the human brain. This hope will be dashed, however, if chimpanzees and other great ape species become unavailable for even non-invasive research.
Bioinformatics visualization tools are often not robust enough to support biomedical specialists’ complex exploratory analyses. Tools need to accommodate the workflows that scientists actually perform for specific translational research questions. To understand and model one of these workflows, we conducted a case-based, cognitive task analysis of a biomedical specialist’s exploratory workflow for the question: What functional interactions among gene products of high throughput expression data suggest previously unknown mechanisms of a disease?
From our cognitive task analysis four complementary representations of the targeted workflow were developed. They include: usage scenarios, flow diagrams, a cognitive task taxonomy, and a mapping between cognitive tasks and user-centered visualization requirements. The representations capture the flows of cognitive tasks that led a biomedical specialist to inferences critical to hypothesizing. We created representations at levels of detail that could strategically guide visualization development, and we confirmed this by making a trial prototype based on user requirements for a small portion of the workflow.
Our results imply that visualizations should make available to scientific users “bundles of features” consonant with the compositional cognitive tasks purposefully enacted at specific points in the workflow. We also highlight certain aspects of visualizations that: (a) need more built-in flexibility; (b) are critical for negotiating meaning; and (c) are necessary for essential metacognitive support.
PubMed is designed to provide rapid, comprehensive retrieval of papers that discuss a given topic. However, because PubMed does not organize the search output further, it is difficult for users to grasp an overview of the retrieved literature according to non-topical dimensions, to drill-down to find individual articles relevant to a particular individual's need, or to browse the collection.
In this paper, we present Anne O'Tate, a web-based tool that processes articles retrieved from PubMed and displays multiple aspects of the articles to the user, according to pre-defined categories such as the "most important" words found in titles or abstracts; topics; journals; authors; publication years; and affiliations. Clicking on a given item opens a new window that displays all papers that contain that item. One can navigate by drilling down through the categories progressively, e.g., one can first restrict the articles according to author name and then restrict that subset by affiliation. Alternatively, one can expand small sets of articles to display the most closely related articles. We also implemented a novel cluster-by-topic method that generates a concise set of topics covering most of the retrieved articles.
Anne O'Tate is an integrated, generic tool for summarization, drill-down and browsing of PubMed search results that accommodates a wide range of biomedical users and needs. It can be accessed at 4. Peer review and editorial matters for this article were handled by Aaron Cohen.
Annotation of proteins with gene ontology (GO) terms is ongoing work and a complex task. Manual GO annotation is precise and precious, but it is time-consuming. Therefore, instead of curated annotations most of the proteins come with uncurated annotations, which have been generated automatically. Text-mining systems that use literature for automatic annotation have been proposed but they do not satisfy the high quality expectations of curators.
In this paper we describe an approach that links uncurated annotations to text extracted from literature. The selection of the text is based on the similarity of the text to the term from the uncurated annotation. Besides substantiating the uncurated annotations, the extracted texts also lead to novel annotations. In addition, the approach uses the GO hierarchy to achieve high precision. Our approach is integrated into GOAnnotator, a tool that assists the curation process for GO annotation of UniProt proteins.
The GO curators assessed GOAnnotator with a set of 66 distinct UniProt/SwissProt proteins with uncurated annotations. GOAnnotator provided correct evidence text at 93% precision. This high precision results from using the GO hierarchy to only select GO terms similar to GO terms from uncurated annotations in GOA. Our approach is the first one to achieve high precision, which is crucial for the efficient support of GO curators. GOAnnotator was implemented as a web tool that is freely available at http://xldb.di.fc.ul.pt/rebil/tools/goa/.
ABSTRACT : Occasionally, multiple names are given to the same gene/protein. When this happens, different names can be used in subsequent publications, for example in different research areas, sometimes with little or no awareness that the same entity known under a different name may have a major role in another field of science. Recent reports about the protein p11 presented findings that this protein, commonly known as S100A10, may play a crucial role in depression and antidepressant treatment mechanisms. One set of data showed an increased expression of this protein in the brain of mice treated with antidepressants. P11/S100A10 is only one of several S100 proteins expressed in the brain. Interestingly, it has been previously noted that antidepressant treatment increases the brain content of another S100 protein, S100B. It appears that up-regulating the brain content of various S100 proteins might be a common feature of antidepressants. In cells coexpressing S100A10 and S100B, these proteins may interact and exert opposite regulatory roles. Nevertheless, S100A10 is predominantly expressed in certain types of neurons whereas S100B is more abundant in glia. Thus, an interplay among multiple members of the S100 proteins might be important in determining the region and cell specificity of antidepressant mechanisms. Calling the p11 protein by its other name, S100A10, may prompt more investigators from different fields to participate in this new direction of neurobiological research.
Arrowsmith is a unique computer-assisted strategy designed to assist investigators in detecting biologically-relevant connections between two disparate sets of articles in Medline. This paper describes how an inter-institutional consortium of neuroscientists used the UIC Arrowsmith web interface http://arrowsmith.psych.uic.edu in their daily work and guided the development, refinement and expansion of the system into a suite of tools intended for use by the wider scientific community.
The annual James Arthur lecture series on the Evolution of the Human Brain was inaugurated at the American Museum of Natural History in 1932, through a bequest from a successful manufacturer with a particular interest in mechanisms. Karl Pribram's thirty-ninth lecture of the series, delivered in 1970, was a seminal event that heralded much of the research agenda, since pursued by representatives of diverse disciplines, that touches on the evolution of human uniqueness.
In his James Arthur lecture Pribram raised questions about the coding of information in the brain and about the complex association between language, symbol, and the unique human cognitive system. These questions are as pertinent today as in 1970. The emergence of modern human symbolic cognition is often viewed as a gradual, incremental process, governed by inexorable natural selection and propelled by the apparent advantages of increasing intelligence. However, there are numerous theoretical considerations that render such a scenario implausible, and an examination of the pattern of acquisition of behavioral and anatomical novelties in human evolution indicates that, throughout, major change was both sporadic and rare. What is more, modern bony anatomy and brain size were apparently both achieved well before we have any evidence for symbolic behavior patterns. This suggests that the biological substrate underlying the symbolic thought that is so distinctive of Homo sapiens today was exaptively achieved, long before its potential was actually put to use. In which case we need to look for the agent, perforce a cultural one, that stimulated the adoption of symbolic thought patterns. That stimulus may well have been the spontaneous invention of articulate language.
What makes man human is his brain. This brain is obviously different from those of nonhuman primates. It is larger, shows hemispheric dominance and specialization, and is cytoarchitecturally somewhat more generalized. But are these the essential characteristics that determine the humanness of man? This paper cannot give an answer to this question for the answer is not known. But the problem can be stated more specifically, alternatives spelled out on the basis of available research results, and directions given for further inquiry. My theme will be that the human brain is so constructed that man, and only man, feels the thrust to make meaningful all his experiences and encounters. Development of this theme demands an analysis of the brain mechanisms that make meaning, and an attempt to define biologically the process of meaning. In this pursuit of meaning a fascinating variety of topics comes into focus: the coding and recoding operations of the brain; how it engenders and processes information and redundancy; and how it makes possible signs and symbols and propositional utterances. Of these, current research results indicate that only in the making of propositions is man unique, so here perhaps are to be found the keynotes that compose the theme.
This informal tutorial is intended for investigators and students who would like to understand the workings of information retrieval systems, including the most frequently used search engines: PubMed and Google. Having a basic knowledge of the terms and concepts of information retrieval should improve the efficiency and productivity of searches. As well, this knowledge is needed in order to follow current research efforts in biomedical information retrieval and text mining that are developing new systems not only for finding documents on a given topic, but extracting and integrating knowledge across documents.
The term blue skies research implies a freedom to carry out flexible, curiosity-driven research that leads to outcomes not envisaged at the outset. This research often challenges accepted thinking and introduces new fields of study. Science policy in the UK has given growing support for short-term goal-oriented scientific research projects, with pressure being applied on researchers to demonstrate the future application of their work. These policies carry the risk of restricting freedom, curbing research direction, and stifling rather than stimulating the creativity needed for scientific discovery.
This study tracks the tortuous routes that led to three major discoveries in cardiology. It then investigates the constraints in current research, and opportunities that may be lost with existing funding processes, by interviewing selected scientists and fund providers for their views on curiosity-driven research and the freedom needed to allow science to flourish. The transcripts were analysed using a grounded theory approach to gather recurrent themes from the interviews.
The results from these interviews suggest that scientists often cannot predict the future applications of research. Constraints such as lack of scientific freedom, and a narrow focus on relevance and accountability were believed to stifle the discovery process. Although it was acknowledged that some research projects do need a clear and measurable framework, the interviewees saw a need for inquisitive, blue skies research to be managed in a different way. They provided examples of situations where money allocated to 'safe' funding was used for more innovative research.
This sample of key UK scientists and grant providers acknowledge the importance of basic blue skies research. Yet the current evaluation process often requires that scientists predict their likely findings and estimate short-term impact, which does not permit freedom of research direction. There is a vital need for prominent scientists and for universities to help the media, the public, and policy makers to understand the importance of innovative thought along with the need for scientists to have the freedom to challenge accepted thinking. Encouraging an avenue for blue skies research could have immense influence over future scientific discoveries.
The Center for Behavioral Neuroscience was launched in the fall of 1999 with support from the National Science Foundation, the Georgia Research Alliance, and our eight participating institutions (Georgia State University, Emory University, Georgia Institute of Technology, Morehouse School of Medicine, Clark-Atlanta University, Spelman College, Morehouse College, Morris Brown College). The CBN provides the resources to foster innovative research in behavioral neuroscience, with a specific focus on the neurobiology of social behavior. Center faculty working in collaboratories use diverse model systems from invertebrates to humans to investigate fear, aggression, affiliation, and reproductive behaviors. The addition of new research foci in reward and reinforcement, memory and cognition, and sex differences has expanded the potential for collaborations among Center investigators. Technology core laboratories develop the molecular, cellular, systems, behavioral, and imaging tools essential for investigating how the brain influences complex social behavior and, in turn, how social experience influences brain function.
In addition to scientific discovery, a major goal of the CBN is to train the next generation of behavioral neuroscientists and to increase the number of women and under-represented minorities in neuroscience. Educational programs are offered for K-12 students to spark an interest in science. Undergraduate and graduate initiatives encourage students to participate in interdisciplinary and inter-institutional programs, while postdoctoral programs provide a bridge between laboratories and allow the interdisciplinary research and educational ventures to flourish. Finally, the CBN is committed to knowledge transfer, partnering with community organizations to bring neuroscience to the public. This multifaceted approach through research, education, and knowledge transfer will have a major impact on how we study interactions between the brain and behavior, as well as how the public views brain function and neuroscience.
Large‐scale electronic health record research introduces biases compared to traditional
manually curated retrospective research. We used data from a community‐acquired pneumonia
study for which we had a gold standard to illustrate such biases. The challenges include data
inaccuracy, incompleteness, and complexity, and they can produce distorted results. We
found that a naïve approach approximated the gold standard, but errors on a minority of cases
shifted mortality substantially. Manual review revealed errors in both selecting and
characterizing the cohort, and narrowing the cohort improved the result. Nevertheless, a
significantly narrowed cohort might contain its own biases that would be difficult to estimate.
For a neurobiologist, the core of human nature is the human cerebral cortex, especially the prefrontal areas, and the question "what makes us human?" translates into studies of the development and evolution of the human cerebral cortex, a clear oversimplification. In this comment, after pointing out this oversimplification, I would like to show that it is impossible to understand our cerebral cortex if we focus too narrowly on it. Like other organs, our cortex evolved from that in stem amniotes, and it still bears marks of that ancestry. More comparative studies of brain development are clearly needed if we want to understand our brain in its historical context. Similarly, comparative genomics is a superb tool to help us understand evolution, but again, studies should not be limited to mammals or to comparisons between human and chimpanzee, and more resources should be invested in investigation of many vertebrate phyla. Finally, the most widely used rodent models for studies of cortical development are of obvious interest but they cannot be considered models of a "stem cortex" from which the human type evolved. It remains of paramount importance to study cortical development directly in other species, particularly in primate models, and, whenever ethically justifiable, in human.
Nanotechnology research has lately been of intense interest because of its perceived potential for many diverse fields of science. Nanotechnology's tools have found application in diverse fields, from biology to device physics. By the 1990s, there was a concerted effort in the United States to develop a national initiative to promote such research. The success of this effort led to a significant influx of resources and interest in nanotechnology and nanobiotechnology and to the establishment of centralized research programs and facilities. Further government initiatives (at federal, state, and local levels) have firmly cemented these disciplines as 'big science,' with efforts increasingly concentrated at select laboratories and centers. In many respects, these trends mirror certain changes in academic science over the past twenty years, with a greater emphasis on applied science and research that can be more directly utilized for commercial applications.
We also compare the National Nanotechnology Initiative and its successors to the Human Genome Project, another large-scale, government-funded initiative. These precedents made shifts in nanotechnology easier for researchers to accept, as they followed trends already established within most fields of science. Finally, these trends are examined in the design of technologies for detection and treatment of cancer, through the Alliance for Nanotechnology in Cancer initiative of the National Cancer Institute. Federal funding of these nanotechnology initiatives has allowed for expansion into diverse fields and provided the impetus for expanding the scope of research in several fields, especially biomedicine, though the ultimate utility and impact of all these efforts remain to be seen.
Data management and integration are complicated and ongoing problems that will require commitment of resources and expertise from the various biological science communities. Primary components of successful cross-scale integration are smooth information management and migration from one context to another. We call for a broadening of the definition of bioinformatics and bioinformatics training to span biological disciplines and biological scales. Training programs are needed that educate a new kind of informatics professional, Biological Information Specialists, to work in collaboration with various discipline-specific research personnel. Biological Information Specialists are an extension of the informationist movement that began within library and information science (LIS) over 30 years ago as a professional position to fill a gap in clinical medicine. These professionals will help advance science by improving access to scientific information and by freeing scientists who are not interested in data management to concentrate on their science.
Current usability studies of bioinformatics tools suggest that tools for exploratory analysis support some tasks related to finding relationships of interest but not the deep causal insights necessary for formulating plausible and credible hypotheses. To better understand design requirements for gaining these causal insights in systems biology analyses, a longitudinal field study of 15 biomedical researchers was conducted. Researchers interacted with the same protein-protein interaction tools to discover possible disease mechanisms for further experimentation.
Findings reveal patterns in scientists' exploratory and explanatory analysis and show that tools positively supported a number of well-structured query and analysis tasks. But for several of scientists' more complex, higher order ways of knowing and reasoning, the tools did not offer adequate support. Results show that, for a better fit with scientists' cognition for exploratory analysis, systems biology tools need to better match scientists' processes for validating, for making a transition from classification to model-based reasoning, and for engaging in causal mental modelling.
As the next great frontier in bioinformatics usability, tool designs for exploratory systems biology analysis need to move beyond the successes already achieved in supporting formulaic query and analysis tasks and now reduce current mismatches with several of scientists' higher order analytical practices. The implications of results for tool designs are discussed.
Background. EpiphaNet is an interactive knowledge discovery system which enables researchers to explore visually sets of relations extracted from MEDLINE using a combination of language processing techniques. In this paper, we discuss the theoretical and methodological foundations of the system, and evaluate the utility of the models that underlie it for literature-based discovery. In addition, we present a summary of results drawn from a qualitative analysis of over six hours of interaction with the system by basic medical scientists.
The system is able to simulate open and closed discovery, and is shown to generate associations that are both surprising and interesting within the area of expertise of the researchers concerned.
EpiphaNet provides an interactive visual representation of associations between concepts, which is derived from distributional statistics drawn from across the spectrum of biomedical citations in MEDLINE. This tool is available online, providing biomedical scientists with the opportunity to identify and explore associations of interest to them.
Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice. This paper presents a publicly available framework that facilitates thorough, structured, and large-scale evaluations of text mining technologies. The extensibility of this framework and its ability to uncover system-wide characteristics by analyzing component parts as well as its usefulness for facilitating third-party application integration are demonstrated through examples in the biomedical domain.
Our evaluation framework was assembled using the Unstructured Information Management Architecture. It was used to analyze a set of gene mention identification systems involving 225 combinations of system, evaluation corpus, and correctness measure. Interactions between all three were found to affect the relative rankings of the systems. A second experiment evaluated gene normalization system performance using as input 4,097 combinations of gene mention systems and gene mention system-combining strategies. Gene mention system recall is shown to affect gene normalization system performance much more than does gene mention system precision, and high gene normalization performance is shown to be achievable with remarkably low levels of gene mention system precision.
The software presented in this paper demonstrates the potential for novel discovery resulting from the structured evaluation of biomedical language processing systems, as well as the usefulness of such an evaluation framework for promoting collaboration between developers of biomedical language processing technologies. The code base is available as part of the BioNLP UIMA Component Repository on SourceForge.net.
Biomedical scientists need to access figures to validate research facts and to formulate or to test novel research hypotheses. However, figures are difficult to comprehend without associated text (e.g., figure legend and other reference text). We are developing automated systems to extract the relevant explanatory information along with figures extracted from full text articles. Such systems could be very useful in improving figure retrieval and in reducing the workload of biomedical scientists, who otherwise have to retrieve and read the entire full-text journal article to determine which figures are relevant to their research. As a crucial step, we studied the importance of associated text in biomedical figure comprehension.
Twenty subjects evaluated three figure-text combinations: figure+legend, figure+legend+title+abstract, and figure+full-text. Using a Likert scale, each subject scored each figure+text according to the extent to which the subject thought he/she understood the meaning of the figure and the confidence in providing the assigned score. Additionally, each subject entered a free text summary for each figure-text. We identified missing information using indicator words present within the text summaries. Both the Likert scores and the missing information were statistically analyzed for differences among the figure-text types. We also evaluated the quality of the text summaries using the ROUGE score, a text-summarization evaluation measure.
Our results showed statistically significant differences in figure comprehension when varying levels of text were provided. When the full-text article is not available, presenting just the figure+legend left biomedical researchers lacking 39-68% of the information about a figure as compared to having complete figure comprehension; adding the title and abstract improved the situation, but still left biomedical researchers missing 30% of the information. When the full-text article is available, figure comprehension increased to 86-97%; this indicates that researchers felt that only 3-14% of the necessary information for full figure comprehension was missing when full text was available to them. Clearly there is information in the abstract and in the full text that biomedical scientists deem important for understanding the figures that appear in full-text biomedical articles.
We conclude that the texts that appear in full-text biomedical articles are useful for understanding the meaning of a figure, and an effective figure-mining system needs to unlock the information beyond figure legend. Our work provides important guidance to the figure mining systems that extract information only from figure and figure legend.
Background
Over the last decade China has emerged as a major producer of scientific publications, currently ranking second behind the US. During that time Chinese strategic policy initiatives have placed indigenous innovation at the heart of its economy while focusing internal R&D investments and the attraction of foreign investment in nanotechnology as one of their four top areas. China’s scientific research publication and nanotechnology research publication production has reached a rank of second in the world, behind only the US. Despite these impressive gains, some scholars argue that the quality of Chinese nanotech research is inferior to US research quality due to lower overall times cited rates, suggesting that the US is still the world leader. We combine citation analysis, text mining, mapping, and data visualization to gauge the development and application of nanotechnology in China, particularly in biopharmananotechnology, and to measure the impact of Chinese policy on nanotechnology research production.
Results
Our text mining-based methods provide results that counter existing claims about Chinese nanotechnology research quality. Due in large part to its strategic innovation policy, China’s output of nanotechnology publications is on pace to surpass US production in or around 2012. A closer look at Chinese nanotechnology research literature reveals a large increase in research activity in China’s biopharmananotechnology research since the implementation in January 2006 of China’s Medium & Long Term Scientific and Technological Development Plan Guidelines for the period 2006-2020 (“MLP”).
Since the implementation of the MLP, China has enjoyed a great deal of success producing bionano research findings while attracting a great deal of foreign investment from pharmaceutical corporations setting up advanced drug discovery operations. Given the combination of current scientific production growth as well as economic growth, a relatively low scientific capacity, and the ability of its policy to enhance such trends, China is in some sense already the new world leader in nanotechnology. Further, the Chinese national innovation system may be the new standard by which other national S&T policies should be measured.
Polymerase chain reaction (PCR) was a seminal genomic technology discovered, developed, and patented in an industry setting. Since the first of its core patents expired in March, 2005, we are in a position to view the entire lifespan of the patent, examining how the intellectual property rights have impacted its use in the biomedical community. Given its essential role in the world of molecular biology and its commercial success, the technology can serve as a case study for evaluating the effects of patenting biological research tools on biomedical research.
Following its discovery, the technique was subjected to two years of in-house development, during which issues of inventorship and publishing/patenting strategies caused friction between members of the development team. Some have feared that this delay impeded subsequent research and may have been due to trade secrecy or the desire for obtaining lucrative intellectual property rights. However, our analysis of the history indicates that the main reasons for the delay were benign and were primarily due to difficulties in perfecting the PCR technique. Following this initial development period, the technology was made widely available, but was subject to strict licensing terms and patent protection, leading to an extensive litigation history.
PCR has earned approximately $2 billion in royalties for the various rights-holders while also becoming an essential research tool. However, using citation trend analysis, we are able to see that PCR's patented status did not preclude it from being adopted in a similar manner as other non-patented genomic research tools (specifically, pBR322 cloning vector and Maxam-Gilbert sequencing).
Despite the heavy patent protection and rigid licensing schemes, PCR seems to have disseminated so widely because of the practices of the corporate entities which have controlled these patents, namely through the use of business partnerships and broad corporate licensing, adaptive licensing strategies, and a "rational forbearance" from suing researchers for patent infringement. While far from definitive, our analysis seems to suggest that, at least in the case of PCR, patenting of genomic research tools need not impede their dissemination, if the technology is made available through appropriate business practices.
The hypothesis that distance matters but that in recent years geographical proximity has become less important for research collaboration was tested. We have chosen a sample--authors at German immunological institutes--that is relatively homogeneous with regard to research field, language and culture, which besides distance are other possible factors influencing the willingness to co-operate. We analyse yearly distributions of co-authorship links between institutes and compare them with the yearly distributions of distances of all institutes producing papers in journals indexed in the Science Citation Index, editions 1992 to 2002. We weight both types of distributions properly with paper numbers.
One interesting result is that place matters, but if a researcher has to leave the home town to find a collaborator, distance no longer matters. This result holds for all years considered, but is statistically most significant in 2002. The tendency to leave one's own town for collaborators has slightly increased in the sample. In addition, yearly productivity distributions of institutes have been found to be lognormal.
The Internet did not much change the collaboration patterns between German immunological institutes.
This paper documents the cognitive strategies that led to Faraday's first significant scientific discovery. For Faraday, discovery is essentially a matter of seeing as, of substituting for the eye all possess the eye of analysis all scientists must develop. In the process of making his first significant discovery, Faraday learns to dismiss the magnetic attractions and repulsions he and others had observed; by means of systematic variations in his experimental set-up, he learns to see these motions as circular: it is the first indication that an electro-magnetic field exists. In communicating his discoveries, Faraday, of course, takes into consideration his various audiences' varying needs and their differences in scientific competence; but whatever his audience, Faraday learns to convey what it feels like to do science, to shift from seeing to seeing as, from sight to insight.
This paper advances a detailed exploration of the complex relationships among terms, concepts, and synonymy in the UMLS Metathesaurus, and proposes the study and understanding of the Metathesaurus from a model-theoretic perspective. Initial sections provide the background and motivation for such an approach, and a careful informal treatment of these notions is offered as a context and basis for the formal analysis. What emerges from this is a set of puzzles and confusions in the Metathesaurus and its literature pertaining to synonymy and its relation to terms and concepts. A model theory for a segment of the Metathesaurus is then constructed, and its adequacy relative to the informal treatment is demonstrated. Finally, it is shown how this approach clarifies and addresses the puzzles educed from the informal discussion, and how the model-theoretic perspective may be employed to evaluate some fundamental criticisms of the Metathesaurus. For users of the UMLS, two significant results of this analysis are a rigorous clarification of the different senses of synonymy that appear in treatments of the Metathesaurus and an illustration of the dangers in computing inferences involving ambiguous terms.
Most biomedical corpora have not been used outside of the lab that created them, despite the fact that the availability of the gold-standard evaluation data that they provide is one of the rate-limiting factors for the progress of biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring - changing the format of a corpus without altering its semantics - is a feasible goal, namely that it can be accomplished with a semi-automatable process and in a time-efficient way. We used simple text processing methods and limited human validation to convert the Protein Design Group corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.
The refactored corpus is available for download at the BioNLP SourceForge website http://bionlp.sourceforge.net. The total time expended was just over three person-weeks, consisting of about 102 hours of programming time (much of which is one-time development cost) and 20 hours of manual validation of automatic outputs. Additionally, the steps required to refactor any corpus are presented.
We conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.
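As a rough illustration of what refactoring a corpus into embedded XML can involve, the sketch below converts stand-off annotations into inline tags. The input layout (character offsets plus a label), the function name embed_annotations, the tag name, and the example sentence are all assumptions made for illustration; they are not the actual Protein Design Group or WordFreak formats.

```python
from xml.sax.saxutils import escape


def embed_annotations(text, spans):
    """Wrap each (start, end, label) span of `text` in an XML element.

    Hypothetical input format: character offsets with an exclusive end.
    Overlapping spans are rejected to keep the sketch simple.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        # Copy the untagged text before the span, escaping XML metacharacters.
        pieces.append(escape(text[cursor:start]))
        pieces.append(f"<{label}>{escape(text[start:end])}</{label}>")
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)


# Made-up sentence and annotations, for illustration only.
sentence = "RAD51 interacts with BRCA2 in vivo."
annotations = [(0, 5, "protein"), (21, 26, "protein")]
print(embed_annotations(sentence, annotations))
```

Because the conversion leaves the annotated text and labels unchanged and only moves them into a different container, it changes format without altering semantics, which is the sense of refactoring used above.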
The network model of innovation widely adopted among researchers in the economics of science and technology posits relatively porous boundaries between firms and academic research programs and a bi-directional flow of inventions, personnel, and tacit knowledge between sites of university and industry innovation. Moreover, the model suggests that these bi-directional flows should be considered as mutual stimulation of research and invention in both industry and academe, operating as a positive feedback loop. One side of this bi-directional flow – namely, the flow of inventions into industry through the licensing of university-based technologies – has been well studied; but the reverse phenomenon of the stimulation of university research through the absorption of new directions emanating from industry has yet to be investigated in much detail. We discuss the role of federal funding of academic research in the microarray field, and the multiple pathways through which federally supported development of commercial microarray technologies has transformed core academic research fields.
Our study confirms the picture put forward by several scholars that the open character of networked economies is what makes them truly innovative. In an open system innovations emerge from the network. The emergence and diffusion of microarray technologies we have traced here provides an excellent example of an open system of innovation in action. Whether they originated in a startup company environment that operated like a think-tank, such as Affymax, the research labs of a large firm, such as Agilent, or within a research university, the inventors we have followed drew heavily on knowledge resources from all parts of the network in bringing microarray platforms to light.
Federal funding for high-tech startups and new industrial development was important at several phases in the early history of microarrays, and federal funding of academic researchers using microarrays was fundamental to transforming the research agendas of several fields within academe. The typical story told about the role of federal funding emphasizes the spillovers from federally funded academic research to industry. Our study shows that the knowledge spillovers worked both ways, with federal funding of non-university research providing the impetus for reshaping the research agendas of several academic fields.
Biological organisms and their components are better conceived within categories based on similarity rather than on identity. Biologists routinely operate with similarity-based concepts such as "model organism" and "motif." There has been little exploration of the characteristics of the similarity-based categories that exist in biology. This study uses the case of the discovery and classification of zinc finger proteins to explore how biological categories based in similarity are represented.
The existence of a category of "zinc finger proteins" was based in 1) a lumpy gradient of similarity, 2) a link between function and structure, 3) establishment of a range of appearance across systems and organisms, and 4) an evolutionary locus as a historically based common-ground.
More systematic application of the idea of similarity-based categorization might eliminate the assumption that biological characteristics can only contribute to narrow categorization of humans. It also raises possibilities for refining data-driven exploration efforts.
It is possible to find in the medical literature many articles that have been neglected or ignored, in some cases for many years, but which are worth bringing to light because they report unusual findings that may be of current scientific interest. Resurrecting previously published but neglected hypotheses that have merit might be overlooked because it would seem to lack the novelty of "discovery" -- but the potential value of so doing is hardly arguable. Finding neglected hypotheses may be not only of great practical value, but also affords the opportunity to study the structure of such hypotheses in the hope of illuminating the more general problem of hypothesis generation.
Discovery, as a public attribution, and discovering, the act of conducting research, are experiences that entail "languaging" the unknown. This distinguishing property of language - its ability to bring forth, out of the unspoken realm, new knowledge, original ideas, and novel thinking - is essential to the discovery process. In sharing their ideas and views, scientists create co-negotiated linguistic distinctions that prompt the revision of established mental maps and the adoption of new ones. While scientific mastery entails command of the conversational domain unique to a specific discipline, there is an emerging conversational domain that must be mastered that goes beyond the language unique to any particular specialty. Mastery of this new conversational domain gives researchers access to their hidden mental maps that limit their ways of thinking about and doing science. The most effective scientists use language to recontextualize their approach to problem-solving, which triggers new insights (previously unavailable) that result in new discoveries. While language is not a replacement for intuition and other means of knowing, when we try to understand what's outside of language we have to use language to do so.
In 1970, Karl Pribram took on the immense challenge of asking the question, what makes us human? Nearly four decades later, the most significant finding has been the undeniable realization of how incredibly subtle and fine-scaled the unique biological features of our species must be. The recent explosion in the availability of large-scale sequence data, however, and the consequent emergence of comparative genomics, are rapidly transforming the study of human evolution. The field of comparative genomics is allowing us to reach unparalleled resolution, reframing our questions in reference to DNA sequence--the very unit that evolution operates on. But like any reductionist approach, it comes at a price. Comparative genomics may provide the necessary resolution for identifying rare DNA sequence differences in a vast sea of conservation, but ultimately we will have to face the challenge of figuring out how DNA sequence divergence translates into phenotypic divergence. Our goal here is to provide a brief outline of the major findings made in the study of human brain evolution since the Pribram lecture, focusing specifically on the field of comparative genomics. We then discuss the broader implications of these findings and the future challenges that are in store.
The MEDLINE database of medical literature is routinely used by researchers and doctors to find articles pertaining to their area of interest. Insight into historical changes in research areas may be gained by chronological analysis of the 18 million records currently in the database; however, such analysis is generally complex and time consuming. The authors' MLTrends web application graphs term usage in MEDLINE over time, allowing the determination of emergence dates for biomedical terms and historical variations in term usage intensity. MLTrends may be used at: http://www.ogic.ca/mltrends.
Scientific and popular lore have promulgated a connection between emotion and the limbic forebrain. However, there are a variety of structures that are considered limbic, and disagreement as to what is meant by "emotion". This essay traces the initial studies upon which the connection between emotion and the limbic forebrain was based and how subsequent experimental evidence led to confusion both with regard to brain systems and to the behaviors examined. In the process of sorting out the bases of the confusion the following rough outlines are sketched: 1) Motivation and emotion need to be distinguished. 2) Motivation and emotion are processed by the basal ganglia; motivation by the striatum and related structures, emotion by limbic basal ganglia: the amygdala and related structures. 3) The striatum processes activation of readiness, both behavioral and perceptual; the amygdala processes arousal, an intensive dimension that varies from interest to panic. 4) Activation of readiness deals with "what to do?" Arousal deals with novelty, with "what is it?" 5) Thus both motivation and emotion are the proactive aspects of representations, of memory: motivation, an activation of readiness; emotion, a processing of novelty, a departure from the familiar. 6) The hippocampal-cingulate circuit deals with efficiently relating emotion and motivation by establishing dispositions, attitudes. 7) The prefrontal cortex fine-tunes motivation, emotion and attitude when choices among complex or ambiguous circumstances are made.
Collaborative efforts of physicians and basic scientists are often necessary in the investigation of complex disorders. Difficulties can arise, however, when large amounts of information need to be reviewed. Advanced information retrieval can be beneficial in combining and reviewing data obtained from the various scientific fields. In this paper, a team of investigators with varying backgrounds has applied advanced information retrieval methods, in the form of text mining and entity relationship tools, to review the current literature, with the intention of generating new insights into the molecular mechanisms underlying a complex disorder. Complex Regional Pain Syndrome (CRPS) was chosen as an example of such a disorder. CRPS is a painful and debilitating syndrome with a complex etiology that remains unexplained to a considerable extent, resulting in suboptimal diagnosis and treatment.
A text mining based approach combined with a simple network analysis identified Nuclear Factor kappa B (NFkappaB) as a possible central mediator in both the initiation and progression of CRPS.
The result shows the added value of a multidisciplinary approach combined with information retrieval in hypothesis discovery in biomedical research. The new hypothesis, which was derived in silico, provides a framework for further mechanistic studies into the underlying molecular mechanisms of CRPS and requires evaluation in clinical and epidemiological studies.
We propose NEMO, a system for extracting organization names in the affiliation and normalizing them to a canonical organization name. Our parsing process involves multi-layered rule matching with multiple dictionaries. The system achieves more than 98% f-score in extracting organization names. Our normalization process involves clustering based on local sequence alignment metrics and local learning based on finding connected components. High precision was also observed in normalization. NEMO is the missing link in associating each biomedical paper and its authors with an organization name in its canonical form and the geopolitical location of the organization. This research could potentially help in analyzing large social networks of organizations for landscaping a particular topic, improving performance of author disambiguation, adding weak links in the co-author network of authors, augmenting NLM's MARS system for correcting errors in OCR output of the affiliation field, and automatically indexing the PubMed citations with the normalized organization name and country. Our system is available as a graphical user interface for download along with this paper.
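The normalization idea described above, grouping name variants by local sequence alignment and connected components, can be sketched as follows. This is a minimal illustration and not NEMO's implementation: the Smith-Waterman scoring parameters, the normalization by the shorter string, the 0.8 threshold, and the example names are all assumptions chosen for the sketch.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between two strings."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a, b, match=2):
    """Alignment score scaled to [0, 1] by the best score the shorter string allows."""
    a, b = a.lower(), b.lower()
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return local_alignment_score(a, b, match=match) / (match * shortest)


def cluster_names(names, threshold=0.8):
    """Group names into connected components of the 'similar enough' graph."""
    parent = list(range(len(names)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())


# Made-up affiliation variants, for illustration only.
names = [
    "University of Illinois at Chicago",
    "Univ. of Illinois at Chicago",
    "Stanford University",
    "Stanford Univ",
]
for group in cluster_names(names):
    print(group)
```

Scaling by the shorter string means a very short name can match many longer ones, so a real system would likely need additional constraints; the pairwise comparison is also quadratic in the number of names.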
One of the most pressing and timely scientific questions concerns the evolution of man. In 1970, Karl Pribram delivered the James Arthur Lecture at the American Museum of Natural History in New York City. His lecture, "What Makes Man Human," was one of the most eloquent and brilliant syntheses of this problem ever made. The Journal is proud to publish this Lecture for the first time in an open access format that will make its insights available widely to a new generation of students and investigators. Accompanying the lecture is a new commentary written by Prof. Pribram, and four additional commentaries from prominent investigators who were invited to consider the question from their own perspectives. Together, these articles provide a scholarly, yet accessible, snapshot of different approaches to the study of human evolution in 2006.
In the present paper, we have created and characterized several similarity metrics for relating any two Medical Subject Headings (MeSH terms) to each other. The article-based metric measures the tendency of two MeSH terms to appear in the MEDLINE record of the same article. The author-based metric measures the tendency of two MeSH terms to appear in the body of articles written by the same individual (using the 2009 Author-ity author name disambiguation dataset as a gold standard). The two metrics are only modestly correlated with each other (r = 0.50), indicating that they capture different aspects of term usage. The article-based metric provides a measure of semantic relatedness, and MeSH term pairs that co-occur more often than expected by chance may reflect relations between the two terms. In contrast, the author metric is indicative of how individuals practice science, and may have value for author name disambiguation and studies of scientific discovery. We have calculated article metrics for all MeSH terms appearing in at least 25 articles in MEDLINE (as of 2014) and author metrics for MeSH terms published as of 2009. The dataset is freely available for download and can be queried at http://arrowsmith.psych.uic.edu/arrowsmith_uic/mesh_pair_metrics.html. Handling editor: Elizabeth Workman, MLIS, PhD.
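One simple way to see what "co-occur more often than expected by chance" can mean for an article-based metric is to compare the observed number of articles indexed with both terms against the number expected if the terms were assigned independently. The sketch below does this on a toy collection. The ratio used here is an assumption for illustration and is not necessarily the formula behind the published dataset; the records are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratios(records):
    """Return {(term_a, term_b): observed / expected} for pairs that co-occur.

    `records` is a list of sets of terms, one set per article.
    The expected count assumes the two terms are assigned independently.
    """
    n_articles = len(records)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in records:
        term_counts.update(terms)
        # Sort so each pair has one canonical (alphabetical) key.
        pair_counts.update(combinations(sorted(terms), 2))
    ratios = {}
    for (a, b), observed in pair_counts.items():
        expected = term_counts[a] * term_counts[b] / n_articles
        ratios[(a, b)] = observed / expected
    return ratios


# Toy records, invented for illustration; not drawn from MEDLINE.
records = [
    {"Malaria", "Antimalarials"},
    {"Malaria", "Antimalarials", "Child"},
    {"Neoplasms", "Child"},
    {"Neoplasms", "Nanotechnology"},
]
ratios = cooccurrence_ratios(records)
print(ratios[("Antimalarials", "Malaria")])
print(ratios[("Child", "Malaria")])
```

A ratio above 1 indicates a pair that appears together more often than independence would predict. An author-based analogue could be obtained by pooling the terms of all articles by one author into a single set before counting.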
Background Malaria remains a public health challenge in sub-Saharan Africa, causing high mortality and morbidity. This study looks at the prevalence and management outcomes of malaria cases, by age category, among residents of Agbani Road, Enugu State, who presented at a private secondary hospital. Methodology This is a hospital-based, cross-sectional retrospective study of patients who presented with fever from January 2017 to December 2020. Malaria diagnosis was made with WHO-recommended RDT test kits and microscopy. A structured questionnaire was used to collect information on Long Lasting Insecticidal Nets (LLINs) use from all the patients who patronize the hospital. All malaria-positive cases were treated according to Malaria Treatment Guidelines. Data were analyzed with SPSS version 22. Results Out of 678 febrile patients, 65% (443/678) strongly tested positive for malaria (P-value = 0.00135). 50% (221/443) of malaria cases, 68% (70/103) of admitted cases, 78% (31/40) of complicated cases and 91% (10/11) of deaths occurred among those < 10 years old and pregnant mothers. An 89% (396/443) successful treatment rate was achieved (P = 0.000), and 83% (39/47) of resistant malaria cases responded to antibiotics (P-value 0.0001). 9% (40/443) of the treated cases and only 1% (7/73) of cases who use LLINs returned with malaria within 28 days. Conclusion Malaria prevalence and outcomes are worse among those < 10 years old and pregnant mothers, and some anti-malaria resistant cases responded to antibiotics. LLINs use reduces the prevalence of malaria re-infection in this setting. Keywords: Malaria; Prevalence; Management outcomes; Agbani Road; Enugu State; Nigeria.
The ability to locate publicly available gene expression microarray datasets effectively and efficiently facilitates the reuse of these potentially valuable resources. Centralized biomedical databases allow users to query dataset metadata descriptions, but these annotations are often too sparse and diverse to allow complex and accurate queries. In this study we examined the ability of PubMed article identifiers to locate publicly available gene expression microarray datasets, and investigated whether the retrieved datasets were representative of publicly available datasets found through statements of data sharing in the associated research articles.
In a recent article, Ochsner and colleagues identified 397 studies that had generated gene expression microarray data. Their search of the full text of each publication for statements of data sharing revealed 203 publicly available datasets, including 179 in the Gene Expression Omnibus (GEO) or ArrayExpress databases. Our scripted search of GEO and ArrayExpress for PubMed identifiers of the same 397 studies returned 160 datasets, including six not found by the original search for data sharing statements. As a proportion of datasets found by either method, the search for data sharing statements identified 91.4% of the 209 publicly available datasets, compared to 76.6% found by our search for PubMed identifiers. Searching GEO or ArrayExpress alone retrieved 63.2% and 46.9% of all available datasets, respectively. Studies retrieved through PubMed identifiers were representative of all datasets in terms of research theme, technology, size, and impact, though the recall was highest for datasets published by the highest-impact journals.
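The recall figures above are proportions of the union of datasets found by either method. A minimal sketch of that calculation follows; the function name and the accession identifiers are hypothetical and chosen for illustration only, and they are not the study's data or its actual script.

```python
def recall_against_union(found_by_method):
    """Return each method's recall relative to the union of all methods' hits.

    found_by_method maps a method name to the set of dataset identifiers
    that the method retrieved.
    """
    union = set().union(*found_by_method.values())
    if not union:
        raise ValueError("no datasets were found by any method")
    return {name: len(hits) / len(union) for name, hits in found_by_method.items()}


# Hypothetical dataset identifiers (assumption: not taken from the study).
found = {
    "sharing_statements": {"DS1", "DS2", "DS3", "DS4"},
    "pubmed_id_search": {"DS2", "DS3", "DS5"},
}

recall = recall_against_union(found)
for method, value in recall.items():
    print(f"{method}: {value:.1%} of {len(set().union(*found.values()))} datasets")
```

Using the union as the denominator means neither method is treated as a gold standard: a dataset found only by the PubMed identifier search still counts against the recall of the data sharing statement search, and vice versa.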
Searching database entries using PubMed identifiers can identify the majority of publicly available datasets. We urge authors of all datasets to complete the citation fields for their dataset submissions once publication details are known, thereby ensuring their work has maximum visibility and can contribute to subsequent studies.
The Journal of Biomedical Discovery and Collaboration was created to provide, for the first time, a unified forum to consider all factors that affect scientific practice and scientific discovery – with an emphasis on the changing face of contemporary biomedical science. In this endeavor we are bringing together three different groups of scholars: a) laboratory investigators, who make the discoveries that are the currency of the scientific enterprise; b) computer science and informatics investigators, who devise tools for data analysis, mining, visualization and integration; and c) social scientists, including sociologists, historians, and philosophers, who study scientific practice, collaboration, and information needs. We will publish original research articles, case studies, focus pieces, reviews, and software articles. All articles in the Journal of Biomedical Discovery and Collaboration will be peer reviewed, published immediately upon acceptance, freely available online via open access, and archived in PubMed Central and other international full-text repositories.