Conference Paper

The Development of a Schema for the Annotation of Terms in the Biocaster Disease Detecting/Tracking System.

Conference: KR-MED 2006, Formal Biomedical Knowledge Representation, Proceedings of the Second International Workshop on Formal Biomedical Knowledge Representation: "Biomedical Ontology in Action" (KR-MED 2006), Collocated with the 4th International Conference on Formal Ontology in Information Systems (FOIS-2006), Baltimore, Maryland, USA, November 8, 2006
Source: DBLP

Available from: Roberto A Barrero,
    • "The terminology relevant to this domain spans several concept classes including: microorganisms, genes and proteins, and several concept classes from the Gene Ontology, notably cellular components, biological processes, and molecular function. In terms of potential pathogens, there has been some research on disease recognition [20], [21], [22], [23], [24], [25]. There has been very substantial research on recognition of genes and proteins, through several community evaluations, such as JNLPBA-2004 [26] and BioCreative [27], [28]. "
    ABSTRACT: Research on specialized biological systems is often hampered by a lack of consistent terminology, especially across species. In bacterial Type IV secretion systems, genes within one set of orthologs may have over a dozen different names. Classifying research publications based on biological processes, cellular components, molecular functions, and microorganism species should improve the precision and recall of literature searches, allowing researchers to keep up with the exponentially growing literature through resources such as the Pathosystems Resource Integration Center (PATRIC). We developed named entity recognition (NER) tools for four entities related to Type IV secretion systems: 1) bacteria names, 2) biological processes, 3) molecular functions, and 4) cellular components. These four entities are important to pathogenesis and virulence research but have received less attention than other entities, e.g., genes and proteins. Based on an annotated corpus, large domain terminological resources, and machine learning techniques, we developed recognizers for these entities. High accuracy rates (>80%) are achieved for bacteria, biological processes, and molecular function. Contrastive experiments highlighted the effectiveness of alternate recognition strategies; results of term extraction on contrasting document sets demonstrated the utility of these classes for identifying T4SS-related documents.
    PLoS ONE 03/2011; 6(3):e14780. DOI:10.1371/journal.pone.0014780 · 3.23 Impact Factor
    • "The spatial attribute of the event-predicate can be selected from any expression considered to be a location entity according to the BioCaster named entity annotation specification [40]. In the BioCaster project, the location entity is the expression that absolutely refers to the politically or geographically defined location at any granularity. "
    ABSTRACT: Current public concern over the spread of infectious diseases has underscored the importance of health surveillance systems for the speedy detection of disease outbreaks. Several international report-based monitoring systems have been developed, including GPHIN, Argus, HealthMap, and BioCaster. A vital feature of these report-based systems is the geo-temporal encoding of outbreak-related textual data. Until now, automated systems have tended to use an ad-hoc strategy for processing geo-temporal information, normally involving the detection of locations that match pre-determined criteria, and the use of document publication dates as a proxy for disease event dates. Although these strategies appear to be effective enough for reporting events at the country and province levels, they may be less effective at discovering geo-temporal information at more detailed levels of granularity. In order to improve the capabilities of current Web-based health surveillance systems, we introduce the design for a novel scheme called spatiotemporal zoning. The proposed scheme classifies news articles into zones according to the spatiotemporal characteristics of their content. In order to study the reliability of the annotation scheme, we analyzed the inter-annotator agreements on a group of human annotators for over 1000 reported events. Qualitative and quantitative evaluation is made on the results including the kappa and percentage agreement. The reliability evaluation of our scheme yielded very promising inter-annotator agreement, more than a 0.9 kappa and a 0.9 percentage agreement for event type annotation and temporal attributes annotation, respectively, with a slight degradation for the spatial attribute. However, for events indicating an outbreak situation, the annotators usually had inter-annotator agreements with the lowest granularity location. We developed and evaluated a novel spatiotemporal zoning annotation scheme. The results of the scheme evaluation indicate that our annotated corpus and the proposed annotation scheme are reliable and could be effectively used for developing an automatic system. Given the current advances in natural language processing techniques, including the availability of language resources and tools, we believe that a reliable automatic spatiotemporal zoning system can be achieved. In the next stage of this work, we plan to develop an automatic zoning system and evaluate its usability within an operational health surveillance system.
    BMC Medical Informatics and Decision Making 01/2010; 10(1):1. DOI:10.1186/1472-6947-10-1 · 1.83 Impact Factor
    • "There have been a number of approaches to named entity recognition and more generally to information extraction problems (see e.g. [12,13] or [14] for a named entity recognition system being used in biosurveillance), exploiting as we do, syntactic and contextual information. They however usually rely on supervised approaches, which require heavily annotated datasets to account for the human experience. "
    ABSTRACT: Automated surveillance of the Internet provides a timely and sensitive method for alerting on global emerging infectious disease threats. HealthMap is part of a new generation of online systems designed to monitor and visualize, on a real-time basis, disease outbreak alerts as reported by online news media and public health sources. HealthMap is of specific interest for national and international public health organizations and international travelers. A particular task that makes such a surveillance useful is the automated discovery of the geographic references contained in the retrieved outbreak alerts. This task is sometimes referred to as "geo-parsing". A typical approach to geo-parsing would demand an expensive training corpus of alerts manually tagged by a human. Given that human readers perform this kind of task by using both their lexical and contextual knowledge, we developed an approach which relies on a relatively small expert-built gazetteer, thus limiting the need of human input, but focuses on learning the context in which geographic references appear. We show in a set of experiments, that this approach exhibits a substantial capacity to discover geographic locations outside of its initial lexicon. The results of this analysis provide a framework for future automated global surveillance efforts that reduce manual input and improve timeliness of reporting.
    BMC Bioinformatics 11/2009; 10(1):385. DOI:10.1186/1471-2105-10-385 · 2.58 Impact Factor
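
The spatiotemporal zoning abstract above reports inter-annotator reliability as kappa and percentage agreement. As a rough illustration of how these two figures relate, here is a minimal sketch that assumes two annotators and Cohen's kappa; the abstract does not say which kappa variant was used, and the function names and the "outbreak"/"other" labels are hypothetical, chosen only for this example.

```python
from collections import Counter


def percentage_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Assumption: two annotators and nominal labels. This is one common
    kappa variant, not necessarily the one used in the cited study.
    """
    n = len(labels_a)
    p_observed = percentage_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently with their own label frequencies.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if p_expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical event-type labels from two annotators for four events.
annotator_1 = ["outbreak", "outbreak", "other", "other"]
annotator_2 = ["outbreak", "other", "other", "other"]

print(percentage_agreement(annotator_1, annotator_2))  # 0.75
print(cohens_kappa(annotator_1, annotator_2))          # 0.5
```

In this toy case the annotators agree on three of four events (0.75), but half of that agreement is expected by chance given their label frequencies, so kappa is 0.5. This is why a kappa above 0.9, as reported in the abstract, is a stronger claim than a percentage agreement above 0.9.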