ABSTRACT: We report on the recent development of ParZu, a German dependency parser. We discuss the effect of POS tagging and morphological analysis on parsing performance, and present novel ways of improving performance of the components, including the use of morphological features for POS-tagging, the use of syntactic information to select good POS sequences from an n-best list, and using parsed text as training data for POS tagging and statistical parsing. We also describe our efforts towards reducing the dependency on restrictively licensed and closed-source NLP resources.
Proceedings of the International Conference on Recent Advances in Natural Language Processing; 09/2013
ABSTRACT: In this article, we describe the architecture of the OntoGene Relation mining pipeline and its application in the triage task of BioCreative 2012. The aim of the task is to support the triage of abstracts relevant to the process of curation of the Comparative Toxicogenomics Database. We use a conventional information retrieval system (Lucene) to provide a baseline ranking, which we then combine with information provided by our relation mining system, in order to achieve an optimized ranking. Our approach additionally delivers domain entities mentioned in each input document as well as candidate relationships, both ranked according to a confidence score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation. Thanks, in particular, to the high-quality entity recognition, the OntoGene system achieved the best overall results in the task.
Database The Journal of Biological Databases and Curation 01/2013; 2013:bas053. DOI:10.1093/database/bas053 · 4.46 Impact Factor
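The abstract above says that a baseline ranking is combined with information from the relation mining system, but it does not say how. The following is a minimal sketch of one possible combination, a weighted sum of min-max normalized scores. The function names, the `weight` parameter and the example scores are all assumptions made for illustration, not the authors' method.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine_rankings(ir_scores, mining_scores, weight=0.5):
    """Rank documents by a weighted sum of two normalized scores.

    ir_scores:     baseline retrieval scores (hypothetical values)
    mining_scores: relation-mining confidence scores (hypothetical values)
    weight:        share given to the mining score (assumed parameter)
    """
    ir = min_max(ir_scores)
    mining = min_max(mining_scores) if mining_scores else {}
    combined = {
        doc: (1 - weight) * ir[doc] + weight * mining.get(doc, 0.0)
        for doc in ir
    }
    # Highest combined score first; ties are broken by document id.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


ir_scores = {"a": 3.0, "b": 2.0, "c": 1.0}
mining_scores = {"a": 0.0, "b": 0.2, "c": 1.0}
ranking = combine_rankings(ir_scores, mining_scores, weight=0.5)
```

With `weight=0.0` the result is the baseline order, so the parameter controls how far the mining evidence is allowed to move documents up or down.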
ABSTRACT: We give an overview of our approach to the extraction of interactions between pharmacogenomic entities like drugs, genes and diseases and suggest classes of interaction types driven by data from PharmGKB and partly following the top level ontology WordNet and biomedical types from BioNLP. Our text mining approach to the extraction of interactions is based on syntactic analysis. We use syntactic analyses to explore domain events and to suggest a set of interaction labels for the pharmacogenomics domain.
ABSTRACT: The mutual interactions among genes, diseases, and drugs are at the heart of biomedical research, and are especially important for the pharmacological industry. The recent trend towards personalized medicine makes it increasingly relevant to be able to tailor drugs to specific genetic makeups. The pharmacogenetics and pharmacogenomics knowledge base (PharmGKB) aims at capturing relevant information about such interactions from several sources, including curation of the biomedical literature. Advanced text mining tools which can support the process of manual curation are increasingly necessary in order to cope with the deluge of new published results. However, effective evaluation of those tools requires the availability of manually curated data as gold standard. In this paper we discuss how the existing PharmGKB database can be used for such an evaluation task in a way similar to the usage of gold standard data derived from protein-protein interaction databases in one of the recent BioCreative shared tasks. Additionally, we present our own considerations and results on the feasibility and difficulty of such a task.
ABSTRACT: Determining the usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose, the BCIII-ACT corpus was provided, which includes training, development and test sets of over 12,000 PPI-relevant and non-relevant PubMed abstracts, labeled manually by domain experts, together with the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.
A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. Of the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89%, and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manual annotations made by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, and the best MCC score was 0.55. For competitive systems with an acceptable recall (above 35%), the macro-averaged precision ranged between 50% and 80%, with a maximum F-score of 55%.
The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
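The abstract above reports an MCC of 0.55 alongside an accuracy of 89% on an unbalanced collection. The sketch below shows how the two measures are computed from confusion-matrix counts and why they can diverge under class imbalance. The counts are invented for illustration and are not taken from the BioCreative III data.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Fraction of all decisions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator vanishes (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical unbalanced test set: 100 relevant and 900 non-relevant articles.
counts = dict(tp=50, fp=50, tn=850, fn=50)
acc = accuracy(**counts)   # 0.9
score = mcc(**counts)      # 4/9, roughly 0.44
```

In this made-up example the classifier finds only half of the relevant articles, yet accuracy stays at 90% because the non-relevant class dominates. MCC takes all four cells of the confusion matrix into account and is correspondingly lower.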
ABSTRACT: This article describes the approaches taken by the OntoGene group at the University of Zurich in dealing with two tasks of the BioCreative III competition: classification of articles which contain curatable protein-protein interactions (PPI-ACT) and extraction of experimental methods (PPI-IMT).
Two main achievements are described in this paper: (a) a system for document classification which crucially relies on the results of an advanced pipeline of natural language processing tools; (b) a system which is capable of detecting all experimental methods mentioned in scientific literature, and listing them with a competitive ranking (AUC iP/R > 0.5).
The results of the BioCreative III shared evaluation clearly demonstrate that significant progress has been achieved in the domain of biomedical text mining in the past few years. Our own contribution, together with the results of other participants, provides evidence that natural language processing techniques have by now become an integral part of advanced text mining approaches.
ABSTRACT: We introduce our incremental coreference resolution system for the BioNLP 2011 Shared Task on Protein/Gene interaction. The benefits of an incremental architecture over a mention-pair model are: a reduction of the number of candidate pairs, a means to overcome the problem of underspecified items in pair-wise classification and the natural integration of global constraints such as transitivity. A filtering system takes into account specific features of different anaphora types. We do not apply machine learning; instead, the system classifies with an empirically derived salience measure based on the dependency labels of the true mentions. The OntoGene pipeline is used for preprocessing.
Proceedings of the BioNLP Shared Task 2011 Workshop; 06/2011
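The incremental architecture described in the abstract above can be illustrated with a small sketch. Mentions are processed left to right, and an anaphor is compared against the entities built so far rather than against every earlier mention. The salience weights, the class names and the `compatible` filter below are hypothetical placeholders; the abstract does not give the actual values or filters used by the system.

```python
from dataclasses import dataclass, field

# Hypothetical salience weights per dependency label (illustrative only).
SALIENCE = {"subj": 3.0, "obj": 2.0, "pobj": 1.0}
DEFAULT_SALIENCE = 0.5


@dataclass
class Mention:
    text: str
    dep_label: str
    is_anaphor: bool = False


@dataclass
class Entity:
    mentions: list = field(default_factory=list)

    def salience(self):
        # An entity is as salient as its most salient mention.
        return max(SALIENCE.get(m.dep_label, DEFAULT_SALIENCE)
                   for m in self.mentions)


def resolve(mentions, compatible=lambda anaphor, entity: True):
    """Attach each anaphor to the most salient compatible entity so far."""
    entities = []
    for m in mentions:
        candidates = ([e for e in entities if compatible(m, e)]
                      if m.is_anaphor else [])
        if candidates:
            best = max(candidates, key=lambda e: e.salience())
            best.mentions.append(m)
        else:
            # Non-anaphoric mentions (or unresolved anaphors) open a new entity.
            entities.append(Entity([m]))
    return entities


doc = [
    Mention("protein A", "subj"),
    Mention("gene B", "obj"),
    Mention("it", "subj", is_anaphor=True),
]
entities = resolve(doc)
chains = [[m.text for m in e.mentions] for e in entities]
```

Because an anaphor is compared with whole entities, the number of candidates grows with the number of entities rather than with the number of earlier mentions, and every mention ends up in exactly one chain, so transitivity holds by construction.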
ABSTRACT: Syntactic alternations like the dative shift are well researched. But most decisions which speakers take are more complex than binary choices. Multifactorial lexicogrammatical approaches and a large inventory of syntactic patterns are needed to supplement current approaches. We use the term semantic alternation for the many ways in which a relation between entities, conveying broadly the same meaning, can be expressed. We use a well-resourced domain, biomedical research texts, for a corpus-driven approach. As entities we use proteins, and as relations we use interactions between them, using Text Mining training data. We discuss three approaches: first, manually designed syntactic patterns; second, a corpus-based semi-automatic approach; and third, a machine-learning language model. The machine-learning approach learns the probability that a syntactic configuration expresses a relevant interaction from an annotated corpus. The inventory of configurations defines the envelope of variation and its multitude of forms.
ABSTRACT: In this paper, we present a declarative formalism for writing rule sets to convert constituent trees into dependency graphs. The formalism is designed to be independent of the annotation scheme and provides a highly task-related syntax, abstracting away from the underlying graph data structures. We have implemented the formalism in our search tool and used a preliminary version to create a rule set that converts more than 97% of the TIGER corpus.
ABSTRACT: We present ODIN (Ontogene Document INspector): a system for interactive curation of biomedical literature, developed within the scope of the SASEBio project (Semi-Automated Semantic Enrichment of the Biomedical Literature), as a collaboration between the OntoGene group at the University of Zurich and the NITAS/TMS group of Novartis Pharma AG. The purpose of the system is to allow a human annotator/curator to leverage the results of an advanced text mining system in order to enhance the speed and effectiveness of the annotation process.
The OntoGene system takes as input a document (e.g. a full paper from PubMed Central) and processes it with a custom NLP pipeline, which includes Named Entity recognition and relation extraction. Entities which are currently supported include proteins, genes, experimental methods, cell lines, and species. Entities detected in the input document are disambiguated with respect to a reference database (UniProt, EntrezGene, NCBI taxonomy, PSI-MI ontology). The annotated documents are handed back to the ODIN interface, which allows multiple display modalities. The curator/annotator can view the whole document with in-line annotations highlighted, or can browse the extracted entities and be pointed back to the mentions of the entities within the original document. All entity mentions are entirely editable: the curator can easily add or delete any of them, and also change their extent (i.e. add/remove words to their right or left) with a simple click of the mouse. Different entity views are supported, with sorting capabilities according to different criteria (entity type, entity mention, confidence score, etc.). Selective highlighting of text units (e.g. sentences containing desired entities) is supported. Additionally, extensive logging functionalities are provided. All documents and entities are fully interlinked to reference databases, for the purpose of simplified inspection. Entities can be grouped in classes (e.g. by species) and actions can be applied to whole classes, for selective editing or removal.
ABSTRACT: A common framework under which the various studies on terminology processing can be viewed is to consider not only the texts from which the terminological resources are built but particularly the applications targeted. The current book, first published as a Special Issue of Terminology 11:1 (2005), analyses the influence of applications on term definition and processing. Two types of applications have been identified: intermediary and terminal applications (involving end users). Intermediary applications concern the building of terminological knowledge resources such as domain-specific dictionaries, ontologies, thesauri or taxonomies. These knowledge resources then form the inputs to terminal applications such as information extraction, information retrieval, science and technology watch or automated book index building. Most of the applications dealt with in the book fall into the first category. This book represents the first attempt, from a pluridisciplinary viewpoint, to take into account the role of applications in the processing of terminology.
ABSTRACT: We describe a system for the detection of mentions of protein-protein interactions in the biomedical scientific literature. The original system was developed as a part of the OntoGene project, which focuses on using advanced computational linguistic techniques for text mining applications in the biomedical domain. In this paper, we focus in particular on the participation in the BioCreative II.5 challenge, where the OntoGene system achieved best-ranked results. Additionally, we describe a feature-analysis experiment performed after the challenge, which shows the unexpected result that one single feature alone performs better than the combination of features used in the challenge.
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 07/2010; 7(3):472-80. DOI:10.1109/TCBB.2010.50 · 1.54 Impact Factor
ABSTRACT: The detection of mentions of protein-protein interactions in the scientific literature has recently emerged as a core task in biomedical text mining. We present effective techniques for this task, which have been developed using the IntAct database as a gold standard, and have been evaluated in two text mining competitions.
ABSTRACT: We present an approach towards the automatic detection of names of proteins, genes, species, etc. in biomedical literature and their grounding to widely accepted identifiers. The annotation is based on a large term list that contains the common expression of the terms, a normalization step that matches the terms with their actual representation in the texts, and a disambiguation step that resolves the ambiguity of matched terms. We describe various characteristics of the terms found in existing term resources and of the terms that are used in biomedical texts. We evaluate our results against a corpus of manually annotated protein mentions and achieve a precision of 57% and recall of 72%.
Proceedings of the 12th Conference on Artificial Intelligence in Medicine: Artificial Intelligence in Medicine; 07/2009
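The term-list lookup with a normalization step, as outlined in the abstract above, might be sketched as follows. The normalization rule (lowercasing and dropping spaces, hyphens, underscores and slashes), the term list and the identifiers are assumptions chosen for illustration; the abstract does not specify the actual rules or resources. A helper for the balanced F-score is included so that precision and recall can be summarized in one number.

```python
import re

# Hypothetical term list: surface form -> candidate identifiers (placeholders).
TERM_LIST = {
    "IL-2": ["ID:0001"],
    "interleukin 2": ["ID:0001"],
    "CAT": ["ID:0002", "ID:0003"],  # an ambiguous term with two candidates
}


def normalize(term):
    """Assumed normalization: lowercase, drop spaces, hyphens, '_' and '/'."""
    return re.sub(r"[\s\-_/]+", "", term.lower())


# Index the term list by normalized form so that spelling variants match.
INDEX = {}
for surface, ids in TERM_LIST.items():
    INDEX.setdefault(normalize(surface), set()).update(ids)


def lookup(mention):
    """Return candidate identifiers; more than one means disambiguation is needed."""
    return sorted(INDEX.get(normalize(mention), ()))


def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


variants = [lookup("il 2"), lookup("Interleukin-2")]
ambiguous = lookup("cat")
f1 = f_score(0.57, 0.72)
```

Applied to the reported precision of 57% and recall of 72%, `f_score` gives roughly 0.64. That figure is derived here and is not stated in the abstract.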