Gerold Schneider

University of Zurich, Zürich, ZH, Switzerland

Publications (59) · 28.48 Total impact

  • Conference Paper: UZH in BioNLP 2013
    Proceedings of the BioNLP Shared Task 2013 Workshop; 08/2013
  • ABSTRACT: In this article, we describe the architecture of the OntoGene Relation mining pipeline and its application in the triage task of BioCreative 2012. The aim of the task is to support the triage of abstracts relevant to the process of curation of the Comparative Toxicogenomics Database. We use a conventional information retrieval system (Lucene) to provide a baseline ranking, which we then combine with information provided by our relation mining system, in order to achieve an optimized ranking. Our approach additionally delivers domain entities mentioned in each input document as well as candidate relationships, both ranked according to a confidence score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation. Thanks, in particular, to the high-quality entity recognition, the OntoGene system achieved the best overall results in the task.
    Database: The Journal of Biological Databases and Curation 01/2013; 2013:bas053. DOI:10.1093/database/bas053 · 4.46 Impact Factor
  • Rico Sennrich, Martin Volk, Gerold Schneider
    Proceedings of the International Conference on Recent Advances in Natural Language Processing; 01/2013
  • Fabio Rinaldi, Gerold Schneider, Simon Clematide
    ABSTRACT: The mutual interactions among genes, diseases, and drugs are at the heart of biomedical research, and are especially important for the pharmacological industry. The recent trend towards personalized medicine makes it increasingly relevant to be able to tailor drugs to specific genetic makeups. The pharmacogenetics and pharmacogenomics knowledge base (PharmGKB) aims at capturing relevant information about such interactions from several sources, including curation of the biomedical literature. Advanced text mining tools which can support the process of manual curation are increasingly necessary in order to cope with the deluge of new published results. However, effective evaluation of those tools requires the availability of manually curated data as gold standard. In this paper we discuss how the existing PharmGKB database can be used for such an evaluation task in a way similar to the usage of gold standard data derived from protein-protein interaction databases in one of the recent BioCreative shared tasks. Additionally, we present our own considerations and results on the feasibility and difficulty of such a task.
    Journal of Biomedical Informatics 05/2012; 45(5):851-61. DOI:10.1016/j.jbi.2012.04.014 · 2.48 Impact Factor
  • ABSTRACT: Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose, the BCIII-ACT corpus was provided, which includes a training, development and test set of over 12,000 PPI relevant and non-relevant PubMed abstracts labeled manually by domain experts and recording also the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them. A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89% and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches.
For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, the best MCC score 0.55. In case of competitive systems with an acceptable recall (above 35%) the macro-averaged precision ranged between 50% and 80%, with a maximum F-Score of 55%. The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
    BMC Bioinformatics 10/2011; 12 Suppl 8:S3. DOI:10.1186/1471-2105-12-S8-S3 · 2.67 Impact Factor
  • ABSTRACT: We introduce our incremental coreference resolution system for the BioNLP 2011 Shared Task on Protein/Gene interaction. The benefits of an incremental architecture over a mention-pair model are: a reduction of the number of candidate pairs, a means to overcome the problem of underspecified items in pair-wise classification and the natural integration of global constraints such as transitivity. A filtering system takes into account specific features of different anaphora types. We do not apply machine learning; instead, the system classifies with an empirically derived salience measure based on the dependency labels of the true mentions. The OntoGene pipeline is used for preprocessing.
    Proceedings of the BioNLP Shared Task 2011 Workshop; 06/2011
  • Gerold Schneider, Simon Clematide, Fabio Rinaldi
  • Gerold Schneider, Simon Clematide, Fabio Rinaldi
    ABSTRACT: This article describes the approaches taken by the OntoGene group at the University of Zurich in dealing with two tasks of the BioCreative III competition: classification of articles which contain curatable protein-protein interactions (PPI-ACT) and extraction of experimental methods (PPI-IMT). Two main achievements are described in this paper: (a) a system for document classification which crucially relies on the results of an advanced pipeline of natural language processing tools; (b) a system which is capable of detecting all experimental methods mentioned in scientific literature, and listing them with a competitive ranking (AUC iP/R > 0.5). The results of the BioCreative III shared evaluation clearly demonstrate that significant progress has been achieved in the domain of biomedical text mining in the past few years. Our own contribution, together with the results of other participants, provides evidence that natural language processing techniques have become by now an integral part of advanced text mining approaches.
    BMC Bioinformatics 01/2011; 12 Suppl 8:S13. DOI:10.1186/1471-2105-12-S8-S13 · 2.67 Impact Factor
  • Fabio Rinaldi, Gerold Schneider, Simon Clematide
    Proceedings of the workshop "Mining the Pharmacogenomics Literature", Pacific Symposium on Biocomputing, Hawaii; 01/2011
  • ABSTRACT: We present ODIN (Ontogene Document INspector): a system for interactive curation of biomedical literature, developed within the scope of the SASEBio project (http://www.ontogene.org/projects/sasebio/; Semi-Automated Semantic Enrichment of the Biomedical Literature), as a collaboration between the OntoGene group (http://www.ontogene.org/) at the University of Zurich and the NITAS/TMS group of Novartis Pharma AG. The purpose of the system is to allow a human annotator/curator to leverage the results of an advanced text mining system in order to enhance the speed and effectiveness of the annotation process. The OntoGene system takes as input a document (e.g. a full paper from PubMed Central) and processes it with a custom NLP pipeline, which includes Named Entity recognition and relation extraction. Entities which are currently supported include proteins, genes, experimental methods, cell lines, species. Entities detected in the input document are disambiguated with respect to a reference database (UniProt, EntrezGene, NCBI taxonomy, PSI-MI ontology). The annotated documents are handed back to the ODIN interface, which allows multiple display modalities. The curator/annotator can view the whole document with in-line annotations highlighted, or can browse the extracted entities and be pointed back to the mentions of the entities within the original document. All entity mentions are entirely editable: the curator can easily add or delete any of them, and also change their extent (i.e. add/remove words to its right or left) with a simple click of the mouse. Different entity views are supported, with sorting capabilities according to different criteria (entity type, entity mention, confidence score, etc.). Selective highlighting of text units (e.g. sentences containing desired entities) is supported. Additionally, extensive logging functionalities are provided. All documents and entities are fully interlinked to reference databases, for the purpose of simplified inspection. Entities can be grouped in classes (e.g. by species) and actions can be applied to whole classes, for selective editing or removal.
    Nature Precedings 11/2010; DOI:10.1038/npre.2010.5169.1
  • ABSTRACT: We describe a system for the detection of mentions of protein-protein interactions in the biomedical scientific literature. The original system was developed as a part of the OntoGene project, which focuses on using advanced computational linguistic techniques for text mining applications in the biomedical domain. In this paper, we focus in particular on the participation in the BioCreative II.5 challenge, where the OntoGene system achieved best-ranked results. Additionally, we describe a feature-analysis experiment performed after the challenge, which shows the unexpected result that one single feature alone performs better than the combination of features used in the challenge.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 07/2010; 7(3):472-80. DOI:10.1109/TCBB.2010.50 · 1.54 Impact Factor
  • Conference Paper: OntoGene in CALBC
    Fabio Rinaldi, Simon Clematide, Gerold Schneider
    Proceedings of the CALBC workshop; 01/2010
  • ABSTRACT: We present an approach towards the automatic detection of names of proteins, genes, species, etc. in biomedical literature and their grounding to widely accepted identifiers. The annotation is based on a large term list that contains the common expression of the terms, a normalization step that matches the terms with their actual representation in the texts, and a disambiguation step that resolves the ambiguity of matched terms. We describe various characteristics of the terms found in existing term resources and of the terms that are used in biomedical texts. We evaluate our results against a corpus of manually annotated protein mentions and achieve a precision of 57% and recall of 72%.
    Proceedings of the 12th Conference on Artificial Intelligence in Medicine: Artificial Intelligence in Medicine; 07/2009
  • Kaarel Kaljurand, Gerold Schneider, Fabio Rinaldi
    ABSTRACT: We describe a biological event detection method implemented for the BioNLP 2009 Shared Task 1. The method relies entirely on the chunk and syntactic dependency relations provided by a general NLP pipeline which was not adapted in any way for the purposes of the shared task. The method maps the syntactic relations to event structures while being guided by the probabilities of the syntactic features of events which were automatically learned from the training data. Our method achieved a recall of 26% and a precision of 44% in the official test run, under "strict equality" of events.
    Proceedings of the BioNLP workshop, Boulder, Colorado; 01/2009
  • Proceedings of CICLING 2009, Mexico City; 01/2009
  • Proceedings of the LBM 2009 Conference; 01/2009
  • ABSTRACT: We describe the task of automatically detecting interactions between proteins in biomedical literature. We use a syntactic parser, a corpus annotated for proteins, and manual decisions as training material. After automatically parsing the GENIA corpus, which is manually annotated for proteins, all syntactic paths between proteins are extracted. These syntactic paths are manually disambiguated between meaningful paths and irrelevant paths. Meaningful paths are paths that express an interaction between the syntactically connected proteins; irrelevant paths are paths that do not convey any interaction. The resource created by these manual decisions is used in two ways. First, words that appear frequently inside meaningful paths are learnt using simple machine learning. Second, these resources are applied to the task of automatically detecting interactions between proteins in biomedical literature. We use the IntAct corpus as an application corpus. After detecting proteins in the IntAct texts, we automatically parse them and classify the syntactic paths between them using the meaningful paths from the resource created on GENIA and addressing sparse data problems by shortening the paths based on the words frequently appearing inside the meaningful paths, so-called transparent words. We conduct an evaluation showing that we achieve acceptable recall and good precision, and we discuss the importance of transparent words for the task.
    Proceedings of CICLING 2009; 01/2009
  • ABSTRACT: Research scientists and companies working in the domains of biomedicine and genomics are increasingly faced with the problem of efficiently locating, within the vast body of published scientific findings, the critical pieces of information that are needed to direct current and future research investment. In this report we describe approaches taken within the scope of the second BioCreative competition in order to solve two aspects of this problem: detection of novel protein interactions reported in scientific articles, and detection of the experimental method that was used to confirm the interaction. Our approach to the former problem is based on a high-recall protein annotation step, followed by two strict disambiguation steps. The remaining proteins are then combined according to a number of lexico-syntactic filters, which deliver high-precision results while maintaining reasonable recall. The detection of the experimental methods is tackled by a pattern matching approach, which has delivered the best results in the official BioCreative evaluation. Although the results of BioCreative clearly show that no tool is sufficiently reliable for fully automated annotations, a few of the proposed approaches (including our own) already perform at a competitive level. This makes them interesting either as standalone tools for preliminary document inspection, or as modules within an environment aimed at supporting the process of curation of biomedical literature.
    Genome Biology 09/2008; 9 Suppl 2:S13. DOI:10.1186/gb-2008-9-s2-s13 · 10.47 Impact Factor
  • ABSTRACT: In this paper we present techniques aimed at detecting, within scientific papers which describe newly discovered protein interactions, the methods used by the authors of the research to experimentally verify the interaction(s). We compare previous results over the BioCreAtIvE data set with more recent results over a larger data set, using INTACT annotations as gold standard. This comparison shows the generality of the proposed approach and suggests that practical application of these techniques within a curation environment might not be that far away.
    Third International Symposium on Semantic Mining in Biomedicine (SMBM); 01/2008
  • Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand
    Genome to Systems, Manchester, UK; 01/2008