Thomas Wächter

Technische Universität Dresden, Dresden, Saxony, Germany

Publications (17) · 27.06 Total impact

  • ABSTRACT: The knowledge-based search engine Go3R (www.Go3R.org) has been developed to assist scientists from industry and regulatory authorities in collecting comprehensive toxicological information, with a special focus on identifying available alternatives to animal testing. The semantic search paradigm of Go3R makes use of expert knowledge on 3Rs methods and regulatory toxicology, laid down in the ontology (a network of concepts, terms, and synonyms), to recognize the contents of documents. Search results are automatically sorted into a dynamic table of contents presented alongside the list of documents retrieved. This table of contents allows the user to quickly filter the set of documents by topics of interest. Documents containing hazard information are automatically assigned to a user interface following the endpoint-specific IUCLID5 categorization scheme required, e.g., for REACH registration dossiers. For this purpose, complex endpoint-specific search queries were compiled and integrated into the search engine (based upon a gold standard of 310 references that had been assigned manually to the different endpoint categories). Go3R sorts 87% of the references concordantly into the respective IUCLID5 categories. Currently, Go3R searches the 22 million documents available in the PubMed and TOXNET databases. However, it can be customized to search other databases, including in-house databanks.
    Toxicology in Vitro 01/2014; · 2.65 Impact Factor
  • Götz Fabian, Thomas Wächter, Michael Schroeder
    ABSTRACT: Ontologies are an everyday tool in biomedicine to capture and represent knowledge. However, many ontologies lack a high degree of coverage in their domain and need to improve their overall quality and maturity. Automatically extending sets of existing terms will enable ontology engineers to systematically improve text-based ontologies level by level. We developed an approach to extend ontologies by discovering new terms which are in a sibling relationship to existing terms of an ontology. For this purpose, we combined two approaches which retrieve new terms from the web. The first approach extracts siblings by exploiting the structure of HTML documents, whereas the second approach uses text mining techniques to extract siblings from unstructured text. Our evaluation against MeSH (Medical Subject Headings) shows that our method for sibling discovery is able to suggest first-class ontology terms and can be used as an initial step towards assessing the completeness of ontologies. The evaluation yields a recall of 80% at a precision of 61% where the two independent approaches are complementing each other. For MeSH in particular, we show that it can be considered complete in its medical focus area. We integrated the work into DOG4DAG, an ontology generation plugin for the editors OBO-Edit and Protégé, making it the first plugin that supports sibling discovery on-the-fly. Sibling discovery for ontology is available as part of DOG4DAG (www.biotec.tu-dresden.de/research/schroeder/dog4dag) for both Protégé 4.1 and OBO-Edit 2.1.
    Bioinformatics 06/2012; 28(12):i292-300. · 5.47 Impact Factor
  • Thomas Wächter, Götz Fabian, Michael Schroeder
    ABSTRACT: In the biomedical domain, Protégé and OBO-Edit are the main ontology editors supporting the manual construction of ontologies. Since manual creation is a laborious and hence costly process, there have been efforts to automate parts of this process. Here, we give a demo of the capabilities of DOG4DAG, the Dresden Ontology Generator for Directed Acyclic Graphs, which is available as a plugin for both OBO-Edit and Protégé. In the demo, we describe how to generate terms and in particular siblings, definitions, and is-a relationships using an example in the domain of nervous system diseases. We summarise the strengths and limits of the different steps of the generation process.
    12/2011;
  • ABSTRACT: The risk assessment of nano-sized materials (NM) currently suffers from great uncertainties regarding their putative toxicity for humans and the environment. An extensive amount of the respective original research literature has to be evaluated before targeted and hypothesis-driven Environmental and Health Safety research can be stipulated. Furthermore, to comply with the European animal protection legislation, in vitro testing has to be preferred whenever possible. Against this background, there is a need for tools that enable producers of NM and risk assessors to retrieve data quickly and comprehensively, thereby linking the 3Rs principle to the hazard identification of NM. Here we report on the development of a knowledge-based search engine that is tailored to the particular needs of risk assessors in the area of NM. Comprehensive retrieval of data from studies utilising in vitro as well as in vivo methods, relying on the PubMed database, is illustrated with a titanium dioxide case study. Fast, relevant and reliable information retrieval is of paramount importance for the scientific community dedicated to developing safe NM in various product areas, and for risk assessors obliged to identify data gaps, to define additional data requirements for approval of NM and to create strategies for integrated testing using alternative methods.
    Regulatory Toxicology and Pharmacology 02/2011; 59(1):47-52. · 2.13 Impact Factor
  • Thomas Wächter, Michael Schroeder
    ABSTRACT: Ontologies and taxonomies have proven highly beneficial for biocuration. The Open Biomedical Ontology (OBO) Foundry alone lists over 90 ontologies mainly built with OBO-Edit. Creating and maintaining such ontologies is a labour-intensive, difficult, manual process. Automating parts of it is of great importance for the further development of ontologies and for biocuration. We have developed the Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG), a system which supports the creation and extension of OBO ontologies by semi-automatically generating terms, definitions and parent-child relations from text in PubMed, the web and PDF repositories. DOG4DAG is seamlessly integrated into OBO-Edit. It generates terms by identifying statistically significant noun phrases in text. For definitions and parent-child relations it employs pattern-based web searches. We systematically evaluate each generation step using manually validated benchmarks. The term generation leads to high-quality terms also found in manually created ontologies. Up to 78% of definitions are valid and up to 54% of child-ancestor relations can be retrieved. There is no other validated system that achieves comparable results. By combining the prediction of high-quality terms, definitions and parent-child relations with the ontology editor OBO-Edit we contribute a thoroughly validated tool for all OBO ontology engineers. DOG4DAG is available within OBO-Edit 2.1 at http://www.oboedit.org. Supplementary data are available at Bioinformatics online.
    Bioinformatics 06/2010; 26(12):i88-96. · 5.47 Impact Factor
  • ABSTRACT: Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge to automated methods, which identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation is metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively. The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but requires training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve 80% success rate on average. The 'MetaData' approach performed best with 96%, when trained on high-quality data. Its performance deteriorates as quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success) as MeSH is not a strict is-a/part-of, but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves on average 80% success rate. Metadata is valuable for disambiguation, but requires high-quality training data. Closest Sense requires no training, but a large, consistently modelled ontology, which are two opposing conditions. Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation. The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.
    BMC Bioinformatics 02/2009; 10:28. · 3.02 Impact Factor
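The 'Closest Sense' idea summarised in the abstract above (pick the sense that lies the shortest path away from terms co-occurring in the document) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy ontology, the node names, the function names, and the choice to use the minimum path length over all context terms are assumptions made only for illustration.

```python
from collections import deque


def build_graph(edges):
    """Turn (a, b) pairs into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, start, goal):
    """Shortest number of edges between two nodes (BFS); None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def closest_sense(graph, senses, context_terms):
    """Pick the sense with the shortest path to any co-occurring term."""
    best = None  # (distance, sense)
    for sense in senses:
        lengths = [path_length(graph, term, sense) for term in context_terms]
        lengths = [n for n in lengths if n is not None]
        if lengths and (best is None or min(lengths) < best[0]):
            best = (min(lengths), sense)
    return best[1] if best else None


# Hypothetical toy ontology; node names are illustrative only.
ONTOLOGY = build_graph([
    ("root", "biological_process"),
    ("root", "information_science"),
    ("biological_process", "development_biological"),
    ("development_biological", "embryo_development"),
    ("information_science", "development_software"),
    ("development_software", "programming"),
])
SENSES = ["development_biological", "development_software"]
```

Calling `closest_sense(ONTOLOGY, SENSES, ["embryo_development"])` selects the biological sense because it is one edge away, whereas the software sense is five edges away. A context term that is absent from the ontology contributes nothing, which reflects the abstract's point that this method depends on a large, consistently modelled ontology.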
  • ABSTRACT: Consideration and incorporation of all available scientific information is an important part of the planning of any scientific project. As regards research with sentient animals, EU Directive 86/609/EEC for the protection of laboratory animals requires scientists to consider, before performing the experiment, whether any planned animal experiment can be substituted by other scientifically satisfactory methods not entailing the use of animals or entailing fewer animals or less animal suffering. Thus, collection of relevant information is indispensable in order to meet this legal obligation. However, no standard procedures or services exist to provide convenient access to the information required to reliably determine whether it is possible to replace, reduce or refine a planned animal experiment in accordance with the 3Rs principle. The search engine Go3R, which is available free of charge at http://Go3R.org, aims to become such a standard service. Go3R is the world's first search engine on alternative methods building on new semantic technologies that use an expert-knowledge based ontology to identify relevant documents. Due to Go3R's concept and design, the search engine can be used without lengthy instructions. It enables all those involved in the planning, authorisation and performance of animal experiments to determine the availability of non-animal methodologies in a fast, comprehensive and transparent manner. Thereby, Go3R strives to significantly contribute to the avoidance and replacement of animal experiments.
    ALTEX. 02/2009; 26(1):17-31.
  • ABSTRACT: EU Directive 86/609/EEC for the protection of laboratory animals obliges scientists to consider whether a planned animal experiment can be replaced, reduced or refined (3Rs principle). To meet this regulatory obligation, scientists must consult the relevant scientific literature prior to any experimental study using laboratory animals. More than 50 million potentially 3Rs relevant documents are spread over the World Wide Web, biomedical literature and patent databases. In April 2008, the beta version of Go3R (http://www.Go3R.org), the first knowledge-based semantic search engine for alternative methods to animal experiments, was released. Go3R is free of charge and enables scientists and regulatory authorities involved in the planning, authorisation and performance of animal experiments to determine the availability of alternative methods in a fast and comprehensive manner. The technical basis of this search engine is specific 3Rs expert knowledge captured within the Go3R Ontology containing 87,218 labels and synonyms. A total of 16,620 concepts were structured in 28 branches, where 1,227 concepts were newly defined to specifically describe directly 3Rs relevant knowledge. Additionally, relevant headings from MeSH were referenced to reflect the topics associated with the definition of Animal Testing Alternatives. The ontology therefore distinguishes between thematic-defining and directly 3Rs relevant branches. In addition to the assignment of direct parent-child relationships, further relationship types were introduced to allow modelling of 3Rs relevant domain knowledge. Examples of such knowledge are (1) the characteristics of cell culture test methods, which usually utilize “specific cell types” or “cell lines” and are associated with a specific “endpoint” and “endpoint detection method”, or (2) named test methods like “PREDISAFE™”, which replaces an animal test, namely the “eye irritation test” in rabbits, and uses specific cells, namely “SIRC Cells”, or (3) the “Haemagglutinin-Neuraminidase Protein Assay”, which detects a protein of the “Newcastle disease virus”. Thereby, an article in which, e.g., a specific 3Rs method is not explicitly mentioned can still be recognized indirectly as relevant for the topic searched for, for example if it mentions specific cells, endpoints or endpoint detection methods which are relevant for the respective application. The search engine Go3R with its novel ontology is already well recognized by the 3Rs community and will be further maintained and developed.
    Nature Precedings. 01/2009;
  • ABSTRACT: The biomedical literature can be seen as a large integrated, but unstructured data repository. Extracting facts from literature and making them accessible is approached from two directions: manual curation efforts develop ontologies and vocabularies to annotate gene products based on statements in papers. Text mining aims to automatically identify entities and their relationships in text using information retrieval and natural language processing techniques. Manual curation is highly accurate but time-consuming, and does not scale with the ever-increasing growth of literature. Text mining as a high-throughput computational technique scales well, but is error-prone due to the complexity of natural language. How can both be married to combine scalability and accuracy? Here, we review the state-of-the-art text mining approaches that are relevant to annotation and discuss available online services analysing biomedical literature by means of text mining techniques, which could also be utilised by annotation projects. We then examine how far text mining has already been utilised in existing annotation projects and conclude how these techniques could be tightly integrated into the manual annotation process through novel authoring systems to scale up high-quality manual curation.
    Briefings in Bioinformatics 01/2009; 9(6):466-78. · 5.30 Impact Factor
  • ABSTRACT: With the ever-increasing size of scientific literature, finding relevant documents and answering questions has become even more of a challenge. Recently, ontologies (hierarchical, controlled vocabularies) have been introduced to annotate genomic data. They can also improve question answering and the selection of relevant documents in literature search. Search engines such as GoPubMed.org use ontological background knowledge to give an overview of large query results and to answer questions. We review the problems and solutions underlying these next-generation intelligent search engines and give examples of the power of this new search paradigm. Keywords: PubMed; literature search; ontology; intelligent search
    12/2008: pages 385-399;
  • ABSTRACT: The engineering of ontologies, especially with a view to a text-mining use, is still a new research field. There does not yet exist a well-defined theory and technology for ontology construction. Many of the ontology design steps remain manual and are based on personal experience and intuition. However, there exist a few efforts on automatic construction of ontologies in the form of extracted lists of terms and relations between them. We share experience acquired during the manual development of a lipoprotein metabolism ontology (LMO) to be used for text-mining. We compare the manually created ontology terms with the automatically derived terminology from four different automatic term recognition (ATR) methods. The top 50 predicted terms contain up to 89% relevant terms. For the top 1000 terms the best method still generates 51% relevant terms. In a corpus of 3066 documents, 53% of LMO terms are contained and 38% can be generated with one of the methods. Given high precision, automatic methods can help decrease development time and provide significant support for the identification of domain-specific vocabulary. The coverage of the domain vocabulary depends strongly on the underlying documents. Ontology development for text mining should be performed in a semi-automatic way, taking ATR results as input and following the guidelines we described. The TFIDF term recognition is available as a Web Service, described at http://gopubmed4.biotec.tu-dresden.de/IdavollWebService/services/CandidateTermGeneratorService?wsdl.
    BMC Bioinformatics 02/2008; 9 Suppl 4:S2. · 3.02 Impact Factor
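To make the evaluation figures in the abstract above more concrete (e.g. "the top 50 predicted terms contain up to 89% relevant terms"), here is a minimal sketch of ranking candidate terms by a plain TF-IDF score and measuring precision among the top k. It is an illustration under assumptions, not the paper's method: the abstract does not specify the TF-IDF variant, real term recognition works on noun phrases rather than single tokens, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(documents):
    """Rank tokens by corpus term frequency times log inverse document frequency.

    `documents` is a list of token lists. Ties are broken alphabetically so
    the ranking is deterministic.
    """
    n_docs = len(documents)
    term_freq = Counter()
    doc_freq = Counter()
    for tokens in documents:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    scores = {
        term: term_freq[term] * math.log(n_docs / doc_freq[term])
        for term in term_freq
    }
    return sorted(scores, key=lambda term: (-scores[term], term))


def precision_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the top-k ranked terms that are in the relevant set."""
    top = ranked_terms[:k]
    if not top:
        return 0.0
    return sum(1 for term in top if term in relevant_terms) / len(top)


# Hypothetical toy corpus and gold-standard vocabulary.
DOCS = [
    ["lipoprotein", "lipase", "hydrolyses", "the", "lipoprotein"],
    ["the", "receptor", "binds", "the", "ligand"],
    ["the", "lipoprotein", "lipase", "gene"],
]
GOLD = {"lipoprotein", "lipase", "receptor", "ligand"}

ranked = tfidf_rank(DOCS)
```

In this toy corpus a token that occurs in every document (here "the") receives a score of zero and falls to the bottom of the ranking, while `precision_at_k(ranked, GOLD, k)` corresponds to the "relevant terms among the top k" measure reported in the abstract.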
  • ABSTRACT: Background: The engineering of ontologies, especially with a view to a text-mining use, is still a new research field. There does not yet exist a well-defined theory and technology for ontology construction. Many of the ontology design steps remain manual and are based on personal experience and intuition. However, there exist a few efforts on automatic construction of ontologies in the form of extracted lists of terms and relations between them. Results: We share experience acquired during the manual development of a lipoprotein metabolism ontology (LMO) to be used for text-mining. We compare the manually created ontology terms with the automatically derived terminology from four different automatic term recognition methods. The top 50 predicted terms contain up to 89% relevant terms. For the top 1000 terms the best method still generates 51% relevant terms. In a corpus of 3066 documents, 53% of LMO terms are contained and 38% can be generated with one of the methods. Secondly, we present a use case for ontology-based search for toxicological methods. Conclusions: Given high precision, automatic methods can help decrease development time and provide significant support for the identification of domain-specific vocabulary. The coverage of the domain vocabulary depends strongly on the underlying documents. Ontology development for text mining should be performed in a semi-automatic way, taking automatic term recognition results as input. Availability: The automatic term recognition method is available as a web service, described at http://gopubmed4.biotec.tu-dresden.de/IdavollWebService/services/CandidateTermGeneratorService?wsdl
    @InProceedings{alexopoulou_et_al:DSP:2008:1506,
      author = {Dimitra Alexopoulou and Thomas W{\"a}chter and Laura Pickersgill and Cecilia Eyre and Michael Schroeder},
      title = {Ontology learning with text mining: Two use cases in lipoprotein metabolism and toxicology},
      booktitle = {Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
      year = {2008},
      editor = {Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
      number = {08131},
      series = {Dagstuhl Seminar Proceedings},
      ISSN = {1862-4405},
      publisher = {Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany},
      address = {Dagstuhl, Germany},
      URL = {http://drops.dagstuhl.de/opus/volltexte/2008/1506},
      annote = {Keywords: Automatic Term Recognition, Ontology Learning, Lipoprotein Metabolism}
    }
    01/2008;
  • ABSTRACT: Semantic web technologies promise to ease the pain of data and system integration in the life sciences. The semantic web consists of standards such as XML for mark-up of contents, RDF for representation of triplets, and OWL to define ontologies. We discuss three approaches for querying semantic web contents and building integrated bioinformatics applications, allowing bioinformaticians to make an informed choice for their data integration needs. Besides the already established approach XQuery, we compare two novel rule-based approaches, namely Xcerpt, a versatile XML and RDF query language, and Prova, a language for rule-based Java scripting. We demonstrate the core features and limitations of these three approaches through a case study comprising an ontology browser that supports retrieval of protein structure and sequence information for proteins annotated with terms from the ontology.
    12/2006: pages 31-52;
  • ABSTRACT: The life sciences are a promising application area for semantic web technologies as there are large online structured and unstructured data repositories and ontologies, which structure this knowledge. We briefly give an overview of biomedical ontologies and show how they can help to locate, retrieve, and integrate biomedical data. Annotating literature with ontology terms is an important problem to support such ontology-based searches. We review the steps involved in this text mining task and introduce the ontology-based search engine GoPubMed. As the underlying data sources evolve, so do the ontologies. We give a brief overview of different approaches supporting the semi-automatic evolution of ontologies.
    Reasoning Web, Second International Summer School 2006, Lisbon, Portugal, September 4-8, 2006, Tutorial Lectures; 01/2006
  • ABSTRACT: Bio-ontologies are hierarchical vocabularies, which are used to annotate other data sources such as sequence and structure databases. With the wide use of ontologies, their integration, design, and evolution become an important problem. We show how text mining on relevant text corpora can be used to identify matching ontology terms of two separate ontologies and to propose new ontology terms for a given term. We evaluate these approaches on the Gene Ontology.
    Proceedings of the Winter Simulation Conference WSC 2006, Monterey, California, USA, December 3-6, 2006; 01/2006
  • ABSTRACT: In the context of genome research, the method of gene expression analysis has been used for several years. Related microarray experiments are conducted all over the world, and consequently, a vast number of microarray data sets are produced. Having access to this variety of repositories, researchers would like to incorporate this data in their analyses to increase the statistical significance of their results. In this paper, we present a new two-phase clustering strategy which is based on the combination of local clustering results to obtain a global clustering. The advantage of such a technique is that each microarray data set can be normalized and clustered separately. The set of different relevant local clustering results is then used to calculate the global clustering result. Furthermore, we present an approach based on technical as well as biological quality measures to determine weighting factors for quantifying the local results' proportion within the global result. The better the attested quality of the local results, the stronger their impact on the global result.
    Proceedings of the 2006 ACM Symposium on Applied Computing (SAC), Dijon, France, April 23-27, 2006; 01/2006
  • ABSTRACT: Many ontologies and vocabularies have been designed to annotate genes and gene products based on evidence from literature. They are also useful to search literature systematically. GoPubMed is such an ontology-based literature search engine. It allows users to explore PubMed search results with hierarchical vocabularies such as the Gene Ontology or MeSH. We demonstrate the use of GoPubMed and MeshPubMed to answer questions relating to anatomy. Then, we discuss MousePubMed, the adaptation of GoPubMed to vocabularies used in the Edinburgh Mouse Atlas with genes, tissues, and developmental stages. We develop a specific text mining algorithm for MousePubMed and demonstrate its usefulness by evaluating it on the Mouse Atlas. For nearly 1500 genes and over 10,000 triples of gene, tissue and stage, we are able to reconstruct with MousePubMed 37% of genes, 31% of gene-tissue associations and 13% of gene-tissue-stage associations from PubMed abstracts. These figures are encouraging as only abstracts are used.