Andreas Karwath

Johannes Gutenberg-Universität Mainz, Mainz, Rhineland-Palatinate, Germany

Publications (29) · 31.64 Total Impact

  • ABSTRACT: (Q)SAR model validation is essential to ensure the quality of inferred models and to indicate future model predictivity on unseen compounds. Proper validation is also one of the requirements of regulatory authorities for accepting a (Q)SAR model and approving its use in real-world scenarios as an alternative testing method. At the same time, however, the question of how to validate a (Q)SAR model, in particular whether to employ variants of cross-validation or external test set validation, is still under discussion. In this paper, we empirically compare k-fold cross-validation with external test set validation. To this end, we introduce a workflow that realistically simulates the common problem setting of building predictive models for relatively small datasets. The workflow allows the built and validated models to be applied to large amounts of unseen data, and the performance of the different validation approaches to be compared. The experimental results indicate that cross-validation produces better-performing (Q)SAR models than external test set validation and reduces the variance of the results, while at the same time underestimating the performance on unseen compounds. The results reported in this paper suggest that, contrary to the current conception in the community, cross-validation may play a significant role in evaluating the predictivity of (Q)SAR models.
    Molecular Informatics. 10/2013; 32(9‐10).
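    The comparison hinges on contrasting two validation schemes for the same model class. The sketch below is not the paper's workflow; it only illustrates the two schemes on a synthetic stand-in dataset, with scikit-learn and a random forest as assumed choices.

      # Minimal sketch: k-fold cross-validation vs. a single external test set.
      from sklearn.datasets import make_classification
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import cross_val_score, train_test_split

      # Synthetic stand-in for a small (Q)SAR dataset: 200 "compounds", 50 descriptors.
      X, y = make_classification(n_samples=200, n_features=50, random_state=0)
      model = RandomForestClassifier(n_estimators=100, random_state=0)

      # Scheme 1: 10-fold cross-validation on the full dataset.
      cv_scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
      print(f"10-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

      # Scheme 2: external test set validation (hold out 30% once).
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
      model.fit(X_train, y_train)
      print(f"External test set accuracy: {model.score(X_test, y_test):.3f}")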
  • Madeleine Seeland, Andreas Karwath, Stefan Kramer
    ABSTRACT: In recent years, graph kernels have received considerable interest within the machine learning and data mining community. Here, we introduce a novel approach enabling kernel methods to utilize additional information hidden in the structural neighborhood of the graphs under consideration. Our novel structural cluster kernel (SCK) incorporates similarities induced by a structural clustering algorithm to improve state-of-the-art graph kernels. The approach taken is based on the idea that graph similarity can not only be described by the similarity between the graphs themselves, but also by the similarity they possess with respect to their structural neighborhood. We applied our novel kernel in a supervised and a semi-supervised setting to regression and classification problems on a number of real-world datasets of molecular graphs. Our results show that the structural cluster similarity information can indeed improve the prediction performance of the base kernel, particularly when the dataset is structurally sparse and consequently structurally diverse. By additionally taking into account a large number of unlabeled instances, the performance of the structural cluster kernel can be improved further.
    08/2012;
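    A rough sketch of the idea of enriching a base kernel with structural-neighborhood information: below, a linear kernel and k-means clustering stand in for the paper's graph kernel and structural clustering, and the convex combination is only an assumed way of mixing the two similarities.

      import numpy as np
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 16))        # stand-in descriptor vectors for 100 graphs

      K_base = X @ X.T                      # base kernel (here simply a linear kernel)

      labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
      K_cluster = (labels[:, None] == labels[None, :]).astype(float)   # 1 if same cluster

      alpha = 0.5                           # mixing weight, a free parameter in this sketch
      K_sck = (1 - alpha) * K_base / np.abs(K_base).max() + alpha * K_cluster
      print(K_sck.shape, np.round(K_sck[0, :5], 3))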
  • Martin Gütlein, Andreas Karwath, Stefan Kramer
    ABSTRACT: Analyzing chemical datasets is a challenging task for scientific researchers in the field of chemoinformatics. It is important, yet difficult, to understand the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. In that respect, visualization tools can help to better comprehend the underlying correlations. Our recently developed 3D molecular viewer CheS-Mapper (Chemical Space Mapper) divides large datasets into clusters of similar compounds and consequently arranges them in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity by selecting which features to employ in the process. The tool can use and calculate different kinds of features, such as structural fragments as well as quantitative chemical descriptors. These features can be highlighted within CheS-Mapper, which helps the chemist to better understand patterns and regularities and to relate the observations to established scientific knowledge. As a final function, the tool can also be used to select and export specific subsets of a given dataset for further analysis.
    Journal of Cheminformatics 03/2012; 4(1):7. · 3.59 Impact Factor
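    The core idea (cluster compounds on user-selected features and lay them out in 3D so that proximity mirrors similarity) can be imitated in a few lines. The sketch below uses random feature vectors, hierarchical clustering and PCA as assumed stand-ins; CheS-Mapper itself computes real cheminformatics features and has its own embedding.

      import numpy as np
      from sklearn.cluster import AgglomerativeClustering
      from sklearn.decomposition import PCA

      rng = np.random.default_rng(1)
      features = rng.normal(size=(300, 40))          # 300 "compounds", 40 features

      clusters = AgglomerativeClustering(n_clusters=6).fit_predict(features)
      coords_3d = PCA(n_components=3).fit_transform(features)   # 3-D layout

      for c in range(6):
          members = clusters == c
          centre = coords_3d[members].mean(axis=0)
          print(f"cluster {c}: {members.sum()} compounds, centre {np.round(centre, 2)}")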
  • ABSTRACT: We present a novel approach to incrementally determine the trajectory of a person in 3-D based on their motions and activities in real time. In our algorithm, we estimate the motions and activities of the user given the data obtained from a motion capture suit equipped with several inertial measurement units. These activities include walking up and down staircases, as well as opening and closing doors. We interpret the first two types of activities as motion constraints and door-handling events as landmark detections in a graph-based simultaneous localization and mapping (SLAM) framework. Since we cannot distinguish between individual doors, we employ a multihypothesis tracking approach on top of the SLAM procedure to deal with the high data-association uncertainty. As a result, we are able to accurately and robustly recover the trajectory of the person. Additionally, we present an algorithm to build approximate geometrical and topological maps based on the estimated trajectory and detected activities. We evaluate our approach in practical experiments carried out with different subjects and in various environments.
    IEEE Transactions on Robotics 01/2012; 28:234-245. · 2.57 Impact Factor
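    At the heart of the approach is a pose graph whose edges come from motion constraints and door observations. The toy below is a drastic simplification (1-D poses, linear least squares, hand-picked measurements) rather than the paper's 3-D system, but it shows how two observations of the same door act as a loop closure.

      import numpy as np

      # Unknowns: poses x0..x4 along a corridor and one door position d -> 6 variables.
      rows, rhs = [], []

      def add_edge(i, j, meas, n_vars=6):
          """Constraint z[j] - z[i] = meas (odometry or pose-to-landmark)."""
          a = np.zeros(n_vars)
          a[i], a[j] = -1.0, 1.0
          rows.append(a)
          rhs.append(meas)

      add_edge(0, 1, 1.0)    # noisy step lengths from the motion model
      add_edge(1, 2, 1.1)
      add_edge(2, 3, 0.9)
      add_edge(3, 4, 1.2)
      add_edge(1, 5, 0.5)    # door seen 0.5 m ahead of pose 1
      add_edge(4, 5, -2.6)   # the same door seen 2.6 m behind pose 4 (loop closure)

      anchor = np.zeros(6)
      anchor[0] = 1.0        # fix the first pose at the origin (removes the gauge freedom)
      rows.append(anchor)
      rhs.append(0.0)

      A, b = np.vstack(rows), np.array(rhs)
      z, *_ = np.linalg.lstsq(A, b, rcond=None)
      print("poses:", np.round(z[:5], 2), "door:", round(z[5], 2))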
  • Embedded Reasoning, Papers from the 2010 AAAI Spring Symposium, Technical Report SS-10-04, Stanford, California, USA, March 22-24, 2010; 01/2010
  • ABSTRACT: We present a novel approach to build approximate maps of structured environments utilizing human motion and activity. Our approach uses data recorded with a data suit equipped with several IMUs to detect movements of a person and door opening and closing events. In our approach, we interpret the movements as motion constraints and door handling events as landmark detections in a graph-based SLAM framework. As we cannot distinguish between individual doors, we employ a multi-hypothesis approach on top of the SLAM system to deal with the high data-association uncertainty. As a result, our approach is able to accurately and robustly recover the trajectory of the person. We additionally take advantage of the fact that people traverse free space and that doors separate rooms to recover the geometric structure of the environment after the graph optimization. We evaluate our approach in several experiments carried out with different users and in environments of different types.
    IEEE International Conference on Robotics and Automation, ICRA 2010, Anchorage, Alaska, USA, 3-7 May 2010; 01/2010
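    Because the doors are indistinguishable, every door event can belong to any previously seen door or to a new one. The sketch below keeps only the k best hypotheses; the scoring (squared distance for a match, a fixed penalty for a new door) is an assumption made here for illustration, not the scoring used in the paper.

      def expand(hypotheses, event_pos, match_sigma=0.5, new_penalty=2.0, keep=5):
          """Each hypothesis is (score, [door positions]); lower score is better."""
          expanded = []
          for score, doors in hypotheses:
              for d in doors:                                  # associate event with door d
                  expanded.append((score + (event_pos - d) ** 2 / match_sigma ** 2, doors))
              expanded.append((score + new_penalty, doors + [event_pos]))   # spawn a new door
          expanded.sort(key=lambda h: h[0])
          return expanded[:keep]

      hyps = [(0.0, [])]
      for pos in [1.5, 4.0, 1.6]:     # the third event is plausibly the first door again
          hyps = expand(hyps, pos)

      best_score, best_doors = hyps[0]
      print(f"best hypothesis: {len(best_doors)} doors at {best_doors} (score {best_score:.2f})")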
  • ABSTRACT: OpenTox provides an interoperable, standards-based Framework for the support of predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals. The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services. The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation. Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way.
    Journal of Cheminformatics 01/2010; 2(1):7. · 3.59 Impact Factor
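    Since the framework is exposed as REST web services, a client only needs plain HTTP. The snippet below shows the general access pattern; the base URL and resource names are placeholders, not actual OpenTox endpoints, so treat it as a hedged sketch rather than working client code.

      import requests

      BASE = "https://example.org/opentox"      # placeholder service root, not a real endpoint

      def list_resources(kind):
          """Fetch the URI list of a resource collection (e.g. 'dataset' or 'model')."""
          resp = requests.get(f"{BASE}/{kind}", headers={"Accept": "text/uri-list"})
          resp.raise_for_status()               # will fail against the placeholder URL
          return resp.text.splitlines()

      if __name__ == "__main__":
          for uri in list_resources("dataset")[:5]:
              print(uri)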
  • Hannes Schulz, Kristian Kersting, Andreas Karwath
    ABSTRACT: Relational data is complex. This complexity makes one of the basic steps of ILP difficult: understanding the data and results. If the user cannot easily understand them, they will draw incomplete conclusions. The situation is very much like the parable of the blind men and the elephant that appears in many cultures: the blind men work independently and with quite different pieces of information, thereby drawing very different conclusions about the nature of the beast. In contrast, visual representations make it easy to shift from one perspective to another while exploring and analyzing data. This paper describes a method for embedding interpretations and queries into a single, common Euclidean space based on their co-proven statistics. We demonstrate our method on real-world datasets, showing that ILP results can indeed be captured at a glance.
    Inductive Logic Programming, 19th International Conference, ILP 2009, Leuven, Belgium, July 02-04, 2009. Revised Papers; 01/2009
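    The underlying construction can be imitated with a query-by-interpretation "proves" matrix and a spectral embedding. The sketch below uses a random binary matrix and a plain truncated SVD as assumptions; the paper's method, based on co-proven statistics, is not reproduced here.

      import numpy as np

      rng = np.random.default_rng(2)
      # M[q, i] = 1 if query q succeeds on interpretation i (random stand-in data).
      M = (rng.random((30, 80)) < 0.2).astype(float)

      U, S, Vt = np.linalg.svd(M - M.mean(), full_matrices=False)
      query_coords = U[:, :2] * S[:2]       # 2-D coordinates for the 30 queries
      interp_coords = Vt[:2].T * S[:2]      # 2-D coordinates for the 80 interpretations

      print(np.round(query_coords[:3], 2))
      print(np.round(interp_coords[:3], 2))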
  • Andreas Karwath, Kristian Kersting, Niels Landwehr
    ABSTRACT: The task of aligning sequences arises in many applications. Classical dynamic programming approaches require explicit state enumeration in the reward model. This is often impractical: the number of states grows very quickly with the number of domain objects and relations among these objects. Relational sequence alignment aims at exploiting symbolic structure to avoid the full enumeration. This comes at the expense of a more complex reward model selection problem: virtually infinitely many abstraction levels have to be explored. In this paper, we apply gradient-based boosting to address this problem. Specifically, we show how to reduce the learning problem to a series of relational regression problems. The main benefit of this is that interactions between state variables are introduced only as needed, so that the potentially infinite search space is not explicitly considered. As our experimental results show, this boosting approach can significantly improve upon established results in challenging applications.
    Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15-19, 2008, Pisa, Italy; 01/2008
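    The reduction to a series of regression problems is the standard functional gradient boosting recipe. The sketch below shows that recipe for squared loss; plain attribute-value regression stumps from scikit-learn stand in for the relational regression learner used in the paper.

      import numpy as np
      from sklearn.tree import DecisionTreeRegressor

      rng = np.random.default_rng(3)
      X = rng.uniform(-3, 3, size=(200, 2))
      y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

      learning_rate, ensemble = 0.1, []
      pred = np.zeros_like(y)
      for _ in range(100):
          residuals = y - pred                          # negative gradient of squared loss
          stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
          ensemble.append(stump)
          pred += learning_rate * stump.predict(X)

      print(f"training MSE after boosting: {np.mean((y - pred) ** 2):.4f}")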
  • Andreas Karwath, Kristian Kersting
    ABSTRACT: The need to measure sequence similarity arises in information extraction, music mining, biological sequence analysis, and other domains, and often coincides with sequence alignment: the more similar two sequences are, the better they can be aligned. Aligning sequences not only shows how similar sequences are, it also shows where there are differences and correspondences between the sequences. Traditionally, alignment has been considered for sequences of flat symbols only. Many real-world sequences such as protein secondary structures, however, exhibit rich internal structure. This is akin to the problem of dealing with structured examples studied in the field of inductive logic programming (ILP). In this paper, we propose to use well-established ILP distance measures within alignment methods. Although straightforward, our initial experimental results show that this approach performs well in practice and is worth exploring.
    01/2008;
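    Plugging a symbol-level distance into a standard alignment recursion is straightforward. In the sketch below, structured symbols are represented as sets of ground atoms and 1 minus their Jaccard overlap is a crude stand-in for the ILP distance measures discussed in the paper; the recursion itself is ordinary Needleman-Wunsch with linear gap costs.

      def dist(a, b):
          """Distance between two structured symbols, each given as a set of atoms."""
          return 1.0 - len(a & b) / len(a | b) if (a or b) else 0.0

      def align_cost(seq1, seq2, gap=1.0):
          n, m = len(seq1), len(seq2)
          D = [[0.0] * (m + 1) for _ in range(n + 1)]
          for i in range(1, n + 1):
              D[i][0] = i * gap
          for j in range(1, m + 1):
              D[0][j] = j * gap
          for i in range(1, n + 1):
              for j in range(1, m + 1):
                  D[i][j] = min(D[i - 1][j - 1] + dist(seq1[i - 1], seq2[j - 1]),
                                D[i - 1][j] + gap,
                                D[i][j - 1] + gap)
          return D[n][m]

      s1 = [{"helix", "len_7"}, {"strand", "len_3"}]
      s2 = [{"helix", "len_7"}, {"coil", "len_2"}]
      print(f"alignment cost: {align_cost(s1, s2):.2f}")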
  • ABSTRACT: Sequential behavior and sequence learning are essential to intelligence. Often the elements of sequences exhibit an internal structure that can elegantly be represented using relational atoms. Applying traditional sequential learning techniques to such relational sequences requires one either to ignore the internal structure or to live with a combinatorial explosion of the model complexity. This chapter briefly reviews relational sequence learning and describes several techniques tailored towards realizing this, such as local pattern mining techniques, (hidden) Markov models, conditional random fields, dynamic programming and reinforcement learning.
    Probabilistic Inductive Logic Programming - Theory and Applications; 01/2008
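    To make the setting concrete: the elements of a relational sequence are logical atoms rather than flat symbols, and one of the simplest techniques the chapter covers is mining local patterns over such sequences. The toy below (invented door/room atoms, a predicate-subsequence pattern, support counting) is only meant to show the representation, not any particular algorithm from the chapter.

      sequences = [
          [("enter", "room1"), ("open", "door_a"), ("close", "door_a"), ("enter", "room2")],
          [("enter", "hall"), ("open", "door_b"), ("enter", "room3")],
      ]

      def support(pattern, seqs):
          """Fraction of sequences containing the pattern's predicates in order."""
          def contains(seq):
              it = iter(atom[0] for atom in seq)
              return all(p in it for p in pattern)     # ordered subsequence test
          return sum(contains(s) for s in seqs) / len(seqs)

      print(support(("open", "enter"), sequences))     # -> 1.0
      print(support(("enter", "close"), sequences))    # -> 0.5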
  • Andreas Karwath, Luc De Raedt
    ABSTRACT: Most approaches to structure-activity relationship (SAR) prediction proceed in two steps. In the first step, a typically large set of fingerprints, or fragments of interest, is constructed (either by hand or by some recent data mining techniques). In the second step, machine learning techniques are applied to obtain a predictive model. The result is often a model that is not only highly accurate but also hard to interpret. In this paper, we demonstrate the capabilities of a novel SAR algorithm, SMIREP, which tightly integrates the fragment and model generation steps and which yields simple models in the form of a small set of IF-THEN rules. These rules contain SMILES fragments, which are easy for the computational chemist to understand. SMIREP combines ideas from the well-known IREP rule learner with a novel fragmentation algorithm for SMILES strings. SMIREP has been evaluated on three problems: the prediction of binding activities for the estrogen receptor (the Environmental Protection Agency's (EPA's) Distributed Structure-Searchable Toxicity (DSSTox) National Center for Toxicological Research estrogen receptor (NCTRER) database), the prediction of mutagenicity using the Carcinogenic Potency Database (CPDB), and the prediction of biodegradability on a subset of the Environmental Fate Database (EFDB). In these applications, SMIREP has the advantage of producing easily interpretable rules while having predictive accuracies that are comparable to those of alternative state-of-the-art techniques.
    Journal of Chemical Information and Modeling 01/2007; 46(6):2432-44. · 4.30 Impact Factor
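    The appeal of the output format is that a prediction is just a small set of IF-THEN rules over SMILES fragments. The toy below uses invented fragments and labels, and a crude substring test on SMILES strings stands in for the proper substructure matching a real system would perform with a chemistry toolkit.

      rules = [
          {"fragments": ["N(=O)=O", "c1ccccc1"], "predict": "active"},    # e.g. nitro-aromatic
          {"fragments": ["C(=O)O"], "predict": "inactive"},
      ]

      def predict(smiles, rules, default="inactive"):
          """Return the prediction of the first rule whose fragments all occur."""
          for rule in rules:
              if all(frag in smiles for frag in rule["fragments"]):
                  return rule["predict"]
          return default

      print(predict("c1ccccc1C(=O)O", rules))     # benzoic acid   -> inactive
      print(predict("c1ccccc1N(=O)=O", rules))    # nitrobenzene   -> active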
  • ABSTRACT: This paper is a manifesto aimed at computer scientists interested in developing and applying scientific discovery methods. It argues that: science is experiencing an unprecedented “explosion” in the amount of available data; traditional data analysis methods cannot deal with this increased quantity of data; there is an urgent need to automate the process of refining scientific data into scientific knowledge; inductive logic programming (ILP) is a data analysis framework well suited for this task; and exciting new scientific discoveries can be achieved using ILP scientific discovery methods. We describe an example of using ILP to analyse a large and complex bioinformatic database that has produced unexpected and interesting scientific results in functional genomics. We then point to a possible way forward for integrating machine learning with scientific databases to form intelligent databases.
    Computational Discovery of Scientific Knowledge, Introduction, Techniques, and Applications in Environmental and Life Sciences; 01/2007
  • Amanda Clare, Andreas Karwath, Helen Ougham, Ross D. King
    ABSTRACT: MOTIVATION: The genome of Arabidopsis thaliana, the best understood plant genome, still has approximately one-third of its genes with no functional annotation at all from either MIPS or TAIR. We have applied our Data Mining Prediction (DMP) method to the problem of predicting the functional classes of these protein sequences. This method is based on using a hybrid machine-learning/data-mining approach to identify patterns in the bioinformatic data about sequences that are predictive of function. We use data about sequence, predicted secondary structure, predicted structural domains, InterPro patterns, sequence similarity profiles and expression data. RESULTS: We predicted the functional class of a high percentage of the Arabidopsis genes with currently unknown function. These predictions are interpretable and have good test accuracies. We describe in detail seven of the rules produced.
    Bioinformatics 06/2006; 22(9):1130-6. · 5.32 Impact Factor
  • Andreas Karwath, Kristian Kersting
    ABSTRACT: The need to measure sequence similarity arises in many application domains and often coincides with sequence alignment: the more similar two sequences are, the better they can be aligned. Aligning sequences not only shows how similar sequences are, it also shows where there are differences and correspondences between the sequences. Traditionally, alignment has been considered for sequences of flat symbols only. Many real-world sequences such as natural language sentences and protein secondary structures, however, exhibit rich internal structure. This is akin to the problem of dealing with structured examples studied in the field of inductive logic programming (ILP). In this paper, we introduce Real, a powerful yet simple approach to align sequences of structured symbols using well-established ILP distance measures within traditional alignment methods. Although straightforward, experiments on protein data and Medline abstracts show that this approach works well in practice, that the resulting alignments can indeed provide more information than flat ones, and that they are meaningful to experts when represented graphically.
    Inductive Logic Programming, 16th International Conference, ILP 2006, Santiago de Compostela, Spain, August 24-27, 2006, Revised Selected Papers; 01/2006
  • Bioinformatics. 01/2006; 22:1674.
  • Praxis der Informationsverarbeitung und Kommunikation. 01/2006; 29:81-87.
  • Christian Stolle, Andreas Karwath, Luc De Raedt
    ABSTRACT: A novel inductive logic programming system, called Classic'cl, is presented. Classic'cl integrates several settings for learning, in particular learning from interpretations and learning from satisfiability. Within these settings, it addresses predictive, descriptive and probabilistic modeling tasks. As such, Classic'cl (C-armr, Claudien, ICL-S(S)at, ICL, and CLLPAD) integrates several well-known inductive logic programming systems such as Claudien, Warmr (and its extension C-armr), ICL, ICL-SAT, and CLLPAD. We report on the implementation, the integration issues, as well as on some experiments that compare Classic'cl with its predecessors.
    Discovery Science, 8th International Conference, DS 2005, Singapore, October 8-11, 2005, Proceedings; 01/2005
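    Of the settings mentioned above, learning from interpretations is the easiest to illustrate: an example is an interpretation (a set of facts), and a clause covers it if the implication holds in that set. The sketch below is propositional and uses made-up facts; a real ILP system such as Classic'cl works with first-order clauses and theta-subsumption.

      def satisfies(interpretation, clause):
          """clause = (body, head), both sets of atoms; True if body -> head holds."""
          body, head = clause
          return (not body.issubset(interpretation)) or bool(head & interpretation)

      clause = ({"warm", "sunny"}, {"icecream_sold"})   # warm AND sunny -> icecream_sold

      interp1 = {"warm", "sunny", "icecream_sold"}
      interp2 = {"warm", "sunny"}

      print(satisfies(interp1, clause))   # True: body holds and head holds
      print(satisfies(interp2, clause))   # False: body holds but the head does not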

Publication Stats

298 Citations
31.64 Total Impact Points

Institutions

  • 2012–2013
    • Johannes Gutenberg-Universität Mainz
      • Institute for Computer Science
      Mainz, Rhineland-Palatinate, Germany
  • 2005–2012
    • University of Freiburg
      • Department of Computer Science
      Freiburg, Baden-Württemberg, Germany
  • 2000–2002
    • University of Wales
      • Department of Computer Science
      Cardiff, Wales, United Kingdom