About
150
Publications
49,607
Reads
3,314
Citations
Introduction
Current institution
Additional affiliations
March 2013 - present
Publications (150)
Mining frequent patterns in a single network (graph) poses a number of challenges. Merely matching a single path pattern to a network (up to subgraph isomorphism) is already NP-complete. Existing matching algorithms become intractable even for reasonably small patterns on networks that are large or have a high average degree. Based on recent advances...
Many machine learning algorithms are based on the assumption that training
examples are drawn independently. However, this assumption does not hold
anymore when learning from a networked sample where two or more training
examples may share common features. We propose an efficient weighting method
for learning from networked examples and show the sa...
Trypsin is the workhorse protease in mass spectrometry based proteomics experiments and is used to digest proteins into more readily analyzable peptides. To identify these peptides after mass spectrometric analysis, the actual digestion has to be mimicked as faithfully as possible in silico. In this paper we introduce CP-DT (Cleavage Prediction wit...
Maximum common substructures (MCS) have received a lot of attention in the chemoinformatics community. They are typically used as a similarity measure between molecules, showing high predictive performance when used in classification tasks, while being easily explainable substructures. In the present work, we applied the Pairwise Maximum Common Sub...
Significance
Systems biology involves the development of large computational models of biological systems. The radical improvement of systems biology models will necessarily involve the automation of model improvement cycles. We present here a general approach to automating systems biology model improvement. Humans are eukaryotic organisms, and the...
Counting the number of times a pattern occurs in a database is a fundamental data mining problem. It is a subroutine in a diverse set of tasks ranging from pattern mining to supervised learning and probabilistic model learning. While a pattern and a database can take many forms, this paper focuses on the case where both the pattern and the database...
Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards un...
We provide a systematic approach to deal with the following problem. Let
$X_1,\ldots,X_n$ be, possibly dependent, $[0,1]$-valued random variables. What
is a sharp upper bound on the probability that their sum is significantly
larger than their mean? In the case of independent random variables, a
fundamental tool for bounding such probabilities is d...
This article introduces a new type of structural fragment called a geometrical pattern. Such geometrical patterns are defined as molecular graphs that include a labelling of atoms together with constraints on interatomic distances. The discovery of geometrical patterns in a chemical dataset relies on the induction of multiple decision trees combine...
We provide a lower bound on the probability that a binomial random variable exceeds its mean. Our proof employs estimates on the mean absolute deviation and the tail conditional expectation of binomial random variables.
With the current expanded technical capabilities to perform mass spectrometry-based biomedical proteomics experiments, an improved focus on the design of experiments is crucial. As it is clear that ignoring the importance of a good design leads to an unprecedented rate of false discoveries which would poison our results, more and more tools are dev...
Let $Y_v, v\in V,$ be $[0,1]$-valued random variables having a dependency
graph $G=(V,E)$. We show that \[ \mathbb{E}\left[\prod_{v\in V} Y_{v} \right]
\leq \prod_{v\in V} \left\{ \mathbb{E}\left[Y_v^{\frac{\chi_b}{b}}\right]
\right\}^{\frac{b}{\chi_b}}, \] where $\chi_b$ is the $b$-fold chromatic number
of $G$. This inequality may be seen as a dep...
Kernels for structured data have gained a lot of attention in a world with an ever increasing amount of complex data, generated from domains such as biology, chemistry, or engineering. However, while many applications involve spatial aspects, up to now only few kernel methods have been designed to take 3D information into account. We introduce a no...
We consider the naive bottom-up concatenation scheme for a context-free language and show that this scheme has the incremental polynomial time property. This means that all members of the language can be enumerated without duplicates so that the time between two consecutive outputs is bounded by a polynomial in the number of strings already generat...
Statistical relational learning (SRL) is concerned with developing formalisms for representing and learning from data that exhibit both uncertainty and complex, relational structure. Most of the work in SRL has focused on modeling and learning from data that only contain discrete variables. As many important problems are characterized by the presen...
We show that the so-called Bernstein-Hoeffding method can be applied to a
larger class of generalized moments. This class contains the exponential
moments whose properties play a key role in the proof of a well-known
inequality of Wassily Hoeffding, regarding sums of independent and bounded
random variables whose mean is assumed to be known. As...
Subject Areas: biotechnology, computational biology, synthetic biology. There is an urgent need to make drug discovery cheaper and faster. This will enable the development of treatments for diseases currently neglected for economic reasons, such as tropical and orphan diseases, and generally increase the supply of new drugs. Here, we report the Robo...
Background
A key challenge in the field of HIV-1 protein evolution is the identification of coevolving amino acids at the molecular level. In the past decades, many sequence-based methods have been designed to detect position-specific coevolution within and between different proteins. However, an ensemble coevolution system that integrates differen...
Causal polytrees are singly connected causal models and they are frequently applied in practice. However, in various applications, many variables remain unobserved and causal polytrees cannot be applied without explicitly including unobserved variables. Our study thus proposes the ancestral polytree model, a novel combination of ancestral grap...
Many machine learning algorithms are based on the assumption that training
examples are drawn identically and independently. However, this assumption does
not hold anymore when learning from a networked sample because two or more
training examples may share some common objects, and hence share the features
of these shared objects. We first show tha...
Machine learning is a subdiscipline within artificial intelligence that focuses on algorithms that allow computers to learn to solve a (complex) problem from existing data. This ability can be used to generate a solution to a particularly intractable problem, given that enough data are available to train and subsequently evaluate an algorithm on. Si...
Metrics for structured data have received an increasing interest in the machine learning community. Graphs provide a natural representation for structured data, but a lot of operations on graphs are computationally intractable. In this article, we present a polynomial-time algorithm that computes a maximum common subgraph of two outerplanar graphs....
Graph support measures are functions measuring how frequently a given subgraph pattern occurs in a given database graph. An important class of support measures relies on overlap graphs. A major advantage of overlap-graph based approaches is that they combine anti-monotonicity with counting the occurrences of a subgraph pattern which are independent...
Presentation available here: http://sfci2013.loria.fr/wp-content/uploads/2013/10/SFCi13-Ve_11_09h00-Comparing_chemical_fingerprints_for_ecotoxicology.pdf
This paper explores the use of predicate logic as a modeling language. Using
IDP3, a finite model generator that supports first order logic enriched with
types, inductive definitions, aggregates and partial functions, search problems
stated in a variant of predicate logic are solved. This variant is introduced
and applied on a range of problems ste...
We present PIUS, a tool that identifies peptides from tandem mass spectrometry data by analyzing the six-frame translation of a complete genome. It differs from earlier studies that have performed such a genomic search in two ways: (i) it considers a larger search space and (ii) it is designed for natural peptide identification rather than proteomi...
Monte Carlo tree search (MCTS) is a sampling and simulation based technique for searching in large search spaces containing both decision nodes and probabilistic events. This technique has recently become popular due to its successful application to games, e.g. Poker Van den Broeck et al. (2009) and Go Coulom (2006); Chaslot et al. (2006); Gelly an...
Graph support measures are functions measuring how frequently a given subgraph pattern occurs in a given database graph. An important class of support measures relies on overlap graphs. A major advantage of the overlap graph based approaches is that they combine anti-monotonicity with counting occurrences of a pattern which are independent accordin...
This paper reports on the use of the FO(·) language and the IDP framework for modeling and solving some machine learning and data mining tasks. The core component of a model in the IDP framework is an FO(·) theory consisting of formulas in first order logic and definitions; the latter are basically logic programs where clause bodies can have arbitr...
Development of acute kidney injury (AKI) during the postoperative period is associated with increases in both morbidity and mortality. The aim of this study is to develop a statistical model capable of predicting the occurrence of AKI in patients after elective cardiac surgery.
Probabilistic logical models have proven to be very successful at modelling uncertain, complex relational data. Most current models and implementations focus on modelling domains that only have discrete variables. Yet many real-world problems are hybrid and have both discrete and continuous variables. In this paper we focus on the Logical Bayesia...
In graph mining, a frequency measure for graphs is anti-monotonic if the frequency of a pattern never exceeds the frequency of a subpattern. The efficiency and correctness of most graph pattern miners relies critically on this property. We study the case where frequent subgraphs have to be found in one graph. Vanetik, Gudes and Shimony already gave...
The intensive care unit (ICU) length of stay (LOS) of patients undergoing cardiac surgery may vary considerably, and is often difficult to predict within the first hours after admission. The early clinical evolution of a cardiac surgery patient might be predictive for his LOS. The purpose of the present study was to develop a predictive model for I...
Inductive logic programming, or relational learning, is a powerful paradigm
for machine learning or data mining. However, in order for ILP to become
practically useful, the efficiency of ILP systems must improve substantially.
To this end, the notion of a query pack is introduced: it structures sets of
similar queries. Furthermore, a mechanism is d...
The standard approach to feature construction and predictive learning in molecular datasets is to employ computationally expensive
graph mining techniques and to bias the feature search exploration using frequency or correlation measures. These features
are then typically employed in predictive models that can be constructed using, for example, SVM...
It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop,
in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several
key challenges in achieving this inte...
The frequent connected subgraph mining problem, i.e., the problem of listing all connected graphs that are subgraph isomorphic to at least a certain number of transaction graphs of a database, cannot be solved in output polynomial time in the general case. If, however, the transaction graphs are restricted to forests then the problem becomes tracta...
This work studies the impact of using dynamic information as features in a machine learning algorithm for the prediction task of classifying critically ill patients in two classes according to the time they need to reach a stable state after coronary bypass surgery: less or more than 9 h. On the basis of five physiological variables (heart rate, sy...
Probability trees are decision trees that predict class probabilities rather than the most likely class. The pruning criterion used to learn a probability tree strongly influences the size of the tree and thereby also the quality of its probability estimates. While the effect of pruning criteria on classification accuracy is well-studied, only rece...
We investigate the use of Monte-Carlo Tree Search (MCTS) within the field of computer Poker, more specifically No-Limit Texas Hold’em. The hidden information in Poker results in so called miximax game trees where opponent decision nodes have to be modeled as chance nodes. The probability distribution in these nodes is modeled by an opponent model t...
Experimental results often present a substantial fraction of missing and censored values. Here we propose a strategy to perform principal component analysis under this specific incomplete information hypothesis. This allows the reconstruction of the missing information in a way consistent with the experimental observations.
Computerization in healthcare in general, and in the operating room (OR) and intensive care unit (ICU) in particular, is on the rise. This leads to large patient databases, with specific properties. Machine learning techniques are able to examine and to extract knowledge from large databases in an automatic way. Although the number of potential app...
Many pattern recognition and machine learning approaches employ a distance metric on patterns, or a generality relation to partially order the patterns. We investigate the relationship amongst them and prove a theorem that shows how a distance metric can be derived from a partial order (and a corresponding size on patterns) under mild condition...
This work studies the impact of using dynamic information as features in a machine learning algorithm for the prediction task of classifying critically ill patients in two classes according to the time they need to reach a stable state after coronary bypass surgery: less or more than nine hours. On the basis of five physiological variables differen...
In this paper we investigate the evolutionary dynamics of strategic behavior in the game of poker by means of data gathered from a large number of real world poker games. We perform this study from an evolutionary game theoretic perspective using two Replicator Dynamics models. First we consider the basic selection model on this data, secondly we u...
Algorithms that list graphs such that no two listed graphs are isomorphic, are important building blocks of systems for mining and learning in graphs. Algorithms are already known that solve this problem efficiently for many classes of graphs of restricted topology, such as trees. In this article we introduce the concept of a dense augmentation sch...
Bioinformatics is an application domain where information is naturally represented in terms of relations between heterogeneous objects. Modern experimentation and data acquisition techniques allow the study of complex interactions in biological systems. This raises interesting challenges for machine learning and data mining researchers, as the amoun...
Relational reinforcement learning (RRL) has emerged in the machine learning community as a new promising subfield of reinforcement learning (RL) (e.g. [1]). It upgrades RL techniques by using relational representations for states, actions and learned value-functions or policies to allow more natural representations and abstractions of complex tasks...
In recent years, there has been a growing interest in using rich representations such as relational languages for reinforcement learning. However, while expressive languages have many advantages in terms of generalization and reasoning, extending existing approaches to such a relational setting is a non-trivial problem. For a relational reinforce...
In graph mining, a frequency measure is anti-monotonic if the frequency of a pattern never exceeds the frequency of a subpattern. The efficiency and correctness of most graph pattern miners relies critically on this property. We study the case where the dataset is a single graph. Vanetik, Gudes and Shimony already gave sufficient and necessary cond...
In machine learning, there has been an increased interest in metrics on structured data. The application we focus on is drug
discovery. Although graphs have become very popular for the representation of molecules, a lot of operations on graphs are
NP-complete. Representing the molecules as outerplanar graphs, a subclass within general graphs, and u...
An important task in many scientific and engineering disciplines is to set up experiments with the goal of finding the best
instances (substances, compositions, designs) as evaluated on an unknown target function using limited resources. We study
this problem using machine learning principles, and introduce the novel task of active k-optimization....
The frequent connected subgraph mining problem, i.e., the problem of listing all connected graphs that are subgraph isomorphic to at least a certain number of transaction graphs of a database, cannot be solved in output polynomial time in the general case. If, however, the transaction graphs are restricted to forests then the problem becomes tracta...
Recently, there has been an increasing interest in directed probabilistic logical models and a variety of formalisms for describing
such models has been proposed. Although many authors provide high-level arguments to show that in principle models in their
formalism can be learned from data, most of the proposed learning algorithms have not yet been...
State representation for intelligent agents is a continuous challenge as the need for abstraction is unavoidable in large state spaces. Predictive representations offer one way to obtain state abstraction by replacing a state with a set of predictions about future interactions with the world. One such formalism is the Temporal-Difference Networ...
We propose an opponent modeling approach for no-limit Texas hold-em poker that starts from a (learned) prior, i.e., general expectations about opponent behavior, and learns a relational regression tree-function that adapts these priors to specific opponents. An important asset is that this approach can learn from incomplete information (i.e. wi...
In this paper we investigate the evolutionary dynamics of strategic behaviour in the game of poker by means of data gathered from a large number of real-world poker games. We perform this study from an evolutionary game theoretic perspective using the Replicator Dynamics model. We investigate the dynamic properties by studying how players switch be...
In graph mining, a frequency measure is anti-monotonic if the frequency of a pattern never exceeds the frequency of a subpattern. The efficiency and correctness of most graph pattern miners relies critically on this property. We study the case where the dataset is a single graph. Vanetik, Gudes and Shimony already gave sufficient and necessary cond...
We study the task of approximating the k best instances with regard to a function using a limited number of evaluations. We also apply an active learning algorithm based on Gaussian processes to the problem, and evaluate it on a challenging set of structure-activity relationship prediction tasks.
Recently, there has been an increasing interest in directed probabilistic logical models and a variety of languages for describing
such models has been proposed. Although many authors provide high-level arguments to show that in principle models in their
language can be learned from data, most of the proposed learning algorithms have not yet been s...
In this paper we investigate the relation between transfer learning in reinforcement learning with function approximation and supervised learning with concept drift. We present a new incremental relational regression tree algorithm that is capable of dealing with concept drift through tree restructuring and show that it enables a reinforcement...
We discuss how to learn non-recursive directed probabilistic logical models from relational data. This problem has been tackled
before by upgrading the structure-search algorithm initially proposed for Bayesian networks. In this paper we propose to upgrade
another algorithm, namely ordering-search, since for Bayesian networks this was found to work...
In this paper we describe the application of data mining methods for predicting the evolution of patients in an intensive care unit. We discuss the importance of such methods for health care and other application domains of engineering. We argue that this problem is an important but challenging one for the current state of the art data mining metho...
There is an increasing interest in upgrading Bayesian networks to the relational case, resulting in directed probabilistic logical models. Many formalisms to describe such models have been introduced and learning algorithms have been developed for several such formalisms. Most of these algorithms are upgrades of the traditional structure
search alg...
In recent years, there has been a growing interest in using rich representations such as relational languages for reinforcement learning. However, while expressive languages have many advantages in terms of generalization and reasoning, extending existing approaches to such a relational setting is a non-trivial problem. In this paper, we present...
We consider the problem of policy learning in a Markov Decision Process (MDP). An MDP consists of a state space $S$, a set of actions $A$, a transition probability function $t(s,a,s')$ and a reward function $R : S \to \mathbb{R}$. The problem is to find a policy, a
Graphs are mathematical structures that are capable of representing relational data. In the chemoinformatics context, they have become very popular for the representation of molecules. However, a lot of operations on graphs are NP-complete, so no efficient algorithms that can handle these structures exist. In this paper we focus on outerplanar...
An essential component of many graph mining algorithms is a procedure for enumerating graphs such that no two enumerated graphs are isomorphic. All frequent subgraph miners require such a component [14, 5, 1, 6], but also other
We propose a novel machine learning algorithm for learning mutation pathways of viruses from a population of viral DNA strands. More specifically, given a number of sequences, the algorithm constructs a phylogenetic tree that expresses the ancestry of the sequences, and at the same time builds a model describing dependencies between mutations th...
Reinforcement learning is a well-suited approach for many decision-making problems. Lots of interesting domains are, however, not solvable in practice by this approach due to their size: traditional reinforcement learning algorithms need to store every combination of state and action that was encountered. A common method for dealing with large stat...
In relational learning, predictions for an individual are based not only on its own properties but also on the properties of a set of related individuals. Many systems use aggregates to summarize this set. Features thus introduced compare the result of an aggregate function to a threshold. We consider the case where the set to be aggregated is ge...
Model trees are a special case of regression trees in which linear regression models are predicted in the leaves. Little attention has been paid to model trees in relational learning, mainly because the task of learning linear regression equations in this context involves dealing with non-determinacy of predictive attributes. Whereas existing...
In recent years there has been an increased interest in frequent pattern discovery in large databases of graph structured objects. While the frequent connected subgraph mining problem for tree datasets can be solved in incremental polynomial time, it becomes intractable for arbitrary graph databases. Existing approaches have therefore resorted to v...
In this paper we describe an application of data mining methods for different prediction tasks in an intensive care unit. Some of the challenging aspects of performing data mining in this domain are highlighted. The applied methods result in models with good performances within medical standards that can be valuable in assisting medical decision...
In this paper we present the use of Gaussian Processes for regression in the application of prediction in Intensive Care. We propose a preliminary solution to predicting the evolution of a patient's state during his stay in intensive care by means of defined patient specific characteristics.
A representation of the search space in optical pulse shaping problems employing an acousto-optic programmable dispersive filter (AOPDF) is presented for use in closed-loop learning experiments where the optimal spectral phase function to some control problem is determined by an iterative learning algorithm. The representation allows the algorithm...
This paper offers an approach to the problem of large state spaces for reinforcement learning by constructing a state-action pair aggregation (treating similar state-action pairs as if they were the same) with the use of domain knowledge. Arbitrary aggregation is known to give possibly very large errors. In this paper it is shown how, by using...
In this paper we describe an interesting application of temporal data mining, predicting the evolution of critically ill patients. We point out several issues which make this application particularly challenging. We outline our work in progress and discuss directions for further work.
In this paper a method to learn a single interpretable model from a relational ensemble is presented. The new model is obtained by artificially generating partial data examples using the distributions implicit in the ensemble and by building a new relational model from this artificial data.
Relational reinforcement learning is a Q-learning technique for relational state-action spaces. It aims to enable agents to
learn how to act in an environment that has no natural representation as a tuple of constants. In this case, the learning
algorithm used to approximate the mapping between state-action pairs and their so called Q(uality)-value...
Probability trees (or Probability Estimation Trees, PET’s) are decision trees with probability distributions in the leaves. Several alternative approaches for learning probability trees have been proposed but no thorough comparison of these approaches exists.
In this paper we experimentally compare the main approaches using the relational decision...
Logical Bayesian Networks (LBNs) have recently been introduced as another language for knowledge based model construction of Bayesian networks, besides existing languages such as Probabilistic Relational Models (PRMs) and Bayesian Logic Programs (BLPs). The original description of LBNs introduces them as a variant of BLPs and discusses the differen...
In this paper we study Relational Reinforcement Learning in a
multi-agent setting. There is growing evidence in the Reinforcement Learning
research community that a relational representation of the state space has
many benefits over a propositional one. Complex tasks such as planning or information
retrieval on the web can be represented more naturally i...