Interactive Discriminative Mining of Chemical Fragments.
Nuno A. Fonseca1, Max Pereira2, Vítor Santos Costa1, and Rui Camacho2
1CRACS-INESC Porto LA, Universidade do Porto,
Rua do Campo Alegre 1021/1055, 4169-007 Porto, Portugal
2LIAAD-INESC Porto LA & DEI-FEUP, Universidade do Porto,
Rua Dr Roberto Frias s/n, 4200-465 Porto, Portugal
Abstract. Structural activity prediction is one of the most important
tasks in chemoinformatics. The goal is to predict a property of interest
given structural data on a set of small compounds or drugs. Ideally,
systems that address this task should not just be accurate, they should
also be able to identify an interpretable discriminative structure which
describes the most discriminant structural elements with respect to some
target.
This paper presents the application of ILP in an interactive software tool for
discriminative mining of chemical fragments. In particular, we describe the
coupling of an ILP system with molecular visualisation software that allows a
chemist to graphically control the search for interesting patterns in chemical
fragments. Furthermore, we show how structural information, such as rings and
functional groups like carboxyl, amine, methyl and ester, is integrated and
exploited in the search.
Keywords: Drug design, graphical mining, efficiency
1 Introduction
Structural activity prediction is one of the most important tasks in chemoinfor-
matics. The goal is to predict a property of interest given structural data on a
set of small compounds or drugs. This task can be seen as an instance of a more
general task, Structure-Activity Relationship (SAR), where one aims at predict-
ing the activity of a compound under certain conditions, given structural data
on the compound. Ideally, systems that address this task should not just be ac-
curate, they should be able to identify an interpretable discriminative structure
which describes the most discriminant structural elements with respect to some
target.
In an invited talk to Computational Logic 2000 and ILP’2000, David Page [1]
highlighted the importance of interactive ILP systems for SAR problems. The
application of Inductive Logic Programming (ILP) in an interactive software
for discriminative mining of chemical fragments is presented in this paper. In
particular, we describe a software application, called iLogCHEM, that allows
a chemist to graphically control the search for interesting patterns in chemical
fragments. iLogCHEM couples an ILP system with molecular visualisation
software, thus leveraging the flexibility of ILP while addressing the SAR task
mentioned above. iLogCHEM can input data from chemical representations,
such as MDL’s SDF file format, and display molecules and matching patterns
using visualisation tools such as VMD [2]. It has been demonstrated [3] that
iLogCHEM can be used to effectively mine large chemoinformatics data sets,
such as the DTP AIDS data set [4].
The focus of this paper is on allowing domain expert users to participate in
the drug discovery process in a number of ways:
1. We propose the ability to incorporate user-provided abstractions of inter-
est to the chemoinformatics domain, that can be used to aid the discovery
process. As a first experiment, we have allowed users to specify a common
chemical structure, aromatic rings. Apart from aromatic rings, iLogCHEM
also makes available functional groups such as carboxyl, amine, ester, methyl
and phenyl. This is supported through a macro mechanism (described in more
detail in Section 4) where the user provides a pattern which is used to control
rule refinement.
2. We propose an interactive refinement process where the user can interact
with the proposed model, adapting it, evaluating it, and using it to guide
(constrain) the search.
3. A common procedure in drug design is to introduce small variations in
well-known molecules. This procedure leads to databases with groups of
molecules that are very similar. When data sets are assembled from these
databases there is a “similarity bias”. To attenuate that effect, iLogCHEM
allows the user to compute the similarity between the data set molecules
and discard the more similar ones, retaining a set of “representative” ones.
This facility is described in more detail in Section 3.
The rest of the paper is organised as follows. Section 2 provides a brief intro-
duction to the SAR problem and the issue of molecular representations. Section 3
introduces iLogCHEM and describes its main components. Section 4 describes
its ability to incorporate user-provided abstractions of interest to the chemoinfor-
matics domain through the use of what we have designated as macros. Section 5
explains the facilities for interactive search and refinement. Finally, conclusions
and future work are described in Section 6.
Structure activity relationships (SAR) describe empirically derived relationships
between a molecule and its activity as a drug. In a typical SAR problem the
goal is to construct a predictive theory relating the structure of a molecule to
its activity given a set of molecules of known structure and activity.
A problem that one has to address is how to describe molecules. Coordinate-
based representations usually operate by generating features from a molecule’s
3D-structure [5]. The number of features of interest can grow very quickly, hence
the problem that these systems need to address is how to select the most inter-
esting features and build a classifier from them. Coordinate-free representations
can use atom pair descriptors or just the atom-bond structure of the molecule.
In the latter case, finding a discriminative component quite often reduces to the
problem of finding a Maximum Common Substructure (MCS).
Exact MCS search in a molecule represented as a set of atoms and bonds
can be seen as a graph-mining task. In this case, a molecule is represented as a
graph G_M = (V, E), where V, the vertices, are atom labels, and E, the edges, are
bonds. The search can be improved by adding atom and bond properties. The
earliest approaches to search for common substructures or fragments were based
on ideas from Inductive Logic Programming (ILP). ILP techniques are very
appealing because they are based on a very expressive representation language,
first order logic, but they have been criticised for exhibiting significant efficiency
problems. As stated by Karwath and De Raedt [6], “their application has been
restricted to finding relatively small fragments in relatively small databases”.
Specialised graph miners have therefore become quite popular. Systems such
as SUBDUE [7] start from the empty graph and then generate refinements
using either beam search or breadth-first search. More recent systems, such as
MoFa [8], gSpan [9], FFSM [10], Gaston [11], FMiner [12] and SMIREP [6],
use depth-first search together with compact and efficient representations, such as
SMILES, for which matching and canonical-form algorithms exist. Arguably,
although such systems differ widely, they all use three main principles: (i) only
refine fragments that appear in the database; (ii) filter duplicates; and (iii)
perform efficient homomorphism testing.
3 The iLogCHEM System
iLogCHEM is an interactive tool for discriminative mining of chemical
fragments. iLogCHEM uses a logic representation of the molecules, where atoms
and bonds are facts stored in a database. Although our representation is less
compact than a specialised representation such as SMILES, used in MOLFEA [13]
and SMIREP [6], it offers a number of important advantages. First, it is
possible to store information both on atoms and on their location: this is useful
for interfacing with external tools. Second, iLogCHEM can take advantage of
the large number of search algorithms implemented in ILP. Third, given that
we implement the basic operations efficiently, we can now take advantage of the
flexibility of our framework to implement structured information.
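As an illustration of the logic representation described above, the following Python sketch turns a small molecule into atom and bond facts. The predicate shapes mirror the atom/3 and atom_bond/6 literals that appear in the example clause of Section 4; the function name to_facts, the atom identifiers a1, a2, ... and the choice to store each bond in both directions are assumptions made for this sketch, not a description of iLogCHEM's actual loader.

```python
from typing import List, Tuple


def to_facts(mol_id: str, atoms: List[str],
             bonds: List[Tuple[int, int, int]]) -> List[str]:
    """Build Prolog-style facts for one molecule.

    atoms[i] is the element symbol of atom i+1;
    bonds are (atom_index_1, atom_index_2, bond_type) triples.
    """
    facts = []
    for idx, element in enumerate(atoms, start=1):
        facts.append(f"atom({mol_id},a{idx},{element.lower()}).")
    for a, b, btype in bonds:
        ea, eb = atoms[a - 1].lower(), atoms[b - 1].lower()
        # Assumption: each bond is stored in both directions so that a
        # pattern can traverse it starting from either atom.
        facts.append(f"atom_bond({mol_id},a{a},a{b},{ea},{eb},{btype}).")
        facts.append(f"atom_bond({mol_id},a{b},a{a},{eb},{ea},{btype}).")
    return facts


# A hypothetical two-atom molecule: a carbon double-bonded to an oxygen.
facts = to_facts("m1", ["C", "O"], [(1, 2, 2)])
for fact in facts:
    print(fact)
```

Keeping one fact per atom and per bond is less compact than a SMILES string, but every atom keeps an identifier that an external tool can refer to.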
The interaction with the system is made through a graphical user interface.
The system requires two input files: one is an SDF file with atom and bond
data on a set of molecules; the other is a file which labels (discriminates) the
compounds. We use SDF because it is highly popular and because it can convey
3D structure information. Other formats, such as SML, can be translated to SDF
through tools such as OpenBabel [14]. Also note that some datasets, such as the
DSSTox [15] collection of datasets with at most 2000 molecules, include 2D and
3D information in the SDF format. Furthermore, the user may choose from 22
1D descriptors, 300 molecular fingerprints and 242 2D descriptors, predefined by
chemists. These descriptors can be analysed with propositional tools, not just
ILP.
The input files (in SDF format) are processed and given as input to a rule
discovery algorithm that is implemented as an extension of an ILP system
(currently Aleph [16]). We significantly improved the ILP search algorithm for this
task, as explained in the next section. The ILP engine allows the introduction of
extra background knowledge for rule discovery. As an example, we take advan-
tage of this flexibility by allowing the user to introduce well-known molecular
structures in the search process. This is supported through a macro mechanism
(described in more detail in Section 4) where the user provides a pattern which
is used to control rule refinement.
The output of the ILP system will be a set of rules, or theory. Most of-
ten, chemists will be interested in looking at individual rules. iLogCHEM first
matches the rules against the database, and then allows the user to navigate
through the list of matches and visualise them. iLogCHEM uses VMD [2] to
display the molecules and the matching substructures.
The key component of iLogCHEM is rule discovery. From a number of ILP
algorithms, we chose to base our work on Progol’s greedy cover algorithm with
the Mode Directed Inverse Entailment (MDIE) algorithm [17], as implemented in
the Progol, April [18], and Aleph [16] systems. We rely on MDIE to achieve
directed search, and we use greedy cover removal as a natural algorithm for
finding interesting patterns. Figure 1 shows an example pattern for the HIV
data set. The pattern is shown as wider atoms and bonds, and it includes a
sulphur atom and part of an aromatic ring.
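The greedy cover loop mentioned above can be sketched in a few lines of Python. This is a simplified illustration only: it replaces the MDIE-directed search of Progol and Aleph with a fixed list of candidate rules, and the names greedy_cover, covers and the toy fragment labels are all assumptions of the sketch.

```python
from typing import Callable, FrozenSet, Iterable, List


def greedy_cover(positives: Iterable[FrozenSet[str]],
                 negatives: Iterable[FrozenSet[str]],
                 candidates: List[str],
                 covers: Callable[[str, FrozenSet[str]], bool]) -> List[str]:
    """Repeatedly pick the best-scoring rule and remove the positives it covers."""
    negatives = list(negatives)
    remaining = set(positives)
    theory = []
    while remaining:
        best, best_score = None, 0
        for rule in candidates:
            pos = sum(1 for e in remaining if covers(rule, e))
            neg = sum(1 for e in negatives if covers(rule, e))
            score = pos - neg  # simple coverage score, assumed for this sketch
            if score > best_score:
                best, best_score = rule, score
        if best is None:  # no rule covers more positives than negatives
            break
        theory.append(best)
        remaining = {e for e in remaining if not covers(best, e)}
    return theory


# Toy data: each molecule is reduced to a set of hypothetical fragment labels.
positives = [frozenset({"s", "ring"}), frozenset({"s"}), frozenset({"n", "ring"})]
negatives = [frozenset({"ring"}), frozenset({"o"})]
theory = greedy_cover(positives, negatives, ["s", "ring", "n", "o"],
                      lambda rule, mol: rule in mol)
print(theory)
```

In an MDIE-based system the candidate rules are not given in advance; they are generated by refining a clause built from a seed example, which is what makes the search directed.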
Fig. 1. HIV pattern (wider atoms and bonds) discovered by ILP.
Molecular Filtering It is common practice in drug design to take a small
molecule that exhibits some activity and introduce small changes to it to
improve its activity. This procedure produces a large set of similar molecules.
Most of the available data for drug design suffer from this “similarity bias”.
Using iLogCHEM, the similarity between any pair of molecules can be computed
and only the “representative” molecules retained, producing an unbiased
data set. iLogCHEM uses the Tanimoto distance to assess the similarity of two
molecules. The user can specify a threshold value to be used in the filtering
procedure.
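A minimal Python sketch of such a filter is given below. It assumes that each molecule is described by a set of fingerprint bits and uses the standard Tanimoto coefficient (size of the intersection divided by size of the union); the greedy selection order, the function names and the example fingerprints are assumptions of the sketch, since the exact procedure used by iLogCHEM is not detailed here.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprint bit sets (1.0 means identical)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def filter_representatives(fingerprints: Dict[str, FrozenSet[int]],
                           threshold: float) -> List[str]:
    """Keep a molecule only if it is less similar than `threshold`
    to every molecule already kept (greedy, in input order)."""
    kept: List[str] = []
    for name, fp in fingerprints.items():
        if all(tanimoto(fp, fingerprints[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical fingerprints: m1 and m2 share three of five distinct bits.
fps = {
    "m1": frozenset({1, 2, 3, 4}),
    "m2": frozenset({1, 2, 3, 5}),
    "m3": frozenset({7, 8}),
}
kept = filter_representatives(fps, threshold=0.5)
print(kept)
```

The corresponding Tanimoto distance is one minus the coefficient, so a similarity threshold of 0.5 is the same as discarding molecules closer than a distance of 0.5 to a retained one.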
Pattern Enumeration iLogCHEM enumerates patterns
(or sub-graphs) contained in an example molecule, the
seed. To do so it uses the LogCHEM algorithm [3], based
on the Aleph ILP system [16], to constrain the search
space. This algorithm keeps a trie with previously gen-
erated clauses, according to a Morgan normal form, and tries to optimise rule
evaluation for the specific domain of chemical compounds.
Pattern Matching Given a new pattern, we are inter-
ested in finding out how many molecules support the pat-
tern. ILP systems rely on refutation for this purpose. How-
ever, this introduces a problem. Consider the clause:
that represents a N = C = N pattern. Figure 2 matches
the molecule A-alpha-C against the pattern. Clearly, there
is no match. Unfortunately, Prolog finds a match by
matching the same nitrogen against the pattern twice. This
problem, known as Object Identity [19], is addressed by dynamically rewriting
the rules so that different variables match different atoms:
Id1 ≠ Id3 ∧ Id2 ≠ Id3
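The effect of the inequality constraints can be reproduced with a small Python sketch. It searches for an N=C=N path in a hypothetical molecule that contains only one nitrogen, first without and then with the requirement that distinct pattern positions bind distinct atoms. The function find_path_matches and the data layout are assumptions of this sketch; it does not reproduce the clause rewriting performed by iLogCHEM.

```python
from typing import Dict, List, Tuple


def find_path_matches(atoms: Dict[int, str],
                      bonds: List[Tuple[int, int, int]],
                      elements: List[str],
                      bond_types: List[int],
                      object_identity: bool) -> List[Tuple[int, ...]]:
    """Match a linear pattern (elements joined by bond_types) against a molecule."""
    adjacency: Dict[int, List[Tuple[int, int]]] = {}
    for a, b, t in bonds:  # bonds can be walked in both directions
        adjacency.setdefault(a, []).append((b, t))
        adjacency.setdefault(b, []).append((a, t))

    def extend(path: List[int]):
        if len(path) == len(elements):
            yield tuple(path)
            return
        step = len(path)
        for nxt, t in adjacency.get(path[-1], []):
            if atoms[nxt] != elements[step] or t != bond_types[step - 1]:
                continue
            if object_identity and nxt in path:
                continue  # distinct variables must bind distinct atoms
            yield from extend(path + [nxt])

    results: List[Tuple[int, ...]] = []
    for start, element in atoms.items():
        if element == elements[0]:
            results.extend(extend([start]))
    return results


# Hypothetical molecule: N(1)=C(2)-C(3), i.e. a single nitrogen.
atoms = {1: "n", 2: "c", 3: "c"}
bonds = [(1, 2, 2), (2, 3, 1)]
naive = find_path_matches(atoms, bonds, ["n", "c", "n"], [2, 2], False)
strict = find_path_matches(atoms, bonds, ["n", "c", "n"], [2, 2], True)
print(naive, strict)
```

Without the constraint the search walks from the nitrogen to the carbon and back to the same nitrogen, reporting a spurious match; with it, no match is found.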
iLogCHEM includes a number of further optimisations. Namely, we rewrite
bond information in such a way as to minimise backtracking. Also, by default,
iLogCHEM compiles every pattern, instead of interpreting them, as usual in
ILP.
4 Integrating Structural Information in the Search
The iLogCHEM system has the ability to integrate complementary information
in the pattern search process. Our work was motivated by two observations.
First, quite often chemists rely on well-known structures that are typically in-
fluential in the chemical properties of compounds. Second, global properties of
the compound may be good indicators of activity.
A first step forward stems from observing Figure 1: does the pattern include part
of the ring because only part of the ring matters or, as is more natural from
the chemist’s point of view, should we believe that the whole ring should be in
the pattern? Quite often discriminative miners will only include part of a ring
because it is sufficient for classification purposes. But this may not be sufficient
to validate the pattern.
The logical representation used in iLogCHEM makes it natural to support
macro structures, such as the rings used in MoFa [8], in a straightforward fashion.
The next example shows such a description:
Initial experiments with iLogCHEM show that using such macros results in
similar accuracy, but returns easier to interpret rules.
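One way to picture what a ring macro buys is the following Python sketch, which extends a partial match so that any ring it touches is included as a whole. This is only an illustration of the idea of treating a ring as a unit: the function complete_rings and the atom numbering are hypothetical, and the actual macro mechanism operates inside rule refinement rather than as a post-processing step.

```python
from typing import Iterable, Set


def complete_rings(matched: Set[int], rings: Iterable[Set[int]]) -> Set[int]:
    """Extend a set of matched atom ids with every ring it partially covers."""
    completed = set(matched)
    for ring in rings:
        if matched & ring:  # the pattern touches at least one ring atom
            completed |= ring
    return completed


# Hypothetical molecule: atom 1 is a sulphur attached to a six-membered
# aromatic ring (atoms 2..7); atoms 10..14 form a second, unrelated ring.
aromatic_ring = {2, 3, 4, 5, 6, 7}
other_ring = {10, 11, 12, 13, 14}
partial_match = {1, 2, 3}  # sulphur plus part of the aromatic ring
full_match = complete_rings(partial_match, [aromatic_ring, other_ring])
print(sorted(full_match))
```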
iLogCHEM has available a library of functional groups that may be used as
macros to speed up search and are very useful to improve understandability of
the models. Some of the functional groups available include: aldehyde, amine,
methyl, ester, ketone, hydroxyl, cyano, carboxylic acid, etc.
One of the major benefits of ILP for SAR is its ability to combine very diverse
sources of information. iLogCHEM allows the user to select chemical properties
of interest for a compound, and combine them with pattern generation. Proper-
ties of interest are obtained through the graphical interface, and then passed on
to the miner. In iLogCHEM the user may choose from a wide set of 1D molecular
descriptors. As an example, consider this clause for the CPDBAS dataset:
logp(A,B), B =< -0.73333657,
atom(A,C,c), atom_bond(A,C,D,c,c,4), atom_bond(A,D,E,c,o,4).
The constant −0.73333657 is obtained from a seed example by saturation.
Experience with the NCTRER dataset shows that this feature can be quite useful in
complementing graph search.
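To make the reading of such a clause concrete, the Python sketch below checks the same conditions against one molecule: the logP descriptor must not exceed the threshold, and there must be a carbon C bonded to a carbon D, which is in turn bonded to an oxygen E, both with bond type 4. The function rule_covers and the tuple layout (the atom_bond arguments without the molecule identifier) are assumptions of this sketch.

```python
from typing import Set, Tuple

LOGP_THRESHOLD = -0.73333657  # constant taken from the clause in the text

# (id1, id2, element1, element2, bond_type)
Bond = Tuple[int, int, str, str, int]


def rule_covers(logp: float, bond_facts: Set[Bond]) -> bool:
    """True if the molecule satisfies both the descriptor and the structural test."""
    if logp > LOGP_THRESHOLD:
        return False
    for (_c, d, e1, e2, t1) in bond_facts:
        if (e1, e2, t1) != ("c", "c", 4):
            continue
        # Look for a second bond that starts at the same atom D and ends in an oxygen.
        for (d2, _e, e3, e4, t2) in bond_facts:
            if d2 == d and (e3, e4, t2) == ("c", "o", 4):
                return True
    return False


# Hypothetical molecule fragment: C(1)-C(2)-O(3), both bonds of type 4.
bond_facts = {(1, 2, "c", "c", 4), (2, 3, "c", "o", 4)}
print(rule_covers(-1.2, bond_facts))  # descriptor and structure both satisfied
print(rule_covers(0.5, bond_facts))   # logP above the threshold
```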
5 Interactive Search and Refinement
After choosing the data set of molecules and filtering it using the Tanimoto
distance, the user may launch the ILP system and obtain a first model. The user may
then choose to visualise each rule of the model and overlay a rule (pattern) on
the structure of a molecule covered by that rule (as shown in Figure 1).
Once a model is constructed there are two possible interactions the user can
take. The user may decide to do a “local and manual” search, or may specify
constraints on the visualised pattern and ask the ILP system to produce a new
model.
In the first case the user may incrementally produce changes in the pattern
(adding or deleting atoms and/or bonds) and then ask iLogCHEM to immediately
evaluate the changed pattern. Whenever an evaluation is done the user can see
a list of the “positive and negative” molecules covered.
If “local and manual” search does not produce the desired results the user
may interactively (again adding/removing atoms and bonds) define a new pat-
tern. This new pattern can be converted into a clause and used as the starting
clause of the search space. That is, the user instructs the system to find
useful refinements of the provided pattern.
6 Conclusions
This paper reports on iLogCHEM, an interactive tool to be used in interactive
drug design tasks. iLogCHEM is designed to give users who have little knowledge
of, or interest in, ILP the benefits of this learning mechanism. Thus, it can be
seen as a step forward in “enhancing human–computer interaction to make ILP
systems true collaborators with human experts” [20].
iLogCHEM is founded on previous work to create an effective rule miner for
ILP [3]. The system was driven by expert requirements, extending the previous
work by introducing i) a library of preexisting common patterns,
considered relevant by experts, that are immediately available for discovery; and
ii) the ability to define a new pattern graphically and then translate it to the
iLogCHEM internal representation. These new facilities enable the expert to: i)
look at the pattern highlighted on the molecule structure; ii) interact with the
visualisation tool and specify constraints not satisfied by the pattern presented;
and iii) rerun the ILP system with the specified constraints added to the data
set. These steps are the centre of the main loop of the interaction where the
expert guides the process of pattern discovery. Additionally the tool also allows
the expert user to specify a list of chemical structures (rings and functional
groups) that are used as macro operators. The use of chemical structures may
be very useful to achieve more compact and comprehensible models than the
ones described with atoms and bonds.
Acknowledgments
This work has been partially supported by the projects ILP-Web-Service
(PTDC/EIA/70841/2006) and HORUS (PTDC/EIA-EIA/100897/2008), and by the
Fundação para a Ciência e Tecnologia. Max Pereira is funded by FCT grant SFRH/-
References
1. David Page. ILP: Just do it. In J. Cussens and A. Frisch, editors, Proceedings of
the 10th International Conference on, volume 1866 of LNAI, pages 3–18. Springer-
2. William Humphrey, Andrew Dalke, and Klaus Schulten. VMD – Visual Molecular
Dynamics. Journal of Molecular Graphics, 14:33–38, 1996.
3. Vitor Santos Costa, Nuno A. Fonseca, and Rui Camacho. LogCHEM: Interactive
Discriminative Mining of Chemical Structure. In Proceedings of 2008 IEEE In-
ternational Conference on Bioinformatics and Biomedicine (BIBM 2008), pages
421–426, Philadelphia, USA, November 2008. IEEE Computer Society.
4. J. M. Collins. The DTP AIDS antiviral screen program, 1999.
5. G. M. Maggiora, V. Shanmugasundaram, M. J. Lajiness, T. N. Doman, and M. W.
Schultz. A practical strategy for directed compound acquisition, pages 315–332.
6. A. Karwath and Luc De Raedt. Predictive graph mining. In Discovery Science,
7th International Conference, (DS 2004), Italy, volume 3245 of LNCS, pages 1–15.
7. Ravindra N. Chittimoori, Lawrence B. Holder, and Diane J. Cook. Applying the
subdue substructure discovery system to the chemical toxicity domain. In Am-
ruth N. Kumar and Ingrid Russell, editors, Proceedings of the Twelfth Interna-
tional Florida Artificial Intelligence Research Society Conference, May 1-5, 1999,
Orlando, Florida, USA, pages 90–94. AAAI Press, 1999.
8. C. Borgelt and M. R. Berthold. Mining molecular fragments: Finding relevant sub-
structures of molecules. In Proceedings of the 2002 IEEE International Conference
on Data Mining (ICDM 2002), Japan, pages 51–58, 2002.
9. Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In
Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM
2002), 9-12 December 2002, Maebashi City, Japan, pages 721–724, 2002.
10. Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in the
presence of isomorphism. In Proceedings of the 3rd IEEE International Conference
on Data Mining (ICDM 2003), 19-22 December 2003, Melbourne, Florida, USA,
pages 549–552. IEEE Computer Society, 2003.
11. Siegfried Nijssen and Joost N. Kok. Frequent graph mining and its application
to molecular databases. In Proceedings of the IEEE International Conference on
Systems, Man & Cybernetics: The Hague, Netherlands, 10-13 October 2004, pages
4571–4577. IEEE, 2004.
12. Andreas Maunz, Christoph Helma, and Stefan Kramer. Large-scale graph mining
using backbone refinement classes. In KDD, pages 617–626, 2009.
13. Stefan Kramer, Luc De Raedt, and Christoph Helma. Molecular feature mining in
HIV data. In KDD, pages 136–143, NY, USA, 2001.
14. R. Guha, M. T. Howard, G. R. Hutchison, P. Murray-Rust, H. Rzepa, C. Stein-
beck, J. K. Wegner, and E. L. Willighagen. The Blue Obelisk–Interoperability in
Chemical Informatics. Journal of Chemical Information and Modeling, 46:991–998,
15. A.M. Richard and C.R. Williams. Distributed structure-searchable toxicity
(DSSTox) public database network: a proposal. Mutation Research/Fundamental
and Molecular Mechanisms of Mutagenesis, 499:27–52(26), 2002.
16. Ashwin Srinivasan. The Aleph Manual. University of Oxford, 2004. Available at
17. S. Muggleton. Inverse entailment and Progol. New Generation Computing, Special
issue on Inductive Logic Programming, 13(3-4):245–286, 1995.
18. Nuno A. Fonseca, Fernando Silva, and Rui Camacho. April - An Inductive Logic
Programming System. In Proceedings of the 10th European Conference on Logics in
Artificial Intelligence (JELIA06), volume 4160 of LNAI, pages 481–484, Liverpool,
September 2006. Springer-Verlag.
19. Francesca A. Lisi, Stefano Ferilli, and Nicola Fanizzi. Object identity as search
bias for pattern spaces. In Frank van Harmelen, editor, Proceedings of the 15th
European Conference on Artificial Intelligence, ECAI’2002, Lyon, France, July
2002, pages 375–379. IOS Press, 2002.
20. David Page and Ashwin Srinivasan. ILP: A short look back and a longer look
forward. Journal of Machine Learning Research, 4:415–430, 2003.