Predicting the points of interaction of small molecules in the NF-κB pathway.
ABSTRACT The similarity property principle has been used extensively in drug discovery to identify small compounds that interact with specific drug targets. Here we show it can be applied to identify the interactions of small molecules within the NF-κB signalling pathway.
Clusters that contain compounds with a predominant interaction within the pathway were created, which were then used to predict the interaction of compounds not included in the clustering analysis.
The technique successfully predicted the points of interactions of compounds that are known to interact with the NF-κB pathway. The method was also shown to be successful when compounds for which the interaction points were unknown were included in the clustering analysis.
METHODOLOGY ARTICLE Open Access
Yogendra Patel1, Catherine A Heyward2, Michael RH White2,3*, Douglas B Kell1*
Background
A major challenge of systems biology is to use computational modelling to predict new targets for chemical intervention. Systems biology involves the quantitative analysis and integration of individual components of a biological system leading to a better understanding of the dynamic and regulatory properties of the system [1-3]. Chemical biology, on the other hand, involves the screening of a set of chemical entities to determine their effects on the function of a system. The combination of these approaches can allow a better understanding of the system network through the identification of new cellular reactions at which new chemical entities perturb the system [4-6]. Figure 1 outlines the methodology involved.
One of the most studied cellular signalling systems is the Nuclear Factor κB (NF-κB) network. The NF-κB family of transcription factors controls the transcription of at least 300 genes, but has different transcriptional and cell fate outcomes in different cells and in response to different stimuli. As well as being a critical component of the innate immune response, NF-κB controls cell division and apoptosis in most cell types. While the NF-κB signalling pathway has been studied in many papers (nearly 30,000 are returned by a PubMed search for “Nuclear Factor kappa B”), there is still a great deal about the system which is not understood. Recently, NF-κB proteins have been shown to oscillate between the cytoplasm and nucleus of stimulated cells [7] and the frequency of these oscillations has been suggested to alter the pattern of gene expression [8]. The discovery of the importance of these dynamic processes requires a re-interpretation of the previous literature.
NF-κB has been a much studied drug target in the pharmaceutical industry. Numerous traditional medicines have been shown to contain compounds that affect NF-κB activity. Many of these are now being investigated for pharmaceutical development, for example gambogic acid [9], caffeic acid phenethyl ester [10] and green tea polyphenols (reviewed by Khan and Mukhtar [11]). In addition, NF-κB antisense oligonucleotides have recently been shown to affect outcome in a murine endotoxic shock model [12] and NF-κB decoy oligonucleotides are of interest as potential therapy for inflammatory diseases (for review see [13]). The effects of NF-κB modulating drugs have been measured mostly using assays for NF-κB function that have been limited to easily available endpoints such as IκB degradation or DNA binding. As a result the interpretation of the site of action of these compounds may require re-analysis.
The combination of the limited characterisation of the site of action, as well as the limited understanding of the NF-κB network, has meant that it has been difficult to interpret and compare the action of different NF-κB inhibitors. Here we use chemoinformatic approaches to cluster a set of known NF-κB modulatory compounds.
* Correspondence: mike.white@manchester.ac.uk; dbk@manchester.ac.uk
1Manchester Interdisciplinary Biocentre, University of Manchester,
Manchester, 131 Princess Street, M1 7DN, UK
2Institute of Integrative Biology, University of Liverpool, Liverpool,
L69 7ZB, UK
Full list of author information is available at the end of the article
Patel et al. BMC Systems Biology 2011, 5:32
http://www.biomedcentral.com/1752-0509/5/32
© 2011 Patel et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
The methodology is based on the similar property principle [14] (structurally similar compounds have similar properties), although it must be noted that there are flaws with the principle. The main flaw is that small structural changes can lead to a dramatic change in property (e.g. changing a hydrogen bond donor for an acceptor can greatly increase activity against a target), which has a major impact in studying quantitative structure activity/property relationships [15]. In this study we use it as a general rule rather than a specific rule. In addition to identifying relationships between clusters of compounds and their biological functions, clusters were also used to identify the points of interaction of compounds (which are known to interact with the NF-κB pathway) not used in the clustering analysis.
The compounds were obtained from a literature search, which in many instances involved manually searching for chemical structures using chemical names present in the literature. Structures for the compounds can be found in Additional File 1 (chiral information has been included where known; however only 2D information was used in the work presented here). Since the creation of this list, advances in text mining mean it is now possible to automatically extract names of compounds from the literature and associate the names with structures, for example using Pipeline Pilot's ChemMining Collection [16] or OSCAR3 [17]. The resulting list of compounds from the literature search could look like the diverse set collated here. Such lists obtained for a cellular pathway could be used (as here) to identify compounds which interact in a similar manner in a given pathway. A point to note is that here, this technique has not been used to identify novel compounds that interact within the pathway, but rather to identify the point of interaction of compounds which are known to affect the pathway. As an additional aim of this work, we have used all the compounds obtained from the literature search in this analysis, including those for which no specific point of interaction in the NF-κB pathway has been suggested, in order to investigate if this step (or a similar first step in the drug discovery process) could also be automated.
A major issue in analysing such a diverse system is pooling together all the information available. The available literature for the NF-κB signalling system is an extremely underused resource. This is primarily due to the complexity of comparing data generated using different cell types and stimuli, and the changes in data quality and methodologies over time. Another issue is that the reported effect of a compound is not necessarily indicative of the actual interaction of a compound. For example, if a compound is found to inhibit DNA binding in an electrophoretic mobility shift assay using nuclear extracts, it is possible that the actual interaction of this compound could be anywhere upstream of this process [18], as indicated in the schematic diagram of the NF-κB pathway (Figure 2). In this paper we have assumed that the interactions stated in the literature are correct, but if, for example, a compound is reported as interacting at point 3 in Figure 2, it is possible that its actual effect occurs at any point from 1 to 3. In addition, it is entirely possible that molecular interactions may occur at multiple points in the pathway (for instance, this might be the case with molecules with a reactive oxygen species (ROS) interaction [19]).
Results
Method
A set of 460 small molecules that interact with the NF-κB pathway was obtained from the literature. This involved an extensive literature search, with additional searches for chemical structures, as many of the biological references refer to compounds that interact with the NF-κB pathway only by name (not necessarily the IUPAC name). Structures, SMILES and references for each compound can be found in Additional File 1. Chiral information has been included where known, but these data were not available for all the compounds (which is also the reason why only topological descriptors have been used). The compounds in the set vary greatly in terms of size and functional groups present. Figure 3 shows the distribution of the compounds in a representation of chemical space.
Figure 1 Methodology of using Chemoinformatics in Systems Biology. Outline of the methodology involved in applying chemoinformatics to a systems biology problem. Boxes shown in the figure:
Collate library of compounds affecting the pathway and assign point of interaction
Chemoinformatic analysis to cluster compounds according to chemical structure and mechanism of action (and use of other chemoinformatic techniques to identify novel pathway interfering compounds in subsequent iterations)
Investigate example compounds from clusters to identify their point of interaction
Plan related chemical structures by chemoinformatics
Synthesise new related compounds and test for their point of interaction
For 297 compounds the type of interaction within the pathway was also taken from the literature. The interactions were defined as: interacting via a ROS mechanism (this can be at any point in the pathway); inhibiting IKK (at point 1 in Figure 2); inhibiting degradation/phosphorylation of IκB (point 2 in Figure 2); increasing degradation/phosphorylation of IκB (point 2); inhibiting translocation (point 3); and interfering with DNA binding (point 4). Four compounds had more than one interaction (see Additional File 1 for structures and references).
From this initial set of 460 compounds, five different training and test set combinations were randomly created. Previous work has shown that different combinations of training and test sets taken from the same data can produce different results [20]. Although this was with respect to a quantitative structure-activity relationship model, the same methodology was applied here. The reason for following this procedure was to identify whether randomly created clusters were able to classify compounds correctly. If only one of the training sets created clusters that classified compounds correctly, we could assume that this was more likely to be due to chance than if all five training sets created clusters that could correctly classify the compounds. A test set was chosen by randomly selecting 60 compounds with known interactions, ensuring that there was at least one compound with each type of interaction. The remaining compounds were used to form the training set. From this point onward, each training and test set combination will be referred to as dataset 1, dataset 2 etc., and collectively as the datasets. Table 1 shows the number of compounds with a specific interaction in each training and test set of the datasets. Compounds with more than one interaction are included in the training sets of datasets one and four and the test sets of datasets two, three and five.
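The construction of the training and test sets described above can be illustrated with a minimal sketch. It is not the authors' code: the function names `make_split` and `make_datasets`, the label strings and the seeding scheme are all assumptions made for illustration. In particular, the sketch guarantees one test compound per interaction type by drawing it first and then filling the remainder at random, which is only one possible way of meeting the constraint stated in the text.

```python
import random
from collections import defaultdict


def make_split(interactions, test_size, rng):
    """Split compounds into a training list and a test list.

    interactions maps a compound id to its interaction label, or to None
    when the point of interaction is unknown. Only compounds with a known
    interaction are eligible for the test set.
    """
    by_label = defaultdict(list)
    for compound, label in interactions.items():
        if label is not None:
            by_label[label].append(compound)
    n_known = sum(len(members) for members in by_label.values())
    if not len(by_label) <= test_size <= n_known:
        raise ValueError("test_size must lie between the number of labels "
                         "and the number of labelled compounds")
    # Assumption: guarantee one test compound per interaction type first ...
    test = {rng.choice(members) for members in by_label.values()}
    # ... then fill the rest of the test set at random.
    remaining = [c for members in by_label.values() for c in members
                 if c not in test]
    test.update(rng.sample(remaining, test_size - len(test)))
    # Everything else, including compounds with unknown interactions,
    # forms the training set.
    training = [c for c in interactions if c not in test]
    return training, sorted(test)


def make_datasets(interactions, n_datasets=5, test_size=60, base_seed=0):
    """Build several independent training/test combinations."""
    return [make_split(interactions, test_size, random.Random(base_seed + i))
            for i in range(n_datasets)]
```

With 460 compounds, of which 297 carry a label, `make_datasets(interactions)` would return five (training, test) pairs of 400 and 60 compounds respectively.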
Figure 2 Simplified NF-κB Pathway. In unstimulated cells inactive dimers of NF-κB are located in the cytoplasm bound to IκB proteins, preventing NF-κB from translocating into the nucleus. Activation of the inhibitor κB kinases (IKK) by NF-κB-activating stimuli (1) allows phosphorylation of IκB and NF-κB protein. Phosphorylation of IκB leads to its ubiquitination and degradation and therefore dissociation from the NF-κB dimers (2). The free dimers can then translocate into the nucleus (3) and regulate target gene transcription (4). IκB is a transcriptional target for NF-κB (5 and 6), creating a negative feedback loop.
Diagram labels: activation signal; IKK activation; phosphorylation & degradation of IκB; NF-κB dimers translocate to nucleus; NF-κB dimers bind to DNA and initiate transcription of target gene and IκB; IκB restores NF-κB to the cytoplasm.
Figure 3 Distribution of Compounds in a Representation of Chemical Space. The three principal components were calculated from 184 2D descriptors in MOE [21] and describe 51.1% of the variance. Type of interaction: orange = IKK inhibitors; pink = ROS interactions; blue = DNA interaction; green = inhibits translocation; yellow = increases IκB degradation & phosphorylation; yellow = decreases IκB degradation & phosphorylation; black = unknown. The compounds with unknown interactions in the area A all come from series of compounds based on Resveratrol [25] (bottom).
Each training set was clustered using Pipeline Pilot [16] with the following descriptors: Extended Connectivity Fingerprints with a path length of 4 atoms (ECFP4), Property descriptors (AlogP, molecular weight, number of hydrogen bond acceptors, number of hydrogen bond donors, number of atoms, number of rotatable bonds, number of rings, and number of aromatic rings), ECFP4 with Property descriptors, BCUT (descriptors obtained from the eigenvalues of the adjacency matrix, weighting the diagonal elements with atom weights), GCUT (obtained from the eigenvalues of a modified graph distance adjacency matrix), BCUT with GCUT, and GCUT with Property descriptors. The BCUT and GCUT descriptors were calculated using MOE [21]. Clustering
was based on maximal dissimilarity partitioning, with the clusters derived by imposing a distance threshold between a molecule and its cluster representative. As the clustering algorithm in Pipeline Pilot is dependent upon a seed compound, five different seeds were chosen for each descriptor and dataset combination (i.e. there were five different sets of clusters for each descriptor of each dataset, giving a total of 25 different sets of clusters for each descriptor).
Table 1 Compounds in the Various Training and Test Sets
Dataset | ROS interaction | Inhibits translocation | Interfering with DNA binding | Inhibits IKK activation | Inhibits IκB degradation or phosphorylation | Activates IκB degradation or phosphorylation
Training set 1 | 1 | 25 | 48 | 104 | 69 | 3
Test set 1 | 2 | 2 | 12 | 34 | 7 | 3
Training set 2 | 1 | 15 | 52 | 111 | 57 | 4
Test set 2 | 3 | 2 | 8 | 27 | 19 | 2
Training set 3 | 1 | 15 | 53 | 107 | 60 | 4
Test set 3 | 3 | 2 | 7 | 31 | 16 | 2
Training set 4 | 1 | 16 | 48 | 112 | 61 | 3
Test set 4 | 3 | 1 | 12 | 26 | 15 | 3
Training set 5 | 1 | 16 | 48 | 108 | 62 | 5
Test set 5 | 3 | 1 | 12 | 30 | 14 | 1
Table showing the number of compounds with a specific function for each training and test set of each dataset (training and test set 1 form dataset 1 and so on).
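The idea of maximal dissimilarity partitioning with a distance threshold and a seed compound can be illustrated with a small sketch. This is not Pipeline Pilot's implementation, whose details are not given in the text. It is a generic MaxMin-style procedure written under the assumptions that molecules are represented as sets of fingerprint features, that distance is 1 minus the Tanimoto similarity, and that new cluster representatives are added until every molecule lies within `max_distance` of its representative. All names (`tanimoto`, `max_dissimilarity_clusters`, `max_distance`) are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_dissimilarity_clusters(fps, seed_index, max_distance):
    """Partition fingerprints around maximally dissimilar representatives.

    fps is a list of feature sets, seed_index is the first representative,
    and max_distance (>= 0) is the largest allowed distance between a
    molecule and its cluster representative.
    Returns (centres, clusters) as lists of indices into fps.
    """
    if max_distance < 0:
        raise ValueError("max_distance must be non-negative")
    centres = [seed_index]
    # Distance from each molecule to its nearest representative so far.
    nearest = [1.0 - tanimoto(fp, fps[seed_index]) for fp in fps]
    owner = [0] * len(fps)
    while True:
        # The molecule furthest from every existing representative.
        far = max(range(len(fps)), key=lambda i: nearest[i])
        if nearest[far] <= max_distance:
            break  # every molecule is within the threshold
        centres.append(far)
        k = len(centres) - 1
        for i, fp in enumerate(fps):
            d = 1.0 - tanimoto(fp, fps[far])
            if d < nearest[i]:
                nearest[i] = d
                owner[i] = k
    clusters = [[] for _ in centres]
    for i, k in enumerate(owner):
        clusters[k].append(i)
    return centres, clusters
```

Because the first representative is the seed, calling the function with different values of `seed_index` can yield different partitions of the same data, which is the reason several seeds are tried for each descriptor and dataset combination.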
The number of clusters used in the clustering was chosen by using the following method [22]: first the training set compounds were clustered into a set number of clusters (n), which varied from one to 200 (which would give an average of 2 compounds per cluster). For each n, the average self-similarity (avg-s) of the clusters was calculated. The value of n was chosen so that the biggest decrease in avg-s was seen between (n-1) and n clusters.
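The selection rule for n can be written down in a few lines. The sketch below assumes the avg-s values have already been computed for each candidate n and are supplied as a dictionary; how avg-s itself is calculated follows [22] and is not reproduced here. The function name `choose_cluster_count` is hypothetical.

```python
def choose_cluster_count(avg_s):
    """Return the n with the biggest decrease in avg-s between n-1 and n.

    avg_s maps a number of clusters n to the average self-similarity
    obtained with n clusters. Only values of n for which n-1 is also
    present can be compared; a ValueError is raised if there are none.
    """
    candidates = [n for n in avg_s if n - 1 in avg_s]
    return max(candidates, key=lambda n: avg_s[n - 1] - avg_s[n])
```

For example, with avg-s values of 0.50, 0.45 and 0.30 for n = 2, 3 and 4, the largest decrease (0.15) occurs between 3 and 4 clusters, so n = 4 would be chosen.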
The clusters were analysed to see if the compounds they contained had predominantly one type of interaction. The interactions used to define the predominance of a cluster are as given above. Validation of the accuracy of the clustering procedure was performed by taking each compound in a test set in turn (i.e. the compounds in the test sets were used as query compounds), finding the most similar cluster and assigning the predominant interaction of that nearest cluster to the test compound. The nearest cluster was found in one of the following ways:
1. The cluster which had the most similar compound;
2. The cluster which had the most similar cluster centre;
3. The cluster with the highest average similarity;
4. Repeating the above while considering only clusters with a minimum size of 1 (i.e. including singletons), 2, 3, 4 or 5 compounds.
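The nearest-cluster rules listed above can be sketched as follows. The data layout (each cluster as a dictionary with a `centre` feature set and a list of `(fingerprint, label)` members), the use of Tanimoto similarity on feature sets, the rule names and the decision to ignore unknown labels when taking the predominant interaction are all assumptions made for this illustration rather than details taken from the paper.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_cluster(query, clusters, rule, min_size=1):
    """Pick the cluster nearest to the query under one of three rules.

    rule is "compound" (most similar member), "centre" (most similar
    cluster centre) or "average" (highest mean similarity to members).
    Clusters with fewer than min_size (>= 1) members are ignored.
    """
    def score(cluster):
        if rule == "centre":
            return tanimoto(query, cluster["centre"])
        sims = [tanimoto(query, fp) for fp, _ in cluster["members"]]
        if rule == "compound":
            return max(sims)
        if rule == "average":
            return sum(sims) / len(sims)
        raise ValueError("unknown rule: " + rule)

    eligible = [c for c in clusters if len(c["members"]) >= min_size]
    return max(eligible, key=score)


def predict_interaction(query, clusters, rule, min_size=1):
    """Assign the predominant known interaction of the nearest cluster."""
    cluster = nearest_cluster(query, clusters, rule, min_size)
    known = [label for _, label in cluster["members"] if label is not None]
    if not known:
        return None  # nearest cluster holds only unknown compounds
    return Counter(known).most_common(1)[0][0]
```

Raising `min_size` excludes small clusters from consideration, so a query that would otherwise be matched to a singleton is assigned to the nearest larger cluster instead.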
The compounds with unidentified points of interaction in the pathway were included in the training sets used in the clustering analysis in order to investigate how their inclusion affected the ability of using this technique to predict the interactions of the query compounds in the test sets.
Clustering
Surprisingly, the number of clusters chosen by all the
combinations of descriptors and datasets was 135, as
this gave the largest decrease in the avg-s. Figure 4
shows a heat map representation of the similarity
between the clusters for all the datasets. The similarity
was measured by comparing which pairs of compounds
were clustered together in each of the sets of clusters.
The figure shows that the compounds in the clusters
vary between the datasets. There is less variance within
the datasets of a single descriptor than with those of
other descriptors. This is to be expected as datasets of
the same descriptor will be partitioned in a similar way.
Ignoring singletons, the number of clusters varies from
61 to 85 for the datasets.
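Comparing two sets of clusters by the pairs of compounds they place together can be sketched as below. The text does not state which coefficient was used to turn the pair comparison into a similarity value; the Jaccard coefficient over co-clustered pairs is an assumption made here, and the function names are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of compound pairs that share a cluster.

    clusters is a list of lists of compound ids; singletons add no pairs.
    """
    pairs = set()
    for members in clusters:
        pairs.update(combinations(sorted(members), 2))
    return pairs


def clustering_similarity(clusters_a, clusters_b):
    """Jaccard coefficient of the co-clustered pairs of two clusterings."""
    pairs_a = co_clustered_pairs(clusters_a)
    pairs_b = co_clustered_pairs(clusters_b)
    if not pairs_a and not pairs_b:
        return 1.0
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)
```

Computing this value for every pair of cluster sets gives a square matrix that could be displayed as a heat map of the kind described for Figure 4.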
Each dataset was then analysed to see how many of the clusters contain compounds with a predominant interaction. The levels of predominance used in this analysis were 50%, 66%, and 75%, and considered clusters with a minimum size of 1 (singletons), 2, 3, 4 or 5 compounds. For example, the datasets were analysed to see how many clusters have at least 50% of their members with the same interaction.
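The predominance analysis can be expressed as a short function. In this sketch each cluster is reduced to the list of interaction labels of its members, with None standing for an unknown interaction; counting unknown compounds in the cluster size (the denominator) is an assumption, as is the function name `fraction_predominant`.

```python
from collections import Counter


def fraction_predominant(clusters, level, min_size=1):
    """Fraction of clusters in which one interaction reaches `level`.

    clusters is a list of lists of interaction labels (None = unknown),
    level is the predominance threshold as a fraction (e.g. 0.5, 0.66,
    0.75) and min_size is the smallest cluster size considered.
    """
    eligible = [labels for labels in clusters if len(labels) >= min_size]
    if not eligible:
        return 0.0
    hits = 0
    for labels in eligible:
        known = Counter(label for label in labels if label is not None)
        # Share of the cluster taken by its most common known interaction.
        if known and known.most_common(1)[0][1] / len(labels) >= level:
            hits += 1
    return hits / len(eligible)
```

Evaluating the function at each of the three levels and each of the five minimum sizes gives the grid of values summarised for each dataset.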
Figure 5 shows the success of the clustering at producing clusters with a predominant interaction using ECFP4 with Property descriptors. The clusters for dataset 5 are shown in Additional File 2. The figure shows that no matter what the minimum size of the clusters taken into consideration, more than half contain over 50% of compounds with the same interaction. Generally 50-70% of clusters contained over 50% of compounds with the same interaction. Of clusters containing 66% or 75% of compounds with the same interaction, dataset 4 performed the worst at 35%, while the other datasets gave 40-60% of clusters having the same interaction. Combining Property descriptors with ECFP4 descriptors may improve the clustering by making the descriptor more specific, e.g. with the Property component, compounds of a similar size and a similar number of rings have a higher similarity score to a query. Below we look at the clusters created for a dataset using these descriptors in more detail.
Figure 6 shows pie chart representations of the clusters (minimum size of two compounds) according to the interaction of their member compounds, either including or omitting compounds with unknown interactions. The clusters are those created using ECFP4 with Property descriptors as this gives the best predictions for the compounds not involved in the clustering procedure. The compounds in both sets of clusters can be found in Additional File 3. Taking into account the compounds with unknown interactions, there are 68 clusters with at least two compounds. In 45 clusters 50% or more of the members have the same interaction. 28 clusters have 50% or more members with an unknown interaction (6 have 50% of their members with the same interaction, and 50% with unknown interactions). Omitting the compounds with unknown interactions gives 40 clusters with at least two compounds. All of these have at least 50% of their members having the same interaction. Thirty clusters have 100% of their members with the same interaction, one has 75%, one 66.6%, and the rest (8) have 50% of their members interacting at the same