Kinome-wide Activity Modeling from Diverse Public High-Quality Data Sets

Article (PDF Available)inJournal of Chemical Information and Modeling 53(1) · December 2012with21 Reads
DOI: 10.1021/ci300403k · Source: PubMed
Large corpora of kinase small molecule inhibitor data are accessible to public sector research from thousands of journal article and patent publications. These data have been generated employing a wide variety of assay methodologies and experimental procedures by numerous laboratories. Here we ask the question how applicable these heterogeneous datasets are to predict kinase activities and which characteristics of the datasets contribute to their utility. We accessed almost 500,000 molecules from the Kinase Knowledge Base (KKB) and after rigorous aggregation and standardization generated over 180 distinct datasets covering all major groups of the human Kinome. To assess the value of the datasets we generated hundreds of classification and regression models. Their rigorous cross-validation and characterization demonstrated highly predictive classification and quantitative models for the majority of kinase targets if a minimum required number of active compounds or structure-activity data points were available. We then applied the best classifiers to compounds most recently profiled in the NIH Library of Integrated Network-based Cellular Signatures (LINCS) program and found good agreement of profiling results with predicted activities. Our results indicate that, although heterogeneous in nature, the publically accessible datasets are exceedingly valuable and well suited to develop highly accurate predictors for practical Kinome-wide virtual screening applications and to complement experimental kinase profiling.
9 Figures

Full-text (PDF)

Available from: Stephan Schurer, Nov 05, 2014
Kinome-wide Activity Modeling from Diverse Public High-Quality
Data Sets
Stephan C. Schürer*
and Steven M. Muskal
Department of Molecular and Cellular Pharmacology, Miller School of Medicine and Center for Computational Science, University
of Miami, Miami, Florida 33136, United States
Eidogen-Sertanty, Inc., 3460 Marron Road No. 103-475, Oceanside, California 92056, United States
SSupporting Information
ABSTRACT: Large corpora of kinase small molecule
inhibitor data are accessible to public sector research from
thousands of journal article and patent publications. These
data have been generated employing a wide variety of assay
methodologies and experimental procedures by numerous
laboratories. Here we ask the question how applicable these
heterogeneous data sets are to predict kinase activities and
which characteristics of the data sets contribute to their utility.
We accessed almost 500 000 molecules from the Kinase
Knowledge Base (KKB) and after rigorous aggregation and
standardization generated over 180 distinct data sets covering all major groups of the human kinome. To assess the value of the
data sets, we generated hundreds of classication and regression models. Their rigorous cross-validation and characterization
demonstrated highly predictive classication and quantitative models for the majority of kinase targets if a minimum required
number of active compounds or structureactivity data points were available. We then applied the best classiers to compounds
most recently proled in the NIH Library of Integrated Network-based Cellular Signatures (LINCS) program and found good
agreement of proling results with predicted activities. Our results indicate that, although heterogeneous in nature, the publically
accessible data sets are exceedingly valuable and well suited to develop highly accurate predictors for practical Kinome-wide
virtual screening applications and to complement experimental kinase proling.
Over 500 human protein kinases
as well as many nonprotein kinases are involved in virtually
every signal-transduction process, which are controlled by
phosphorylation cascades. Because of their ubiquity, kinases are
one of the most intensely pursued classes of drug targets.
Numerous distinct kinase targets and over 300 kinase inhibitors
are in clinical development,
and 15 protein kinase inhibitor
drugs have been approved so far (compiled from dierent
sources). While currently most of the clinical kinases drug
targets are being investigated for treatment of cancer, a growing
number of other disorders including immunological, neuro-
logical, metabolic, and infectious diseases have been associated
with dysregulation of protein phosphorylation.
This suggests
that the number of kinases as potential drug targets is
substantial and has generated huge interest in the development
of small-molecule inhibitors, most of them targeting the ATP
binding site of the kinase catalytic domain.
The ATP binding
region is highly conserved among protein kinases, which has
important consequences for the drug discovery process.
Achieving selectivity of a small molecule inhibitor against
kinase o-targets to avoid adverse reactions has generally been
considered challenging. On the other hand, the clinical ecacy
of many of the current kinase inhibitor oncology drugs is
related to their polypharmacologytheir ability to inhibit
multiple kinases at the same time.
It is now well-understood
that the development of novel ecacious and safe compounds
requires a nely tuned balance of polypharmacology and
Despite the high degree of conservation in the ATP
binding site, reasonably selective inhibitors with favorable
pharmacological properties can be developed.
This is helped
by the increasing understanding of structural and functional
relationships across the kinome.
Today, it is quite common
to prole inhibitors against an extensive set of kinase targets
at an early stage of development. Kinase proling technologies
have generated valuable data sets and provided insights into the
determinants of selectivity and promiscuity of clinical
One such public eort currently takes place
under the NIH Libraries of Network-based Cellular Signatures
(LINCS) program.
The LINCS program develops a library of
molecular signatures based on gene expression and other
cellular changes in response to perturbing agents across a
variety of cell types using various high-throughput screening
approaches. LINCS proled compounds include many kinase
inhibitors, because of their translational potential. LINCS
kinase proling results are currently available from the HMS
Received: August 24, 2012
Published: December 21, 2012
© 2012 American Chemical Society 27 |J. Chem. Inf. Model. 2013, 53, 2738
LINCS DataBase
and also via the LINCS Information
FrameWork (LIFE).
High-throughput technologies and conventional screening
over two decades of small molecule kinase inhibitor research
have generated a huge amount of data available in peer-
reviewed publications and patents. In addition to numerous
individual virtual screening approaches, the availability of large
data sets of small molecule kinase inhibitorsin particular in
the pharmaceutical industryhas spurred eorts to utilize the
available data sets to understand and model kinase poly-
pharmacology for example by analysis of inhibitor cross
and analysis of kinase chemotype selectivity
Here, we focus on the data that are available in the public
domain. The reliability of heterogeneous data sets generated
under dierent screening conditions and using dierent assay
methods and technologies are sometimes questioned. We want
to understand how useful these types of data sets really are. To
do this, we evaluate results from dierent machine learning
techniques applied to the data. Earlier examples among
numerous machine learning approaches to classify kinase
inhibitors include neural networks trained using BCUTS
ve Bayesian modeling using extended 2D
topological ngerprints and basic molecular descriptors,
and a
survey of dierent machine methods using fragment-based
Ghose-Crippen descriptors.
In an eort toward virtual
polypharmacology, fragments and fragment counts have
successfully been used in prospectively exploring kinase
inhibitor activity space.
Here, we generated and charac-
terized hundreds of kinase classication and regression models
based on a large number of data sets extracted from the Kinase
Knowledge Base (KKB),
which span the entire human
kinome. The data sets were curated from several thousand peer-
reviewed journal and patent publications from numerous
laboratories comprising various assay technologies, assay
designs, and procedures. We investigated the applicability of
these data sets to generate predictive models, which modeling
technique(s) were best suited and applicable for virtual
screening, and which characteristics of data sets led to good
predictors. We applied the best predictors to compounds
proled in the LINCS program and found that these proling
results were in good agreement with predicted actives.
Kinase Data Sets. The Q4 2009 release of the KKB
incorporated >430 000 bioactivity data points from over 20 000
assay experiments curated from more than 1800 journal articles
and more than 4400 patent publications. It covers over 500 000
unique molecules. The KKB includes various metadata
annotations including standardized target and assay format.
biochemical assays (enzymatic assays, performed with puried
protein) of human species were chosen. Any mutant kinase
targets were excluded. Only high-quality concentration
response end points (e.g., IC50,Ki) were kept. Because in the
KKB, chemical structures are stored as published, we
standardized all structures using an in-house Pipeline Pilot
Salts/addends and duplicate fragments were
removed (using an in-house salt library) so that each structure
consisted of only one fragment. Stereochemistry and charges
were standardized, the structures were then ionized at pH = 7.4
and tautomers were canonicalized. For the purpose of this
modeling study in which we use extended connectivity
ngerprint of length four (ECFP4) descriptors (see below),
stereochemistry and E/Z geometric congurations were also
removed. All data points were rst transformed into p-values
(i.e., pIC50 =log10[IC50] in molar concentration) and then
aggregated, rst within each experiment and then across
experiments by unique structures and kinase targets. Kinase
protein targets were identied by unique standardized Entrez
Gene symbols, which are annotated in the KKB. The kinase
gene symbols were mapped to Uniprot accessions. Although
detailed experimental descriptions are available in the KKB, we
did not lter any assay technologies or experimental conditions.
Identical target-structure data points were aggregated using the
median for exact data points (to minimize the eect of outliers
introduced by dierent assay methods and experiments) and in
case of qualied data (greater than, less than, range) the most
conservative (inclusive) ranges were kept. This aggressive data
aggregation procedure resulted in 233 667 unique (structure-
target) data points for 126 114 unique structures covering 411
unique kinase targets. We also standardized the structures of all
KKB molecules resulting in a total of 489 373 unique chemical
compounds. Access to these data sets along with the entire
KKB database for academic research groups can be obtained
through the portal For the naï
ve Bayesian
classication models, active compounds were dened as p-
transformed activity concentration of greater or equal to 6 (i.e.,
IC50 1μM). Using this denition of active, 189 kinase
targets have at least 10 active compounds (see Supporting
Information Table S1; not including TBK1, which includes 169
actives and no inactives). We built nä
ive Bayesian classication
models treating the data sets in two dierent ways: in one case,
the classiers are trained employing only compounds that are
explicitly dened as active or inactive (KA-KI; known active-
known inactive). In the other case, we use all unique KKB
compounds as decoys and presume as inactive all compounds
that are not annotated as actives for any specic kinase (KA-PI;
known active-presumed inactive). For the regression models
(see below) only data points with exact activity values (no <, >,
or range data) were kept. They included data sets for 168
kinases with at least 20 structureactivity data points
(Supporting Information Table S2, which also incudes data
set statistics). To evaluate the activity range of kinase inhibitors
by structural series and the coverage of structural series across
dierent kinase targets, we clustered all compounds into 336
clusters (twice the number of kinase data sets). We used the
partitioning algorithm implemented in the Pipeline Pilot 8.0
(Accelrys) modeling collection with the Tanimoto distance
function on ECFP4 ngerprints (see below); cluster centers
were selected by maximum dissimilarity. After generating the
clusters, we visualized the pIC50 activity ranges for each kinase
data set for each cluster (Supporting Information Table S3).
KINOMEScan kinase proling results from the LINCS
project were downloaded from the HMS LINCS DataBase.
total 25 064 data points were obtained with 60 unique
compounds (by HMSL_ID), 43 of them with dened/known
chemical structure, and 486 dierent targets (including
mutations for several kinases); not all compounds were tested
against all targets. Kinase activity was screened at 10 μM
compound concentration. Reported activities were transformed
into percent inhibition. The kinase targets were manually
annotated with their corresponding Uniprot accessions based
on their symbol/description. The targets were mapped to the
KKB data sets (and models) based on their Uniprot accessions.
A total of 4796 kinase compounds pairs were mapped for
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273828
compounds with known structures after removing mutant
kinase targets from the KINOMEScan data sets.
Laplacien-Corrected Nai
ve Bayesian Classiers. Naï
Bayes is a statistical classication method based on conditional
probabilities: in this context, the probability of a compound
being active given the presence of structural features computed
from the frequencies of occurrence in a training set of active
and inactive samples. Naï
verefers to the assumption that
features are independent and that the overall probability
therefore can be computed by multiplying probabilities of the
individual events. The Laplacien-corrected estimator accounts
for the dierent sampling frequencies of dierent features
assuming that the vast majority of features have no relation with
activity. The Laplacien correction stabilizes the estimator: as
the number of samples containing a feature approaches zero,
the features probability contribution converges to the baseline
probability. The nal estimate for a particular sample is
computed as the sum of the logarithm of the relative
(corrected) individual feature weights.
This classier has
several desirable criteria. It scales linearly with the number of
molecules and is therefore applicable to large data sets.
Bayesian classication is suitable to model in high-dimensional
spaces (large number of descriptors do not cause overtting). It
is therefore appropriate for developing models from structurally
dissimilar molecules and to incorporate multiple activity classes
into a single model (for example dierent sites/binding modes
or dierent mechanism of action). The Laplacien-corrected
ve Bayes classication method is also reasonably resistant to
noise such as false positives or false negatives.
The Pipeline
Pilot modeling collection includes an implementation of this
classication learner, which was employed here.
Here we built Laplacien-corrected nai ̈
ve Bayesian classi-
cation models based on two dierent methods of handling the
kinase data sets as described above. Classiers are trained from
known actives and known inactives (see data sets for 188
kinases provided in Supporting Information Table S1). KA-PI
classiers were built using one data le of all unique (489 373)
kinase molecules in which the activity categories are dened by
an array containing the respective kinase symbol(s) for which
each compound qualies as active. The classiers were then
built by identifying for each individual kinase the active
compounds and treating the remaining compounds as inactive
(decoy). Both types of classication models were rst built
using all data sets and characterized by the Pipeline Pilot
internal leave-one-out cross-validation to estimate the area
under the receiver operating characteristic (ROC) curve (ROC
score) and enrichment results (for 1, 5, 10, 25, 50, 75, and
90%). The ROC curve is the true positive rate (TPR) over the
false positive rate (FPR). TPR is equal to sensitivity (S) of the
model and dened and as the number of true positives (TP)
divided by the number of actives (Nact). Specicity (SP) is the
number of true negatives (TN) divided by the number of
inactives (Ninact). FPR is 1 SP, which is equal to the number
of false positives (FP) divided by Ninact. The enrichment factor
(EF) at any given percentage of compounds tested is dened as
EF = [TP/(TP + FP)]/[Nact/N] where Nis the overall number
of samples; it is a measure of how well the predictor recovers
active compounds relative to random retrieval. To validate
these modeling approaches more rigorously, we randomly split
the data sets into 75/25 training/test sets and built the models
using the training sets and generated ROC scores and
enrichment results (for 0.1, 0.5, 1, 3, 5, 10, and 20%) using
the test sets. This randomized train/test evaluation procedure
was repeated 10 times, and the results were averaged over the
repetitions. Supporting Information Tables S1 and S4 report
the modeling results for all 188 and 189 kinase data sets
corresponding to the KA-KI and KA-PI approach, respectively.
Tables S1 and S4 also include the average numbers of training
and test sets for each kinase data set and the maximum
achievable (perfect) enrichment factors (EFmax), which is the
maximum number of actives among the percentage of tested
compounds divided by the fraction of actives in the entire data
set. To evaluate enrichment, we report the ratio of EF/EFmax.
This normalized enrichment factor is a useful measure, because
EF values are not directly comparable across dierent data sets.
This is because maximum possible EF by denition is limited to
the ratio of total to active compounds in a data set. EF is also
limited to the reciprocal of the fraction of compounds tested.
For example if 0.1% of all compounds are tested, the highest
possible EF is 1000 if the fraction of actives is less or equal to
0.1% of total compounds (and if all retrieved compounds are
true positives). Table S1 incudes some empty elds for EF at
lower percentages, because enrichment factors for a given
percentage of screened compounds can only be obtained if the
number of tested compounds is at least one. For the KA-PI
classication models using the entire set of kinase compounds
as decoy, we further characterized the classiers by the same
cross-validation procedure employing training/test set ratios of
50/50 as well as 25/75. These results are reported in
Supporting Information Table S5. We also built KA-PI
classiers from the same chemical structures but after
randomizing the kinase activities while maintaining the number
of actives for each kinase. Leave-one-out cross-validation
resulted in ROC scores of close to 0.5 and enrichment factors
of approximately the ratio of actives to overall number of
samples, which corresponded to random classication as
To test domain applicability, 53 kinase data sets that each
have at least 500 active compounds were selected and each
clustered into 10 series using the partitioning algorithm
implemented in the Pipeline Pilot 8.0 (Accelrys) modeling
collection with the Tanimoto distance function on ECFP4
ngerprints (the same as for the kinase models, see below);
cluster centers were selected by maximum dissimilarity. For
each kinase data set, 10 models were built, each using a
dierent series as the (active) test set and the remaining 9
clusters combined as the (active) training set. Inactive test and
training compounds were selected randomly for each model in
the same ratio of active training to test compounds. This way
530 models were built. For each active test compound the
Tanimoto similarity to the closest active training compound
was calculated. ROC scores were computed for each model
based on predictions of the test set to evaluate the performance
of the models based on similarity of the test to the training set
(details are provided in Supporting Information Table S6). In
addition, for each active test set compound the predicted
activity was recorded to evaluate true positive rate as a function
of similarity to the closest training compound. In total 98 731
predictions were made across all 530 models.
To compare predicted kinase activity to the KINOMEScan
proling results, we used the EstPGood output of the KA-PI
classiers, the estimated probability that the sample is in the
active category based on an assumed normal distribution within
the active and inactive categories.
Regression Models. Both knearest neighbor (kNN) and
partial least square (PLS) regression was performed using the
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273829
implementations of these machine learning procedures in the
Pipeline Pilot modeling collection. PLS is a statistical method
to build a (linear) model of the response variable as a function
of (uncorrelated) linear combinations of the input descriptors
(components); it is arguably one of the most traditional, yet
widely used, approaches to qualitative structureactivity
relationship (QSAR) study.
In our PLS models, we restrict
the number of components to 20. The kNN method is among
the simplest in machine learning algorithms. In kNN
regression, the response variable is computed as a weighted
average of the value of the knearest neighbors. Here we use 20
nearest neighbors and a dynamic smoothing factor (Gaussian
weighting) of 0.5. For both PLS and kNN, we performed a full
10-fold cross-validation and computed R2for the training data,
q2for cross-validation, and RMSE for training and cross-
validation. We repeated the 10-fold full cross-validation
procedure 10 times with random split of the 10 subsets. The
averaged results are reported in Supporting Information Table
Structural Descriptors. Because of their strong perform-
ance in previous studies,
we employed extended
connectivity ngerprints (ECFPs). Specically we used atom
type ECFPs of length four (ECFP4) implemented in Pipeline
ECFPs are topological circular ngerprints character-
izing each atom by its number of atomic connections, element
type, charge and mass, and environment (in this case up to four
neighbor atoms).
KA-KI Classiers. We rst built and evaluated Laplacien-
modied naï
ve Bayesian KA-KI classiers. For this we selected
all data sets from the preprocessed (aggregated) KKB with
minimum of 10 active compounds where active is dened as p-
transformed activity value of 6 (i.e., IC50 <1μM). A total of
188 kinase data sets qualify; the minimum data set size (actives
plus inactives) was 26. The classiers were cross-validated by a
leave-one-out analysis and 10 repetitions of randomized 75/25
training/test split (see Methods). For both cross-validation
methods, ROC scores and enrichment factors show that these
KA-KI classiers perform well for most of the data sets
(Supporting Information Table S1). Good results are obtained
for data sets with more than 40 active compounds (minimum
61 compounds total), which corresponds to 130 data sets with
a ROC score (leave-one-out) of 0.7 or larger; that is with the
exception of one data set (MST1R) that has only two inactives
and therefore cannot be meaningfully evaluated and was
excluded from further analysis (leaving 129 data sets). Raising
the minimum number of actives to 70 further improves the
results to ROC scores of 0.84 or greater (111 data sets with at
least 113 total samples). ROC scores based on leave-one-out
and 75/25 (train/test) cross-validation procedures are well
correlated for data sets with at least 40 actives (R2= 0.73) and
increases further for data sets with 70 actives or more (R2=
0.77) as shown in Figure S1 (Supporting Information).
Correlation of the ROC score of the two cross-validation
procedures increases signicantly for data sets with a ratio of
total to active compounds of two or greater: for the data sets
with at least 40 actives, R2is greater than 0.85, and for data sets
with 70 actives, R2improves to 0.93. However, this
decreases the number of data sets to 57 and 47, respectively.
In general the classiers for the more balanced and larger data
sets performed better.
For the 129 data sets with >40 actives, enrichment factors
(EF) at 10% tested samples varied between 1 and 8 for both
the leave-one-out and train/test cross-validation procedures.
The maximum enrichment factor (EF max) depends on the
fraction of actives in the data set. For all 129 data sets, the
actual enrichment factors reach the maximum possible
enrichment indicating well-performing classiers. However,
EF is of limited usefulness to characterize classiers based on
Figure 1. Characterization of 141 kinase KA-PI protein and nonprotein kinase classiers, including all major protein kinase groups. ROC scores are
shown as a function of active samples. Shape by protein vs nonprotein kinase, color-coded by kinase group, scaled by number of active data points,
and annotated by HUGO kinase gene symbol.
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273830
balanced data sets. The nature of the data sets reects what is
typically reported in the literature emphasizing active
compounds and reporting only few (often structurally related)
inactive compounds. In contrast, high-throughput screening
(HTS) results include mostly inactive/negative results. There-
fore, in practice, the KA-KI classiersalthough highly
predictivemay be of limited utility. Highly miniaturized
HTS allows a throughput of several hundred thousand to
millions of compounds. Consequently, in order to recover a
sizable fraction of hits by screening only a few hundred to
thousand compounds, 100- to 1000-fold enrichment or even
greater would be desirable.
KA-PI Classiers. We investigated such a scenario by
building KA-PI classiers for the 189 kinase data sets with at
least 10 actives as dened above. To build these classiers, we
employ all (489 373) KKB kinase molecules as decoys; i.e., we
presume as inactive all compounds that were not specically
annotated as active for any given kinase. This resulted in 189
data sets, each consisting of 489 373 samples including 10
6388 actives. Their size and unbalanced nature corresponded
quite well to real HTS data sets, for example those in
Laplacien-modied naï
ve Bayesian classication is a suitable
method to model our data sets of almost half a million unique
kinase-related compounds (see Methods), which can bind at
dierent sites, such as ATP competitive compounds, but also
allosteric inhibitors, and with dierent binding modes, for
example type I and type II inhibitors.
Although many of the
presumed inactive compounds did not have specic activity
annotations, they were all reported in patents and journal
publications that focus on kinase inhibitors. They are thus
closely related in terms of their biological focus and in many
cases structurally.
Similar to the KA-KI classication models, the KA-PI
classiers were validated by leave-one-out and 10 repetitions
of train/test cross-validation. We report ROC scores and
enrichment factors at various percentages of samples screened.
In addition to the 75/25 split, we also evaluated the classiers
in a 50/50 and 25/75 training/test cross-validation (10
repetitions averaged, see Methods).
Figure 1 shows ROC score (leave-one-out) as a function of
the number of active samples for all 141 kinases KA-PI
classiers with at least 25 actives. The 141 kinases cover all
major groups of the human kinome and also several nonprotein
kinases. All 189 models are shown in Supporting Information
Figure S2. As a general trend, it can be seen that the quality of
the classiers increases with the number of active samples. In
particular, data sets with more than about 50 actives resulted in
much improved results compared to those with fewer actives.
For classiers with more than a few hundred actives, there
appeared to be no further improvement in ROC score as the
number of actives increases further. For the majority of models,
the ROC scores are very high (>0.96). All details are provided
in Supporting Information Table S4, which shows ROC scores
and enrichment factors for the leave-one-out and 75/25 train/
test cross-validation procedure for all 189 KA-PI kinase
classiers. Data set statistics are also shown. Supporting
Information Figure S3 illustrates the relationship of ROC
scores for leave-one-out vs 75/25 train/test cross-validation for
KA-PI classiers based on data sets with at least 10 active
samples vs data sets with at least 50 actives. As with the KA-KI
models, leave-one-out ROC score estimate was closely related
to the ROC score obtained by 75/25 train/test validation, in
particular for the data sets with a greater number of actives.
Supporting Information Table S7 shows the ROC plots for the
KI-PI classiers based on a data set with >50 actives using a 75/
25 train/test set. While Table S7 shows one ROC plot for each
kinase/data set, it should be noted that ROC scores reported in
Supporting Information Table S4 are averaged over 10
While ROC score is a good measure of the predictors overall
performance, enrichment, in particular for low percentages of
Figure 2. Normalized enrichment factors for 189 KA-PI kinase classiers at 0.1% screened samples for 75/25 (train/test) cross-validation (10
repetitions averaged) as a function of active compounds in the data set. Shape by protein vs nonprotein kinase, color-coded by kinase group, scaled
by number of active data points, and annotated by HUGO kinase gene symbol.
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273831
screened compounds, is an important measure of the practical
applicability of a predictor. Enrichment factors (EF) at very low
percentages of screened samples for the KA-PI classiers were
very high for most kinases. Figure 2 illustrates the normalized
enrichment factor, that is the ratio EF/EFmax (see Methods), at
0.1% tested samples for all 189 kinase classiers based on
randomized 75/25 train/test cross-validation averaged over 10
repetitions. It can be seen that most classiers are able to
retrieve true actives very well if the number of actives in the
data sets are greater than about 50. Enrichment for these
classiers is generally greater than 50% of EFmax suggesting that
the kinase classiers presented here are practically applicable for
virtual screening. Enrichment increased signicantly for
classiers based on data sets of more than 50 actives. This
indicated a required minimum number of the active class of the
training set to reliably retrieve true positives from the test set.
This was also reected in the ROC scores (compare Figure 1).
Absolute EF values for 0.1, 0.5, and 1.0% of tested compounds
are provided in Supporting Information Figure S4 and Table
Figure 2 suggested the highest enrichments for data sets that
have between 50 and 250 active molecules. The normalized
enrichment factors were slightly lower and relatively stable
around 0.5 for data sets with more than a few hundred actives.
A possible reason for higher enrichment in data sets with
smaller numbers of active compounds that are derived from
only a few studies may lie in overrepresentation of scaolds
(analog bias; see below for domain applicability results).
However, this is less likely for the larger data sets that have
been extracted from a large number of articles and patents. It
should also be emphasized here that the decoy (presumed
inactive) compounds were all derived from the same kinase
literature that are also the source of the active compounds and
are therefore closely related in terms of their biological focus
and in many cases also structurally. Normalized enrichment
factors obtained by 75/25 train/test cross-validation at 0.1, 0.5,
and 1.0% tested samples are shown in Supporting Information
Figure S5. As the percentage of tested samples increases, a
larger fraction of classiers show a very high ratio EF/EFmax
(>0.8). Although expected, because the maximum possible
enrichment decreases, the results also indicated that it is more
dicult to retrieve a certain fraction of true positives in a
smaller set of sampled compounds, compared to a larger, i.e.
from 0.1% vs 0.5% or 1.0% tested compounds.
Following standard procedure to further validate the KA-PI
classiers with our data, we randomized the kinase activities
(maintaining the number of actives for each kinase data set).
Leave-one-out cross-validation resulted in ROC scores of close
to 0.5 and EF of approximately the ratio of actives to overall
number of samples. This corresponded to random classication
as expected.
We also investigated how the ratio of training and test sets
inuenced the ROC and enrichment cross-validation results for
the various kinase data sets. In addition to 75/25 (Supporting
Information Table S4), we split the data into by 50/50 and 25/
75 (Supporting Information Table S5). ROCs were slightly
higher for larger compared to smaller training sets (Supporting
Information Figure S6) indicating improved overall predict-
ability, which was also consistent with the general trend of the
ROC scores as a function of the number of active samples
(Figure 1 and Supporting Information Figure S2). More
specically, as the ratio of training/test compounds decreased,
the required number of active compounds in the data sets to
give very good ROC (>0.96) increased. This suggested a
threshold of required active (training) compounds to develop
very good predictors.
In contrast to increased overall predictivity (measured by
ROC score) for larger (compared to smaller) train/test ratios,
enrichment factors at very low percentages of tested
compounds (0.1% and 0.5%) increased slightly with lower
(compared to higher) ratios of train/test sets. Figure 3
illustrates the average ratio of EF/EFmax at 0.1% tested samples
for the dierent train/test ratios across data sets for various
ranges of numbers of active compounds. Other than for the
lowest bin of active compounds (0400), enrichment at 0.1%
increases as the ratio of train/test decreases. Supporting
Information Figures S7 and S8 show EF and EFmax for each
individual data set as a function of active compounds for the
three train/test ratios at 0.1 and 0.5% tested samples,
respectively. Supporting Information Figure S9 shows the
normalized enrichment factors (EF/EFmax) for each data set at
0.1% tested samples, illustrating the same trend. Although the
ratio of active to inactive compounds is the same among the
dierent train/test splits, these results suggest that it is easier
for the classier to select true actives from a test set with a
larger (in absolute numbers) pool of active compounds.
Because the trend holds for the data sets with the highest
numbers of actives, it is likely not a trivial analog bias; although
classiers are based on structural features and therefore true
positive test compounds are by denition structurally related to
the active training compounds. More importantly, the results
indicated again that once a certain number of active compounds
are available, classier performance does not increase
signicantly with additional data. This is consistent with the
validation results described above based on ROC and EF.
To evaluate how similarities of test to training compounds
aect model performance, we performed a simple domain
applicability study (Figure 4, compare methods). Data sets with
at least 500 active compounds were selected (53 kinase data
sets representative of most of the human kinome), and the
actives of each data set were clustered into 10 series. Ten
models were generated each using one series as test set while
using the remaining combined 9 clusters as training set. This
way predictions are made for compounds that are structurally
Figure 3. Average normalized enrichment (EF/EFmax) at 0.1% tested
samples for dierent train/test ratios binned by ranges of active
compounds across 141 kinase data sets with at least 25 actives (EF
averaged over 10 repetitions).
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273832
dissimilar to training compounds. In this way, 530 models were
generated and evaluated. Figure 4A shows a histogram of
closest test-to-training set similarities by ROC score; results are
shown for data sets with at least 10 active test compounds (443
models). Specically we investigated ROC score as a function
of the similarity of test set cluster center to the closest training
compound (B), the average of the closest similarities of the test
set compounds to the training set compounds (C), and the
similarity of the closest training compound to any test
compound (D). As can be seen, the best criteria of model
applicability is the average closest similarity of the test
compounds to the training compounds (panel C). The models
show good predictivity (ROC > 0.8) for relatively low average
closest similarities (>0.5).
In addition to ROC score for each model (cluster), we
recorded the predictions of all active compounds over all 530
models (98 731 predictions total) and evaluated the hit rate as a
function of similarity of the predicted test compound to the
closest training compound (Supporting Information Figure
S10). Figure S10A shows the true positive rate (TPR), and
Figure S10B shows the distribution of true positives and actives
as a function of closest similarity of the test compound to the
training set. TPR drops osharply as the similarity decreases
below 0.5, and the TPR is greater than 0.8 for compounds with
a Tanimoto similarity of >0.6 (ECFP4).
To illustrate the potential applicability of the KA-PI models
for virtual screening, Figure 5 shows the enrichment results at
0.1% tested samples as the fraction of true positives obtained
from the test set vs the fraction of actives in the entire data set.
As expected the true positive rate increases with the ratio of
active to total compounds until the latter reaches 0.1% after
which the true positives identied from the test set remain
relatively constant between 40 and 80%. In this plot, the
enrichment factor achieved by each classier at 0.1% is the
quotient of the yand the xvalues. In practical terms, were the
KA-PI classication models applied to prioritize 500 com-
pounds from a library of about 500 000, one may expect to
recover anywhere from 10 to 400 actives depending on the
number of actual actives for a given kinase target. Thus these
models appear practically applicable, assuming that our data
sets reasonably well represent the kinase inhibitor chemical
The performance of the Laplacien-modied naï
ve Bayesian
kinase KA-PI classiers based on ROC scores and enrichment
at very low percentages of tested compounds indicate that the
data sets employed here are well-suited to build highly
predictive kinase classication models. The results consistently
suggest that the best classiers are obtained if a minimum
number of conservatively 50 active training compounds are
available against a large decoy set. Performance of the models
Figure 4. Model domain applicability. KA-PI model performance as measured by ROC score as a function of similarity of test to training sets. 443
models are shown for 53 kinases (representative of most of the human kinome). (A) Average closest similarities by binned ROC score. Shown are
closest individual similarities (blue), cluster average closest similarities (red), and cluster center closest similarities (yellow). (B) ROC score by
cluster center closest similarities. (C) ROC score by cluster average closest similarities. (D) ROC score by individual closest similarities (cluster
global closest similarity).
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273833
can be further improved with additional active compounds of
up to a few hundred but does not improve much beyond this.
Given the low ratio of structureactivity data points to unique
compounds in the KKB, it is likely that some of the presumed
inactive (decoy) compounds are in fact active against one or
several kinases. It is therefore feasible that the kinase KA-PI
classiers can be further improved as more kinase proling data
sets become available.
Regression Models. In addition to the naï
ve Bayes binary
classiers we also wanted to evaluate how suitable the data sets
are for the development of quantitative predictors. We chose
partial least square (PLS) and knearest neighbor (kNN)
regression as two fairly dierent learning methods to
quantitatively predict a continuous property. PLS and kNN
regression are considered most applicable for data sets of
congeneric (structurally similar) molecules. The methods can
be sensitive to outliers and require high-quality data. Here, 168
kinase data sets with at least 20 exact molecule-activity data
points were extracted from the standardized and aggregated
KKB data sets (see Methods). PLS and kNN QSAR models
were built as described and evaluated by 10-fold cross-
validation, which was further repeated 10 times randomly
partitioning the data. All cross-validation results (including, q2,
and RMSE) and data set statistics are summarized in
Supporting Information Table S2. Figure 6 illustrates q2values
for the kNN vs PLS models along with the number of
structure-data points for each kinase. The kNN method in
general outperformed PLS, and the performance (measured by
q2) of both methods correlated reasonably well. From Figure 6,
one can also conclude a trend in which the predictive quality
for both PLS and kNN models generally improved with the size
of the data sets. Although this trend was not as strict as what we
observed for the naï
ve Bayes classiers, a cutofor kNN q2
0.4 and PLS q20.25 leaves 91 kinase data setsall except
three having greater or equal to 50 structure-data points (also
compare Figure 8). For kinases with 500 or more data points,
all but one model have kNN q2values of >0.5.
Another important characteristic inuencing the quality of
both PLS and kNN was the activity range of the data sets.
Figure 7 illustrates the average q2for both kNN and PLS as a
function of the p-transformed activity range, which corresponds
to the orders of magnitude between the most and least active
compound. PLS and kNN q2values continuously increased
from 0.1 to >0.8 as the activity range increased from 1 to 12
orders of magnitude. To evaluate the activity range across
structural series and to see how structural series distribute
across dierent kinase data sets, we clustered compounds into
336 clusters and calculated activity ranges for each kinase for
each cluster (see Methods and Supporting Information Table
S3). Supporting Information Figure S11 shows the total
number of kinases and activity range of each cluster. These
results show that many structural series span a wide activity
range and many kinases suggesting that a scaold bias should
not be a general concern for the models built using these data
Figure 8 shows kNN and PLS models (characterized by q2
and number of structure-data points) for 91 kinases (selected
from 182 total models) with kNN q20.4 and PLS q20.25.
Figure 5. Fraction of true positives (TP) of the number of tested
compounds in 0.1% of the test set as a function of the ratio of actives
to total compounds for dierent ratios of training to test data. Also
indicated is the enrichment factor as the size of the circles. The
enrichment factor (at 0.1%) for each classier is the quotient of yvalue
and the xvalue.
Figure 6. Quantitative regression models developed from 168 kinase
data sets. q2values of kNN vs PLS regression models and the number
of kinase activity data points indicated by the circle size and color (see
Figure 7. Average q2for kNN and PLS regression cross-validation
results as a function of the p-value range (dened as p-valuemax p-
valuemin) of all 168 kinase data sets.
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273834
These include protein and nonprotein kinases and all major
kinase groups. The quality of the majority of the models is quite
good indicated by q2and also R2values (compare Supporting
Information Table S2).
PLS and kNN regression are very dierent machine learning
methods. For many of the kinase data sets we obtained high
quality models with both methods, in particular the larger data
sets and distinctly the ones spanning a wider activity range.
These kinase data appear particularly well-applicable for
quantitative modeling, and by that measure, they are of high
Application of Classication Models to LINCS Kinase
Proling Data. Because of their exceptional performance, we
tested the KA-PI classiers on data recently generated in the
NIH LINCS project (see Introduction). The recent KINO-
MEScan screening technology
is used at the Harvard
Medical School LINCS center to prole compounds against
486 targets. We applied the KA-PI classication models to the
43 compounds with known structures. Figure 9 illustrates the
probability that a compound is active against a kinase and the
corresponding actual activity based on the KINOMEScan
proling results. Shown are all data with greater than 10%
probability of activity for targets that could be mapped to the
KINOMEScan results (see Methods). All compounds, targets,
predictions, and percent inhibition values are given in
Supporting Information Table S8. As can be seen, all but
three activity predictions are conrmed by these screening
results given a probability cutoof 10% and an activity of >60%
inhibition; the majority of actives even have greater than 90%
inhibition. If we select a 60% probability cuto, there is only
one outlier as shown in Figure 9 (CSK) predicted as highly
likely active, but with 0% inhibition. This appears to be a false
positive; PD173074, an FGFR inhibitor has a reported IC50 of
20 μM for CSK (KBB, Eidogen-Sertanty). Overwhelmingly
however, the kinase model activity predictions are in agreement
with the KINOMEScan proling results.
In addition to correct predictions of active kinase inhibitors,
the classierspredicted activities overall correspond very well
to the LINCS KINOMEScan results. Figure 10 illustrates the
aggregated (sum) probabilities of activity (by kinase category)
for all mapped compound target combinations (see Methods)
by binned percent inhibition. As can be seen, predicted kinase
activity probabilities are much greater for compounds reported
as active in the proling results and in particular for the most
active category (>90% inhibition). Supporting Information
Figure S12 shows the histogram of mapped KINOMEScan
activity results with the vast majority being inactive (<10%
inhibition), and which correspond to very low predicted
probability of activity (Figure 10).
These results illustrate very good performance of the
classiers. Although expected, it should be noted that, because
KINOMEScan proling was performed at a relatively high
screening concentration of 10 μM, a relatively large percentage
of compounds appear active (Supporting Information Figure
S12); this is in contrast to predictions that are based on models
with an activity cutoof IC50 1μM. Nonetheless the
KINOMEScan results are in very good concordance with the
KA-PI model predictions.
Here we investigated the quality and value of publically
accessible heterogeneous data sets of small molecule kinase
inhibitors, which were aggregated from thousands of journal
articles and patents. Because data sets are generated by
numerous methods and technologies and under dierent
conditions, their integrity and usefulness is sometimes
questioned. We used dierent machine learning approaches
to establish the quality and utility of these data. We were able to
Figure 8. kNN and PLS activity predictors for 91 kinases (q2kNN > 0.4 and q2PLS > 0.25) by the number of data points. Data sets include protein
and nonprotein kinases and all major kinase groups. kNN q2is shown by number of (unique) structuredata points. The symbol indicates protein vs
nonprotein kinase, the size is scaled by PLS q2, colored by kinase group, and annotated by HUGO kinase gene symbol.
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273835
generate and validate very high-quality kinase activity predictors
for a large and diverse number of kinase targets.
From almost 500 000 kinase-related compounds, we
extracted over 180 diverse structureactivity data sets covering
all major groups of the human kinome. We applied dierent
machine learning techniques including naï
ve Bayes binary
classication and quantitative kNN and PLS regression.
Rigorous cross-validation demonstrated reliable predictors for
the majority of kinase targets if a minimum required number of
active compounds or structureactivity data points were
available. For the types of data sets investigated here, the best
results for the largest number of kinases were obtained using
Laplacien-corrected naï
ve Bayes classiers trained (for any
specic kinase) on known actives and a large background set of
kinase-family focused presumed inactive compounds (KA-PI
approach). This method resulted in very good ROC scores and
very high enrichment rates for the majority of targets in
particular for data sets with greater than 50 active compounds.
The Laplacien-modied naï
ve Bayes classiers generally
improved when increasing the numbers of actives to a few
hundred, but not much beyond that. Model domain
applicability studies suggested that the classiers are applicable
to novel compounds and provide guidance to interpret virtual
screening results. We applied the KA-PI classiers to
compounds recently proled in the NIH LINCS project and
found very good agreement of predicted kinase inhibition to
actual screening results. Using kNN and PLS regression, we
also obtained high-quality models for a large number of diverse
kinase targets; in particular for data sets with greater than 50
structureactivity records and spanning a wide activity range.
All data employed here were derived from biochemical
concentrationresponse human (nonmutant) protein kinase
assays. The data sets combined dierent assay methods,
technologies, and various experimental procedures from
numerous laboratories. After preprocessing of the data using
a rigorous data standardization and aggregation procedure, our
results indicate that the data sets are very well suitable to
develop highly accurate predictors employing dierent machine
learning techniques and are, by that measure, of high quality.
The various kinase screening technologies and conditions
appear to generate, on average, consistent results. This is
supported by excellent predictivity of the various models and
very good agreement of predicted kinase activity modeled on
heterogeneous data compared to the KINOMEScan LINCS
results, which were all generated under the same conditions
using the same competition binding assay. However,
heterogeneity of the data may be one reason why we observe
a distinct increase of model performance with a minimum
number of active compounds or structureactivity data points.
Depending on the number of actual actives in the specic data
sets from which the models were built, the best KA-PI
classiers were able to retrieve between 40 and 80% true
positives among only 0.1% of all compounds. These results
suggest that they are practically applicable with great potential
for virtual screening of large libraries across the kinome.
Moreover, our results suggest that the KA-PI predictors are
Figure 9. Probability (EstPGood) of compounds predicted active against a kinase based on KA-PI kinase classiers and actual KINOMEScan
percent inhibition values (at 10 μM). Compare with Supporting Information Table S8. Kinases are classied by groups and protein vs nonprotein
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273836
useful to complement actual kinase proling screening results,
for example to identify likely false negatives.
SSupporting Information
Additional supporting Figures S1S12 and supporting Tables
S1S8 including all cross-validation results for all classication
and quantitative models, domain applicability, ROC curves, and
comparisons of predictions and kinase proling results. This
material is available free of charge via the Internet at http://
Corresponding Author
The authors declare no competing nancial interest.
This work was supported by the NIH grant U01 HL111561
(NIH LINCS program) and by the Center for Computational
Science of the University of Miami.
(1) Manning, G.; Whyte, D. B.; Martinez, R.; Hunter, T.;
Sudarsanam, S. The protein kinase complement of the human
genome. Science 2002,298, 19121934.
(2) Akritopoulou-Zanze, I.; Hajduk, P. J. Kinase-targeted libraries: the
design and synthesis of novel, potent, and selective kinase inhibitors.
Drug Discovery Today 2009,14, 291297.
(3) Zhang, J.; Yang, P. L.; Gray, N. S. Targeting cancer with small
molecule kinase inhibitors. Nat. Rev. Cancer 2009,9,2839.
(4) Cohen, P. Protein kinases-the major drug targets of the twenty-
first century? Nat. Rev. Drug Discovery 2002,1, 309315.
(5) Knight, Z. A.; Lin, H.; Shokat, K. M. Targeting the cancer kinome
through polypharmacology. Nat. Rev. Cancer 2010,10, 130137.
(6) Morphy, R. Selectively Nonselective Kinase Inhibition: Striking
the Right Balance. J. Med. Chem. 2010,53, 14131437.
(7) Davies, S. P.; Reddy, H.; Caivano, M.; Cohen, P. Specificity and
mechanism of action of some commonly used protein kinase
inhibitors. Biochem. J. 2000,351,95105.
(8) Zuccotto, F.; Ardini, E.; Casale, E.; Angiolini, M. Through the
gatekeeper door: exploiting the active kinase conformation. J. Med.
Chem. 2010,53, 26812694.
(9) Brylinski, M.; Skolnick, J. Comprehensive Structural and
Functional Characterization of the Human Kinome by Protein
Structure Modeling and Ligand Virtual Screening. J. Chem. Inf.
Model 2010,50, 1839.
(10) Goldstein, D. M.; Gray, N. S.; Zarrinkar, P. P. High-throughput
kinase profiling as a platform for drug discovery. Nat. Rev. Drug
Discovery 2008,7, 391397.
(11) Karaman, M. W.; Herrgard, S.; Treiber, D. K.; Gallant, P.;
Atteridge, C. E.; Campbell, B. T.; Chan, K. W.; Ciceri, P.; Davis, M. I.;
Edeen, P. T.; Faraoni, R.; Floyd, M.; Hunt, J. P.; Lockhart, D. J.;
Milanov, Z. V; Morrison, M. J.; Pallares, G.; Patel, H. K.; Pritchard, S.;
Wodicka, L. M.; Zarrinkar, P. P. A quantitative analysis of kinase
inhibitor selectivity. Nat. Biotechnol. 2008,26, 127132.
Figure 10. Aggregated predicted probabilities (EstPGood) of compounds being active against kinases (based on the KA-PI classiers) as a function
of the actual KINOMEScan percent inhibition ranges by category of kinase group and protein vs nonprotein kinase. 4796 activity data points for 43
compounds are mapped to KA-PI models (not all compounds were tested against the same number of targets).
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273837
(12) Fedorov, O.; Marsden, B.; Pogacic, V.; Rellos, P.; Müller, S.;
Bullock, A. N.; Schwaller, J.; Sundström, M.; Knapp, S. A systematic
interaction map of validated kinase inhibitors with Ser/Thr kinases.
Proc. Natl. Acad. Sci. U.S.A. 2007,104, 2052320528.
(13) Fabian, M. A.; Biggs, W. H., 3rd; Treiber, D. K.; Atteridge, C. E.;
Azimioara, M. D.; Benedetti, M. G.; Carter, T. A.; Ciceri, P.; Edeen, P.
T.; Floyd, M.; Ford, J. M.; Galvin, M.; Gerlach, J. L.; Grotzfeld, R. M.;
Herrgard, S.; Insko, D. E.; Insko, M. A.; Lai, A. G.; Lelias, J. M.; Mehta,
S. A.; Milanov, Z. V; Velasco, A. M.; Wodicka, L. M.; Patel, H. K.;
Zarrinkar, P. P.; Lockhart, D. J. A small molecule-kinase interaction
map for clinical kinase inhibitors. Nat. Biotechnol. 2005,23, 329336.
(14) Library of Integrated Network-based Cellular Signatures
(LINCS). (accessed Nov 30, 2012).
(15) HMS LINCS DataBase.
(accessed Nov 30, 2012).
(16) LINCS Information FramEwork (LIFE).
(accessed Nov 30, 2012).
(17) Vieth, M.; Sutherland, J. J.; Robertson, D. H.; Campbell, R. M.
Kinomics: characterizing the therapeutically validated kinase space.
Drug Discovery Today 2005,10, 839846.
(18) Bamborough, P.; Drewry, D.; Harper, G.; Smith, G. K.;
Schneider, K. Assessment of Chemical Coverage of Kinome Space and
Its Implications for Kinase Drug Discovery. J. Med. Chem. 2008,51,
(19) Vieth, M.; Higgs, R. E.; Robertson, D. H.; Shapiro, M.; Gragg, E.
A.; Hemmerle, H. Kinomics-structural biology and chemogenomics of
kinase inhibitors and targets. Biochim. Biophys. Acta 2004,1697, 243
(20) Posy, S. L.; Hermsmeier, M. A.; Vaccaro, W.; Ott, K. H.;
Todderud, G.; Lippy, J. S.; Trainor, G. L.; Loughney, D. A.; Johnson,
S. R. Trends in Kinase Selectivity: Insights for Target Class-Focused
Library Screening. J. Med. Chem. 2011,54,5466.
(21) Manallack, D. T.; Pitt, W. R.; Gancia, E.; Montana, J. G.;
Livingstone, D. J.; Ford, M. G.; Whitley, D. C. Selecting screening
candidates for kinase and G protein-coupled receptor targets using
neural networks. J. Chem. Inf. Comput. Sci. 2002,42, 12561262.
(22) Xia, X.; Maliski, E. G.; Gallant, P.; Rogers, D. Classification of
kinase inhibitors using a Bayesian model. J. Med. Chem. 2004,47,
(23) Briem, H.; Gunther, J. Classifying kinase inhibitor-likenessby
using machine-learning methods. Chembiochem 2005,6, 558566.
(24) Sutherland, J. J.; Higgs, R. E.; Watson, I.; Vieth, M. Chemical
Fragments as Foundations for Understanding Target Space and
Activity Prediction. J. Med. Chem. 2008,51, 26892700.
(25) Aronov, A. M.; McClain, B.; Moody, C. S.; Murcko, M. A.
Kinase-likeness and kinase-privileged fragments: toward virtual
polypharmacology. J. Med. Chem. 2008,51, 12141222.
(26) Kinase Knowledge Base (Q4 2009). http://eidogen-sertanty.
com/kinasekb.php (accessed Nov 30, 2012).
(27) Pipeline Pilot 8.0, version 8.0; Accelrys: San Diego, CA, 2010.
(28) Glick, M.; Jenkins, J. L.; Nettles, J. H.; Hitchings, H.; Davies, J.
W. Enrichment of high-throughput screening data with increasing
levels of noise using support vector machines, recursive partitioning,
and Laplacian-modified naive Bayesian classifiers. J. Chem. Inf. Model.
2006,46, 193200.
(29) Cramer, R. D., III Partial Least Squares (PLS): Its strengths and
limitations. Perspect. Drug Discovery Des. 1993,1, 269278.
(30) Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby,
E.; Schuffenhauer, A. Comparison of topological descriptors for
similarity-based virtual screening using multiple bioactive reference
structures. Org. Biomol. Chem. 2004,2, 32563266.
(31) Rogers, D.; Brown, R. D.; Hahn, M. Using extended-
connectivity fingerprints with Laplacian-modified Bayesian analysis in
high-throughput screening follow-up. J. Biomol. Screen. 2005,10, 682
(32) Sastry, M.; Lowrie, J. F.; Dixon, S. L.; Sherman, W. Large-Scale
Systematic Analysis of 2D Fingerprint Methods and Parameters to
Improve Virtual Screening Enrichments. J. Chem. Inf. Model. 2010,50,
(33) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints.
Chem. Inf. Model 2010,50, 742754.
(34) Li, Q.; Cheng, T.; Wang, Y.; Bryant, S. H. PubChem as a public
resource for drug discovery. Drug Discovery Today 2010,15, 1052
(35) Liu, Y.; Gray, N. S. Rational design of inhibitors that bind to
inactive kinase conformations. Nat. Chem. Biol. 2006,2, 358364.
Journal of Chemical Information and Modeling Article |J. Chem. Inf. Model. 2013, 53, 273838
    • "and the LIFE project website 8 . To characterize the diversity in chemical space of the tested LINCS compounds, we generated a histogram of their pairwise chemical similarities based on the Tanimoto metric using extended-connectivity fingerprints of length 4 (ECFP4) (Rogers and Hahn, 2010). Based on unique LSM IDs we identified overlap of screened compounds among the different LINCS assays. "
    [Show abstract] [Hide abstract] ABSTRACT: The Library of Integrated Network-based Cellular Signatures (LINCS) project is a large-scale coordinated effort to build a comprehensive systems biology reference resource. The goals of the program include the generation of a very large multidimensional data matrix and informatics and computational tools to integrate, analyze, and make the data readily accessible. LINCS data include genome-wide transcriptional signatures, biochemical protein binding profiles, cellular phenotypic response profiles and various other datasets for a wide range of cell model systems and molecular and genetic perturbations. Here we present a partial survey of this data facilitated by data standards and in particular a robust compound standardization workflow; we integrated several types of LINCS signatures and analyzed the results with a focus on mechanism of action (MoA) and chemical compounds. We illustrate how kinase targets can be related to disease models and relevant drugs. We identified some fundamental trends that appear to link Kinome binding profiles and transcriptional signatures to chemical information and biochemical binding profiles to transcriptional responses independent of chemical similarity. To fill gaps in the datasets we developed and applied predictive models. The results can be interpreted at the systems level as demonstrated based on a large number of signaling pathways. We can identify clear global relationships, suggesting robustness of cellular responses to chemical perturbation. Overall, the results suggest that chemical similarity is a useful measure at the systems level, which would support phenotypic drug optimization efforts. With this study we demonstrate the potential of such integrated analysis approaches and suggest prioritizing further experiments to fill the gaps in the current data.
    Full-text · Article · Sep 2014
    • "All these studies achieved good prediction performances: from 0.67 to 0.73 correlation coefficient in Lapins and Wikberg (2010); accuracy between 74 and 81% and matthews correlation coefficient (MCC) between 0.3 and 0.48 in different tested datasets and with different encodings and learning methods in Niijima et al. (2012); 94% accuracy and 0.98 area under the ROC curve (auROC) in Cao et al. (2013). In Schürer and Muskal (2013), the auROC for individual kinase models vary from around 0.93 to 1, and the prediction accuracy showed a positive correlation with the number of known inhibitors available for training. In Yabuuchi et al. (2011), some predicted novel inhibitors for the epidermal growth factor receptor kinase and the cyclin-dependent kinase 2 were experimentally confirmed, sometimes showing scaffold hopping (i.e., having radically different characteristics than known inhibitors). "
    [Show abstract] [Hide abstract] ABSTRACT: The central role of kinases in virtually all signal transduction networks is the driving motivation for the development of compounds modulating their activity. ATP-mimetic inhibitors are essential tools for elucidating signaling pathways and are emerging as promising therapeutic agents. However, off-target ligand binding and complex and sometimes unexpected kinase/inhibitor relationships can occur for seemingly unrelated kinases, stressing that computational approaches are needed for learning the interaction determinants and for the inference of the effect of small compounds on a given kinase. Recently published high-throughput profiling studies assessed the effects of thousands of small compound inhibitors, covering a substantial portion of the kinome. This wealth of data paved the road for computational resources and methods that can offer a major contribution in understanding the reasons of the inhibition, helping in the rational design of more specific molecules, in the in silico prediction of inhibition for those neglected kinases for which no systematic analysis has been carried yet, in the selection of novel inhibitors with desired selectivity, and offering novel avenues of personalized therapies.
    Full-text · Article · Jun 2014
    • "ions are the global mapping of pharmacological space by Paolini and co- workers, [3] the Similarity Ensemble Approach (SEA), [4] the Bayesian models for adverse drug reactions by Bender and coworkers, [5] the models used for polypharmacological optimization by Hopkins et al., [6] and the kinome-wide activity modeling studies by Schuerer and Muskal. [7] These methods can be used to predict off-target effects based on heterogeneous public activity data and chemical similarity analysis. Usually, public off-target toxicity models like human Ether-a `-go-go-Related Gene (hERG) [8] and cytochrome P450 (CYP) models [9,10] are based and validated on mixed public IC 50 data, since there is not"
    [Show abstract] [Hide abstract] ABSTRACT: The biochemical half maximal inhibitory concentration (IC50) is the most commonly used metric for on-target activity in lead optimization. It is used to guide lead optimization, build large-scale chemogenomics analysis, off-target activity and toxicity models based on public data. However, the use of public biochemical IC50 data is problematic, because they are assay specific and comparable only under certain conditions. For large scale analysis it is not feasible to check each data entry manually and it is very tempting to mix all available IC50 values from public database even if assay information is not reported. As previously reported for Ki database analysis, we first analyzed the types of errors, the redundancy and the variability that can be found in ChEMBL IC50 database. For assessing the variability of IC50 data independently measured in two different labs at least ten IC50 data for identical protein-ligand systems against the same target were searched in ChEMBL. As a not sufficient number of cases of this type are available, the variability of IC50 data was assessed by comparing all pairs of independent IC50 measurements on identical protein-ligand systems. The standard deviation of IC50 data is only 25% larger than the standard deviation of Ki data, suggesting that mixing IC50 data from different assays, even not knowing assay conditions details, only adds a moderate amount of noise to the overall data. The standard deviation of public ChEMBL IC50 data, as expected, resulted greater than the standard deviation of in-house intra-laboratory/inter-day IC50 data. Augmenting mixed public IC50 data by public Ki data does not deteriorate the quality of the mixed IC50 data, if the Ki is corrected by an offset. For a broad dataset such as ChEMBL database a Ki- IC50 conversion factor of 2 was found to be the most reasonable.
    Full-text · Article · Apr 2013
Show more