Cardiovascular & Hematological Agents in Medicinal Chemistry, 2007, 5, 11-19
1871-5257/07 $50.00+.00 © 2007 Bentham Science Publishers Ltd.
Computer Prediction of Cardiovascular and Hematological Agents by
Statistical Learning Methods
X. Chen1,2, H. Li1, C.W. Yap1, C.Y. Ung1,3, L. Jiang1, Z.W. Cao4, Y.X. Li4 and Y.Z. Chen1,4,*
1Bioinformatics and Drug Design Group, Department of Pharmacy and Department of Computational Science,
National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543;
2College of Life Sciences, Zhejiang University, No.368 Zijinghua Road, Hangzhou, Zhejiang, P. R. China 310058;
3Department of Biochemistry, The Yong Loo Lin School of Medicine, National University of Singapore,
Blk MD7, #02-03, 8 Medical Drive, Singapore, 117597;
4Shanghai Center for Bioinformation Technology, Shanghai, P. R. China 201203
Abstract: Computational methods have been explored for predicting agents that produce therapeutic or adverse effects in
cardiovascular and hematological systems. The quantitative structure-activity relationship (QSAR) method was the first
statistical learning method successfully used for predicting various classes of cardiovascular and hematological agents.
In recent years, more sophisticated statistical learning methods have been explored for predicting cardiovascular and
hematological agents, particularly those of diverse structures that might not be straightforwardly modelled by single
QSAR models. These methods include partial least squares, multiple linear regression, linear discriminant analysis,
k-nearest neighbour, artificial neural networks and support vector machines. Their potential has been demonstrated in
the prediction of various classes of cardiovascular and hematological agents, including 1,4-dihydropyridine calcium
channel antagonists, angiotensin-converting enzyme inhibitors, thrombin inhibitors, AChE inhibitors, HERG potassium
channel inhibitors and blockers, potassium channel openers, platelet aggregation inhibitors, protein kinase inhibitors,
dopamine antagonists and torsade de pointes-causing agents. This article reviews the strategies, current progress and
problems in using statistical learning methods for predicting cardiovascular and hematological agents. It also evaluates
algorithms for properly representing and extracting the structural and physicochemical properties of compounds relevant
to the prediction of cardiovascular and hematological agents.
Key Words: Statistical learning methods, cardiovascular agents, haematological agents, pharmacodynamic, pharmacokinetic,
QSAR.
INTRODUCTION
Cardiovascular diseases are the main causes of morbidity and mortality in the world [1].
Drug-induced hematological reactions often lead to type II, III and IV hemolytic anemia,
hypersensitivity, agranulocytosis, thrombocytopenia and aplastic anemia [2]. Identification of
agents that produce therapeutic or adverse effects in cardiovascular and hematological systems
is important for designing new drugs and for detecting potentially harmful agents. Efforts have
been made to explore computational methods for predicting various classes of cardiovascular and
hematological agents [3-8]. In particular, statistical learning methods have shown promising
potential for performing these tasks [7-10] as well as for predicting agents of other
pharmaceutical applications, toxicological properties, and pharmacokinetic profiles [11].
More sophisticated statistical learning methods have been introduced to complement conventional
quantitative structure-activity relationship (QSAR) methods [4, 5, 12] for covering more diverse
ranges of cardiovascular and haematological agents [7-10]. In contrast to QSAR methods, these
statistical learning methods derive implicit statistical models capable of describing multiple
mechanisms and non-linear relationships between chemical structures and a particular activity,
which can be used for predicting new agents having the same activity [13-15]. Regression methods
can be incorporated into these statistical learning methods for deriving the activity level of
these agents [16-18].
*Address correspondence to this author at Bioinformatics and Drug Design
Group, Department of Pharmacy and Department of Computational Science,
National University of Singapore, Blk S16, Level 8, 3 Science Drive 2,
Singapore 117543; Tel: 65-6516-6877; Fax: 65-6774-6756;
E-mail: phacyz@nus.edu.sg
MOLECULAR DESCRIPTORS FOR REPRESENTING
CHEMICAL AGENTS
Successful application of statistical learning methods depends on proper representation of the
structural and physicochemical features of chemical agents [19, 20]. Over 3,700 molecular
descriptors, computed from the 1D, 2D or 3D structure of an agent, have been developed to
quantitatively represent different structural and physicochemical features [19, 21-25]. These
descriptors range from constitutional descriptors such as molecular weight to more complex 2D
and 3D descriptors representing different geometric, connectivity, and physicochemical
properties. These molecular descriptors can be computed by using popular computer programs such
as DRAGON [23], Molconn-Z [22], JOELib [24], Xue descriptor set [19], and MODEL
(http://jing.cz3.nus.edu.sg/cgi-bin/model/model.cgi).
These descriptors can be divided into 18 classes, some of which contain members overlapping
with members of other classes. Examples of typical descriptor classes are constitutional
descriptors that include molecular weight and number
of hydrogen bond donors, geometrical descriptors that include volume and surface areas, and
topological descriptors such as the number of rings and rotatable bonds. Many descriptor classes
contain descriptors of mixed properties frequently used for collectively representing various
QSAR and QSPR models. RDF descriptors represent inter-atomic distances in the entire molecule
and other useful information such as bond distances, ring types, planar and non-planar systems,
atom types and molecular weight [26], molecular walk counts [27]. 3D-MoRSE descriptors describe
features such as molecular weight, van der Waals volume, electronegativities and polarizabilities
[28]. BCUT descriptors represent connectivity information and atomic properties relevant to
intermolecular interaction [29]. WHIM descriptors describe size, shape, symmetry, atom
distribution and polarizability of a molecule [30]. Other useful descriptor classes are Galvez
topological charge indices and charge descriptors [31], GETAWAY descriptors [32], 2D
autocorrelations, functional groups, atom-centred descriptors, aromaticity indices [33], Randic
molecular profiles [34], electrotopological state descriptors [35], and linear solvation energy
relationship descriptors [36].
FEATURE SELECTION METHODS
Normally, only a fraction of these descriptors is needed for representing features of a
particular class of agents. Features useful for agents of a particular activity can be selected
either by intuition, as in conventional QSAR studies, or by using feature selection methods.
The commonly used feature selection methods include recursive feature elimination (RFE) [37],
the genetic algorithm-based approach (GA) [38], and the simulated annealing-based approach (SA)
[39]. Some of these methods, particularly RFE, have gained popularity due to their effectiveness
for discovering informative features in the analysis of drug activity [37, 40] and
pharmacokinetic and toxicological properties [19, 20, 41-44].
These feature selection methods are primarily based on the following strategy: First, a
statistical learning model is generated by using either all or a few of a selected set of
descriptors as the starting set. This model is then used to rank the contribution of these
descriptors. For the all-descriptors starting set, descriptors contributing the least to a
studied property are eliminated. For the few-descriptors starting set, those contributing the
most to a studied property are retained and the rest are eliminated. The process proceeds to the
next step to construct a new machine learning model by using either the reduced set of
descriptors for the all-descriptors starting set or the retained set of descriptors plus newly
added descriptors for the few-descriptors starting set. This new model is subsequently used to
rank and then eliminate or add descriptors. This iteration continues until all of the irrelevant
descriptors are eliminated or all of the relevant descriptors are added.
The ranking of descriptors can be illustrated by using the recursive feature elimination (RFE)
method as an example. Descriptor ranking in RFE is based on the magnitude of the change of an
objective function of a statistical learning model upon removing each descriptor (which roughly
measures the extent of contribution of each feature to the prediction capability of the model)
[45]. A greater change in the objective function indicates a more significant effect on the
prediction capability of the model, and thus the corresponding descriptor is ranked higher.
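The all-descriptors variant of this strategy can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of any cited study: the objective function (training accuracy of a nearest-centroid classifier) and the names `centroid_accuracy` and `rfe` are hypothetical, and the ranking uses the signed drop in the objective rather than its magnitude, so that a descriptor whose removal improves the model is eliminated first. Both classes (+1 and -1) are assumed to be present in the data.

```python
def centroid_accuracy(X, y, features):
    """Toy objective (an assumption): training accuracy of a nearest-centroid
    classifier restricted to the descriptor indices in `features`."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(r[f] for r in rows) / len(rows) for f in features]

    pos, neg = centroid(+1), centroid(-1)
    correct = 0
    for x, t in zip(X, y):
        d_pos = sum((x[f] - c) ** 2 for f, c in zip(features, pos))
        d_neg = sum((x[f] - c) ** 2 for f, c in zip(features, neg))
        predicted = +1 if d_pos < d_neg else -1
        correct += predicted == t
    return correct / len(X)


def rfe(X, y, n_keep, objective=centroid_accuracy):
    """Drop, one at a time, the descriptor whose removal costs the least."""
    features = list(range(len(X[0])))
    while len(features) > max(n_keep, 1):
        base = objective(X, y, features)
        # Rank each descriptor by how much the objective falls without it.
        drop = {f: base - objective(X, y, [g for g in features if g != f])
                for f in features}
        weakest = min(features, key=lambda f: drop[f])
        features.remove(weakest)
    return features
```

In practice the objective would come from the statistical learning model being trained (for example a cross-validated score), and several descriptors may be removed per iteration to save time.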
In many cases, it is difficult to uniquely select an optimal set of descriptors due to the high
redundancy and overlapping nature of many descriptors [46]. Separate sets of descriptors
containing different members of redundant descriptor classes have been found to give similar
prediction accuracies [47]. The interpretation of the prediction results in these cases is more
appropriately conducted at the descriptor class level, where redundant and overlapping
descriptors are grouped into one class [20, 41, 48].
COMMONLY USED STATISTICAL LEARNING METHODS
Linear Discriminant Analysis (LDA)
As shown in Fig. 1, LDA [49] separates two classes of feature vectors by constructing a
hyperplane defined by a linear discriminant function $L = \sum_{i=1}^{k} w_i x_i$, where $L$ is
the resultant classification score and $w_i$ is the weight associated with the corresponding
descriptor $x_i$. A positive or negative $L$ value indicates that a feature vector x belongs to
the positive or negative class, respectively.
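As a minimal sketch of how the discriminant score is applied once the weights are known, the snippet below evaluates L for a feature vector and assigns the class from its sign. The function names and the example weights are hypothetical, and deriving the weights from training data is not shown.

```python
def discriminant_score(w, x):
    """L = sum_i w_i * x_i for weight vector w and feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def lda_classify(w, x):
    """Positive score -> positive class (+1); otherwise negative class (-1)."""
    return +1 if discriminant_score(w, x) > 0 else -1


# Hypothetical weights for two descriptors.
w = [0.5, -1.0]
print(lda_classify(w, [4.0, 1.0]), lda_classify(w, [1.0, 2.0]))
```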
Multiple Linear Regressions (MLR)
An MLR model is developed under the assumption that there is a linear relationship between a
specific set of molecular descriptors of a compound, which is usually expressed as a feature
vector x with each descriptor as its component, and a particular property, y. An MLR model can
be described using the following equation:
$\hat{y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
where $\{X_1, \dots, X_k\}$ are molecular descriptors, $\beta_0$ is the regression model
constant, and $\beta_1$ to $\beta_k$ are the coefficients corresponding to the descriptors $X_1$
to $X_k$. The values for $\beta_0$ to $\beta_k$ are chosen by minimizing the sum of squares of
the vertical distances of the points from the hyperplane so as to give the best prediction of y
from x.
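The least-squares fit described above can be sketched in plain Python by solving the normal equations. This is an illustrative implementation under simple assumptions (few descriptors, no exact collinearity, which would cause a division by zero); the names `fit_mlr` and `predict_mlr` are hypothetical.

```python
def fit_mlr(X, y):
    """Return [b0, b1, ..., bk] minimising the sum of squared residuals."""
    rows = [[1.0] + list(x) for x in X]  # prepend 1 for the constant term
    n = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y)
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def predict_mlr(beta, x):
    """Evaluate the fitted linear model for one feature vector x."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
```

When descriptors are strongly collinear, the normal equations become ill-conditioned, which is one motivation for the projection-based methods discussed next.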
Partial Least Squares (PLS)
PLS is similar to MLR in that it is also developed on the basis of a linear relationship between
a vector x and a particular property y. However, the problems of collinear descriptors are
avoided by calculating the principal components for the molecular descriptors and target
property separately. The scores for the molecular descriptors are then used as the feature
vector x for predicting the scores for the target property, which can then be used to predict y.
An important consideration in PLS is the appropriate number of principal components to be used
for the QSAR model. This is usually determined by using cross-validation methods such as 5-fold
cross-validation and leave-one-out. Comparative Molecular Field Analysis (CoMFA) [50] is a
popular 3D-QSAR technique which uses PLS as the data analysis method. In CoMFA, compounds are
aligned to a common substructure
and the magnitudes of the steric and electrostatic fields of
each compound are sampled at regular intervals and used as
molecular descriptors.
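The leave-one-out procedure mentioned above for choosing the number of components can be sketched generically. The snippet does not implement PLS: `fit` and `predict` are hypothetical callables standing in for a model with a fixed number of components, and the mean predictor is only a placeholder. The accumulated quantity is commonly called PRESS (predictive residual sum of squares).

```python
def loo_press(X, y, fit, predict):
    """Sum of squared leave-one-out prediction errors (PRESS)."""
    press = 0.0
    for i in range(len(X)):
        # Train on every compound except the i-th, then predict the i-th.
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        press += (y[i] - predict(model, X[i])) ** 2
    return press


# Placeholder model (an assumption): always predicts the mean training target.
def fit_mean(train_X, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model
```

Under this scheme one would compute `loo_press` once per candidate number of components and keep the setting with the lowest value.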
k Nearest Neighbor (kNN)
In kNN, the Euclidean distance between an unclassified vector x and each individual vector
$x_i$ in the training set is measured [51, 52]. The k vectors nearest to the unclassified
vector x are used to determine its class: the class of the majority of the k nearest neighbors
is chosen as the predicted class of x.
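A minimal sketch of this majority-vote rule is given below, assuming numeric feature vectors of equal length and a small training set held in memory. The function name `knn_predict` and the default k = 3 are illustrative choices, not values taken from the cited studies.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the class of x from its k nearest training vectors."""
    # Euclidean distance from x to every training vector, nearest first.
    neighbours = sorted(
        (math.dist(xi, x), label) for xi, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

An odd k avoids tied votes in two-class problems; descriptors are usually scaled beforehand so that no single descriptor dominates the distance.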
Artificial Neural Network (ANN), Neural Network (NN),
Principal component ANN (PCANN)
An artificial neural network (ANN) or neural network (NN) is an information-processing paradigm
inspired by the way the densely interconnected, parallel structure of the mammalian brain
processes information. As shown in Fig. 2, an NN consists of a set of highly interconnected
entities, called nodes or units. Each unit is designed to mimic its biological counterpart, the
neuron, mathematically. Each node accepts a weighted set of inputs and responds with an output
[53]. The principal component-artificial neural network (PC-ANN) was proposed to improve
training speed and decrease the overall calibration error [54]. In this method, the input data
are subjected to principal component analysis (PCA) before being introduced into the neural
network, and the most significant principal components of the original data matrix are selected
and used as ANN input.
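The behaviour of a node (a weighted set of inputs mapped to one output) can be sketched as a forward pass through a small network. The sigmoid activation, the bias term and the layer layout are assumptions made for illustration; the text does not specify them, and training of the weights is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def node_output(weights, bias, inputs):
    """One node: weighted sum of its inputs passed through an activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def forward(layers, inputs):
    """Propagate a descriptor vector through successive layers.
    Each layer is a list of (weights, bias) pairs, one pair per node."""
    activations = inputs
    for layer in layers:
        activations = [node_output(w, b, activations) for w, b in layer]
    return activations


# Hypothetical 2-input network with two hidden nodes and one output node.
layers = [
    [([1.0, -1.0], 0.0), ([0.5, 0.5], 0.0)],
    [([1.0, 1.0], -1.0)],
]
print(forward(layers, [0.2, 0.7]))
```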
Fig. (1). Schematic diagram illustrating the process of the prediction of chemical agents with a
cardiovascular or haematological property from their structure by using a statistical learning
method, linear discriminant analysis (LDA). A, B: feature vectors of agents with the property;
E, F: feature vectors of agents without the property; feature vector (hj, pj, vj,…) represents
such structural and physicochemical properties as hydrophobicity, volume, polarizability, etc.
Fig. (2). Schematic diagram illustrating the process of the prediction of chemical agents with a
cardiovascular or haematological property from their structure by using a statistical learning
method, neural networks (NN). A, B, E, F and (hj, pj, vj,…) are the same as those in Fig. 1.
Support Vector Machine (SVM)
There are two types of SVM algorithms, linear and nonlinear SVM. Nonlinear SVM, which is
illustrated in Fig. 3, is more useful for chemical agents of diverse structures and thus more
extensively used [13, 20, 37, 48]. Linear SVM constructs a hyperplane separating two different
classes of feature vectors with a maximum margin [55, 56]. This hyperplane is constructed by
finding a vector $w$ and a parameter $b$ that minimize $\|w\|^2$ subject to the following
conditions: $w \cdot x_i + b \geq +1$ for $y_i = +1$ (positive class) and
$w \cdot x_i + b \leq -1$ for $y_i = -1$ (negative class). Here $x_i$ is a feature vector,
$y_i$ is the group index, $w$ is a vector normal to the hyperplane, $|b| / \|w\|$ is the
perpendicular distance from the hyperplane to the origin and $\|w\|^2$ is the squared Euclidean
norm of $w$. Nonlinear SVM projects feature vectors into a high dimensional feature space by
using a kernel function such as $K(x_i, x_j) = e^{-\|x_j - x_i\|^2 / 2\sigma^2}$. The linear SVM
procedure is then applied to the feature vectors in this feature space. After the determination
of $w$ and $b$, a given vector x can be classified by using $\mathrm{sign}[(w \cdot x) + b]$; a
positive or negative value indicates that the vector x belongs to the positive or negative
class, respectively.
Fig. (3). Schematic diagram illustrating the process of the prediction of chemical agents with a
cardiovascular or haematological property from their structure by using a statistical learning
method, support vector machines (SVM). A, B, E, F and (hj, pj, vj,…) are the same as those in
Fig. 1.
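The pieces of this formulation that can be evaluated directly are sketched below: the Gaussian kernel, the margin conditions, and the sign-based decision rule for a linear SVM. Finding w and b requires solving a quadratic optimisation problem, which is not shown; the function names and the example values of w, b and sigma are hypothetical.

```python
import math


def rbf_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xj - xi||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def decision_value(w, b, x):
    """w . x + b for a linear SVM with known w and b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_classify(w, b, x):
    """sign[(w . x) + b]: +1 for the positive class, -1 for the negative."""
    return +1 if decision_value(w, b, x) >= 0 else -1


def satisfies_margin(w, b, X, y):
    """Check y_i * (w . x_i + b) >= 1 for every training vector."""
    return all(yi * decision_value(w, b, xi) >= 1 for xi, yi in zip(X, y))
```

The two margin conditions in the text collapse into the single inequality used in `satisfies_margin` because y_i is either +1 or -1.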
PREDICTION PERFORMANCE
The reported studies on the use of statistical learning methods for predicting cardiovascular
and haematological agents can be divided into two groups. One group includes
classification-based statistical learning methods that predict cardiovascular and haematological
agents without providing the activity level of the predicted agents. The second group includes
regression-based statistical learning methods that estimate the activity level in addition to
classifying whether or not a compound is a cardiovascular or haematological agent.
Table 1 summarises the performance of classification-based methods for predicting cardiovascular
and haematological agents. These agents include HERG potassium channel inhibitors, calcium
channel antagonists, torsade de pointes-causing agents, protein kinase inhibitors and dopamine
antagonists. Inhibition of the HERG potassium channel can lead to prolongation of the QT
interval, which might trigger torsade de pointes arrhythmia [57]. Calcium channel antagonists
control calcium-dependent biological events by blocking the flux of calcium from the
extracellular medium to the cell cytoplasm, which has implications for the treatment of such
cardiovascular diseases as variant and exertional angina, certain types of cardiac arrhythmias,
and hypertension [58]. Torsade de pointes is an atypical rapid ventricular tachycardia with
periodic waxing and waning of amplitude of the QRS complexes on the electrocardiogram as well as
rotation of the complexes about the isoelectric line [59]. Protein kinases such as protein
kinase C and MAPK induce vascular contraction and increase blood pressure [60-62]. Inhibitors of
such protein kinases can thus be used as agents for lowering blood pressure. Dopamine produces
mixed cardiovascular effects such as vasodilation. Dopamine antagonists inhibit dopamine
receptors and thus enhance the synthesis of dopamine, which is clinically used in the treatment
of circulatory shock [63].
As shown in Table 1, the reported overall prediction accuracies of classification-based methods
are in the range of 71% to 100%, with the majority concentrated in the range of 81% to 98%.
These are similar to the reported accuracy of the prediction of compounds with other
pharmacodynamic, pharmacokinetic and toxicological properties by statistical
learning methods [11, 64]. These results suggest that the classification methods surveyed here
have a certain level of capability for predicting cardiovascular and haematological agents.
Table 2 summarises the performance of the reported works that used regression methods for
predicting cardiovascular and haematological agents. These agents include calcium channel
antagonists, angiotensin-converting enzyme (ACE) inhibitors, thrombin inhibitors, potassium
channel openers, dopamine antagonists, hERG potassium channel blockers and platelet aggregation
inhibitors. Angiotensin-converting enzyme is a membrane-bound enzyme on the surface of vascular
endothelial cells, mainly of the lung. ACE converts angiotensin I to angiotensin II that result
in vaso-
Table 1. Performance of Classification-Based Statistical Learning Methods for Predicting Cardiovascular and Hematological Agents

| Property | Method | Molecular descriptors | No. of compounds in training set | Validation method (No. of compounds in validation set) a | Reported overall prediction accuracy |
|---|---|---|---|---|---|
| HERG potassium channel inhibitors | SVM [7] | MOE 2D descriptors, molecular fragment-count descriptors, S log P | 73 | Validation (73+414) | 86~97% |
| | LDA [16] | Two topological descriptors, one geometric descriptor, three quantum chemical descriptors, and one electrostatic descriptor | 45 | LOO (45) | 86.7% |
| | LS-SVM [16] | Two topological descriptors, one geometric descriptor, three quantum chemical descriptors, and one electrostatic descriptor | 45 | LOO (45) | 91.1% |
| Calcium channel antagonists | PCA [8] | Molecular polarizability, Verloop minimum width and length of the substituent, rotational barrier, net atomic charge, frontier electron and orbital densities, and molecular hardness | 45 | -- | 82-100% |
| | NN [8] | Molecular polarizability, Verloop minimum width and length of the substituent, rotational barrier, net atomic charge, frontier electron and orbital densities, and molecular hardness | 45 | LOO (45) | 77-100% |
| Torsade de pointes causing agents | SVM [78] | Linear solvation energy relationship | 271 | Validation set (78) | 91.0% |
| Protein kinase inhibitors | Consensus NN [79] | 20 standard BCUT descriptors | 480 | Validation set (297) | 98.7% |
| | ANN [80] | Topological structural fragments based on the enumeration of all possible substructures from a chemical structure and their numerical characterization | 1227 | Validation set (137) | 81% |
| Dopamine antagonists | ANN [81] | Structural and topological descriptors | 1022 | Validation set (113) | 71.7% |
| | LDA [81] | Structural and topological descriptors | 1022 | Validation set (113) | 72.6% |

Abbreviations: HERG – human ether-a-go-go-related gene; LDA – linear discriminant analysis; PCA – principal component analysis; NN – neural network; ANN – artificial neural network; SVM – support vector machine; LS-SVM – least squares support vector machines; LOO – leave-one-out; BCUT – Burden-CAS-University of Texas eigenvalues.
a – number in parentheses denotes the number of compounds used for model validation.