Comparison of Ridge Regression, Partial Least-Squares, Pairwise Correlation, Forward- and Best Subset Selection Methods for Prediction of Retention Indices for Aliphatic Alcohols

Institute of Chemistry, Chemical Research Center, Hungarian Academy of Sciences, H-1525 Budapest, P.O. Box 17, Hungary.
Journal of Chemical Information and Modeling (Impact Factor: 3.74). 05/2005; 45(2):339-46. DOI: 10.1021/ci049827t
Source: PubMed


A quantitative structure-retention relationship (QSRR) study based on multiple linear regression (MLR) was performed for the description and prediction of Kováts retention indices (RI) of alcohols. The alcohols were saturated, linear or branched, with the hydroxyl group on a primary, secondary or tertiary carbon atom. Constitutive and weighted holistic invariant molecular (WHIM) descriptors were used to represent the structure of the alcohols in the MLR models. Before model building, five variable selection methods were applied to select the most relevant variables from a large set of descriptors, and the selected molecular properties were included in the MLR models. The efficiency of the variable selection methods was also compared. The selection methods were: ridge regression (RR), the partial least-squares method (PLS), the pair-correlation method (PCM), forward selection (FS) and best subset selection (BSS). The stability and validity of the MLR models were tested by leave-n-out cross-validation. Neither RR- nor PLS-selected variables were able to describe the Kováts retention index properly, and PCM gave reliable results for description but not for prediction. Models with good predictive ability were built using FS and BSS. The most relevant variables for the description and prediction of RIs were the mean electrotopological state index, the molecular mass, and WHIM indices characterizing size and shape.
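The winning strategy above (forward selection of descriptors for an MLR model, checked by cross-validation) can be sketched roughly as follows. This is a hedged illustration on synthetic data: the descriptor matrix, coefficients and variable names are invented, not the study's actual data.

```python
# Hedged sketch (synthetic data): forward selection of descriptors for an
# MLR retention-index model, validated by cross-validation. All names and
# numbers here are invented for illustration, not the study's data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_alcohols, n_descriptors = 60, 20
X = rng.normal(size=(n_alcohols, n_descriptors))  # stand-in constitutive/WHIM descriptors
# Synthetic retention index: depends on a few descriptors plus noise.
ri = 800 + 50 * X[:, 0] + 30 * X[:, 3] + 20 * X[:, 7] + rng.normal(scale=5, size=n_alcohols)

# Forward selection (FS): greedily add the descriptor that most improves the CV fit.
fs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                               direction="forward", cv=5)
fs.fit(X, ri)
selected = np.flatnonzero(fs.get_support())

# Leave-n-out-style validation, approximated here by 10-fold cross-validation.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
q2 = cross_val_score(LinearRegression(), X[:, selected], ri, cv=cv, scoring="r2").mean()
print("selected descriptors:", selected, "cross-validated R2: %.3f" % q2)
```

With a strong synthetic signal, the greedy search recovers the informative descriptors; on real descriptor sets the paper's point is precisely that RR- and PLS-based selection can fail where FS and BSS succeed.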

    • "Another aspect of model comparison was touched upon by Farkas and Héberger [7]: The comparison of modeling methods in their best performance did not correspond to the principle of parsimony. Simpler, less complicated models with smaller degrees of freedom could provide almost the same or even better results [7] [8]. "
    ABSTRACT: This paper describes the theoretical background, algorithm and validation of a recently developed novel ranking method based on the sum of ranking differences [TrAC Trends Anal. Chem. 2010; 29: 101-109]. The ranking is intended to compare models, methods, analytical techniques, panel members, etc., and it is entirely general. First, the objects to be ranked are arranged in the rows and the variables (for example, model results) in the columns of an input matrix. Then, the results of each model for each object are ranked in order of increasing magnitude. The difference between the rank of the model results and the rank of the known, reference or standard results is then computed. (If the golden-standard ranking is known, the rank differences can be computed easily.) Finally, the absolute values of the differences are summed for all models to be compared. The sum of ranking differences (SRD) arranges the models in a unique and unambiguous way. The closer the SRD value is to zero (i.e. the closer the ranking to the golden standard), the better the model. Proximity of SRD values indicates similarity of the models, whereas large variation implies dissimilarity. Generally, the average can be accepted as the golden standard in the absence of known or reference results, even if bias is present in the model results in addition to random error. Validation of the SRD method can be carried out by using simulated random numbers for comparison (a permutation test). A recursive algorithm calculates the discrete distribution for a small number of objects (n < 14), whereas the normal distribution is used as a reasonable approximation if the number of objects is large. The theoretical distribution is visualized for random numbers and can be used to identify SRD values of models that are far from random. The ranking and validation procedures are called Sum of Ranking Differences (SRD) and Comparison of Ranks by Random Numbers (CRNN), respectively.
    Journal of Chemometrics 04/2011; 25(4):151-158. DOI: 10.1002/cem.1320 · 1.50 Impact Factor
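The core SRD computation described in that abstract fits in a few lines. The sketch below is illustrative only; the data and function names are ours, not the paper's.

```python
# Minimal sketch of the sum of ranking differences (SRD) computation;
# data and names are illustrative, not the paper's.
import numpy as np

def ranks(values):
    """Rank values in order of increasing magnitude (1 = smallest)."""
    return np.argsort(np.argsort(values)) + 1

def srd(model_values, reference_values):
    """Sum of absolute rank differences between a model and the reference."""
    return int(np.abs(ranks(model_values) - ranks(reference_values)).sum())

reference = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # golden-standard results
model_a = np.array([1.1, 2.2, 2.9, 4.3, 5.1])    # preserves the reference ranking
model_b = np.array([5.0, 1.0, 3.0, 2.0, 4.0])    # scrambled ranking

print(srd(model_a, reference), srd(model_b, reference))  # smaller SRD = better model
```

Model A reproduces the reference ordering exactly (SRD = 0), while the scrambled model accumulates a large rank-difference sum, which is the dissimilarity signal the method exploits.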
  • ABSTRACT: In this study, a quantitative structure–retention relationship (QSRR) approach was used to find the best approximation and to predict gas chromatographic retention indices for O-, N-, and S-heterocyclic compounds on a standard nonpolar polydimethylsiloxane stationary phase. Boiling point (BP) and calculated properties were used to encode the structure of the compounds. Three- and two-dimensional calculated properties such as weighted holistic invariant molecular (WHIM) descriptors, geometry, topology, and atom-weights assembly (GETAWAY) descriptors, connectivity indices, and zero-dimensional constitutive descriptors were used. Variable subset selection (VSS) and partial least-squares (PLS) projections to latent structures were used to select the most significant variables from a large set of descriptors. Multiple linear regression (MLR) and PLS were applied to find the relationship between the selected properties and the gas chromatographic retention indices. PLS was not able to select the most important descriptors (boiling point or molecular weight). The predictive ability of the models was tested by cross-validation. Calculated descriptors alone were not able to give proper models; boiling point was always necessary for good prediction. PLS models containing boiling points were suitable for retention index prediction, whereas MLR did not give real linear models.
    Chemometrics and Intelligent Laboratory Systems 07/2004; 72(2):173-184. DOI: 10.1016/j.chemolab.2004.01.012 · 2.32 Impact Factor
  • ABSTRACT: We propose a new classification method for the prediction of drug properties, called random feature subset boosting for linear discriminant analysis (LDA). The main novelty of this method is the ability to overcome the problems with constructing ensembles of linear discriminant models based on generalized eigenvectors of covariance matrices. Such linear models are popular in building classification-based structure-activity relationships. The introduction of ensembles of LDA models allows for an analysis of more complex problems than by using single LDA, for example, those involving multiple mechanisms of action. Using four data sets, we show experimentally that the method is competitive with other recently studied chemoinformatic methods, including support vector machines and models based on decision trees. We present an easy scheme for interpreting the model despite its apparent sophistication. We also outline theoretical evidence as to why, contrary to the conventional AdaBoost ensemble algorithm, this method is able to increase the accuracy of LDA models.
    Journal of Chemical Information and Modeling 02/2006; 46(1):416-23. DOI:10.1021/ci050375+ · 3.74 Impact Factor
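The ensemble idea in that abstract can be approximated in a short sketch. The code below is our own AdaBoost-style simplification with LDA fitted on random feature subsets, not the paper's exact algorithm; the data and all names are synthetic and illustrative.

```python
# Simplified sketch (our own approximation, not the paper's algorithm):
# an AdaBoost-style ensemble in which each round fits an LDA model on a
# random subset of the features. Data and names are synthetic/illustrative.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_rfs_boost(X, y, n_rounds=25, subset_size=3, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                 # AdaBoost-style sample weights
    ensemble = []
    for _ in range(n_rounds):
        feats = rng.choice(X.shape[1], size=subset_size, replace=False)
        idx = rng.choice(n, size=n, p=w)    # weighted resample (plain LDA has no sample_weight)
        clf = LinearDiscriminantAnalysis().fit(X[idx][:, feats], y[idx])
        pred = clf.predict(X[:, feats])
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(np.where(pred == y, -alpha, alpha))
        w /= w.sum()
        ensemble.append((feats, clf, alpha))
    return ensemble

def predict_rfs_boost(ensemble, X):
    score = np.zeros(len(X))
    for feats, clf, alpha in ensemble:
        score += alpha * np.where(clf.predict(X[:, feats]) == 1, 1.0, -1.0)
    return (score > 0).astype(int)

# Toy binary "activity" data: the class depends on two of eight descriptors.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = fit_rfs_boost(X, y)
acc = (predict_rfs_boost(model, X) == y).mean()
print("training accuracy: %.2f" % acc)
```

Each weak learner sees only a few descriptors, so rounds that draw uninformative subsets earn near-zero weight, while informative subsets dominate the weighted vote, which is the intuition behind combining feature subsetting with boosting.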