-
ABSTRACT: Wavelength selection is a critical step for improving prediction performance on spectral data. Because vibrational and rotational spectra have continuous spectral bands, we propose a novel method of wavelength interval selection based on random frog, called interval random frog (iRF). To obtain all possible continuous intervals, the spectra are first divided into intervals by moving a window of fixed width over the whole spectrum. These overlapping intervals are ranked by applying random frog coupled with PLS, and the optimal ones are chosen. The method has been applied to two near-infrared spectral datasets, where it showed higher efficiency in wavelength interval selection than the other methods compared. The source code of iRF can be freely downloaded for academic research at: http://code.google.com/p/multivariate-calibration/downloads/list.
Spectrochimica Acta Part A Molecular and Biomolecular Spectroscopy 03/2013; 111C:31-36. · 2.10 Impact Factor
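The interval-generation step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `moving_window_intervals` and the step size of one point per shift are assumptions, since the abstract only states that a window of fixed width is moved over the whole spectrum.

```python
def moving_window_intervals(n_wavelengths, width):
    """Return every contiguous interval of `width` points as (start, stop) pairs.

    `stop` is exclusive, so an interval covers indices start .. stop - 1.
    The one-point step between windows is an assumption of this sketch.
    """
    if width < 1 or width > n_wavelengths:
        raise ValueError("width must be between 1 and n_wavelengths")
    # Sliding one point at a time yields n_wavelengths - width + 1 overlapping intervals.
    return [(start, start + width) for start in range(n_wavelengths - width + 1)]


intervals = moving_window_intervals(n_wavelengths=10, width=4)
print(len(intervals))                # 7
print(intervals[0], intervals[-1])   # (0, 4) (6, 10)
```

Each interval would then be a candidate for ranking; the ranking itself (random frog coupled with PLS) is not reproduced here.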
-
-
ABSTRACT: The identification of disease-relevant genes represents a challenge in microarray-based disease diagnosis where the sample size is often limited. Among established methods, reversible jump Markov Chain Monte Carlo (RJMCMC) methods have proven to be quite promising for variable selection. However, the design and application of an RJMCMC algorithm requires, for example, special criteria for prior distributions. Also, simulation from the joint posterior distributions of models is computationally expensive and may even be mathematically intractable. These disadvantages may limit the applications of RJMCMC algorithms. Therefore, algorithms are needed that possess the advantages of RJMCMC methods and are also efficient and easy to follow for selecting disease-associated genes. Here we report an RJMCMC-like method, called random frog, that possesses the advantages of RJMCMC methods and is much easier to implement. Using the colon and the estrogen gene expression datasets, we show that random frog is effective in identifying discriminating genes. The top 2 ranked genes are Z50753 and U00968 for colon, and Y10871_at and Z22536_at for estrogen. (The source code, under GNU General Public License Version 2.0, is freely available to non-commercial users at: http://code.google.com/p/randomfrog/.)
Analytica chimica acta 08/2012; 740:20-6. · 4.31 Impact Factor
-
ABSTRACT: The essential oils from two varieties of Osmanthus fragrans have been extracted by steam distillation and analyzed by gas chromatography-mass spectrometry with the help of heuristic evolving latent projections (HELP), an effective chemometric resolution method. The overlapped peak clusters were resolved into pure chromatograms and pure mass spectra with HELP. The components were identified by comparison of temperature-programmed retention indices (PTRIs) and by similarity searches in the National Institute of Standards and Technology mass database. Quantitative results were obtained by overall integration of the peaks. The reliability of the qualitative and quantitative results was greatly improved by using HELP and PTRIs. The main components from O. fragrans Lour. var. thunbergii Mak. (TM) and O. fragrans Lour. var. aurantiacus Mak. (AM) were 1,2-epoxy linalool and nonanal, respectively. In total, 52 volatile components in the essential oil of TM and 45 in that of AM were analyzed qualitatively and quantitatively, accounting for 95.67% and 92.28% of the total contents of the essential oils, respectively.
Chromatographia 05/2012; 70(7):1163-1169. · 1.20 Impact Factor
-
ABSTRACT: An important application of metabolic profiles is to discover informative metabolites/biomarkers which are predictive of a clinical outcome under investigation. Therefore, there is a need to develop a statistically efficient method for screening such metabolites from the candidates. The most commonly used criterion to assess variable (metabolite) importance may be the P value obtained by performing a t test on each metabolite alone, without considering the influence of other variables. In this work, a new strategy, called subwindow permutation analysis (SPA) coupled with partial least squares linear discriminant analysis (PLSLDA), is developed for statistical assessment of variable importance. The main contribution of SPA is that, unlike the t test, it can output a conditional P value by implicitly taking into account the synergetic effect of all the other variables. In this sense, the conditional P value could to some extent help locate a good combination of informative variables. When applied to two metabolic datasets (type 2 diabetes mellitus data and childhood overweight data), it is shown that the performance of both the unsupervised principal component analysis (PCA) and the supervised PLSLDA is greatly improved when using the informative metabolites revealed by SPA. The source code for implementing SPA in both MATLAB and R (R package for both Linux and Windows) is freely available at: http://code.google.com/p/spa2010/downloads/list.
Keywords: Metabolic profile; Biomarker discovery; Variable selection; Model population analysis; Monte Carlo
Metabolomics 04/2012; 6(3):353-361. · 4.51 Impact Factor
-
ABSTRACT: In this paper a novel wavelength region selection algorithm, called elastic net grouping variable selection combined with partial least squares regression (EN-PLSR), is proposed for multi-component spectral data analysis. The EN-PLSR algorithm can automatically select successive strongly correlated prediction variable groups related to the response variable using two steps. First, a portion of the correlated predictors are selected and divided into subgroups by means of the grouping effect of elastic net estimation. Then, a recursive leave-one-group-out strategy is employed to further shrink the variable groups in terms of the root mean square error of cross-validation (RMSECV) criterion. The performance of the algorithm with real near-infrared (NIR) spectroscopic data sets shows that the EN-PLSR algorithm is competitive with full-spectrum PLS and moving window partial least squares (MWPLS) regression methods and it is suitable for use with strongly correlated spectroscopic data.
Applied Spectroscopy 04/2011; 65(4):402-8. · 1.66 Impact Factor
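For reference, the RMSECV criterion mentioned in the abstract above is commonly defined as below. This is the usual textbook form and is given only as an assumption about what the paper uses; the symbols $n$, $y_i$ and $\hat{y}_{(i)}$ are introduced here for illustration, with $\hat{y}_{(i)}$ denoting the prediction for sample $i$ from a model fitted without the cross-validation segment that contains it.

```latex
\mathrm{RMSECV} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_{(i)}\right)^{2}}
```

Under this definition, a variable group would be dropped in the leave-one-group-out step when removing it lowers the RMSECV.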
-
ABSTRACT: Good performance of ensemble approaches could generally be obtained when base classifiers are diverse and accurate. In the present study, feature importance sampling-based adaptive random forest (fisaRF) was proposed to obtain superior classification performance to the primal one-step random forest (RF). fisaRF takes a convenient, yet very effective, way called feature importance sampling (FIS), to select the more eligible feature subset at each splitting node instead of simple random sampling and thereby strengthen the accuracy of individual trees, without sacrificing diversity between them. Additionally, the iterative use of feature importance obtained by the previous step can adaptively capture the most significant features in data and effectively deal with multiple classification problems, not easily solved by other feature importance indexes. The proposed fisaRF was applied to classify three structure–activity relationship (SAR) data sets proposed by Xue et al. 1 together with disinfection by-products (DBPs) data, compared to the primal one-step RF induced by simple random sampling. The comparison revealed that fisaRF can effectively improve the classification accuracy and prediction confidence for each sample and thereby was considered as a very useful tool to screen the underlying lead compounds. Copyright © 2011 John Wiley & Sons, Ltd.
Journal of Chemometrics 03/2011; 25(4):201 - 207. · 1.95 Impact Factor
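The contrast between simple random sampling and importance-based sampling of features at a split node can be illustrated with a short sketch. It assumes that features are drawn without replacement with probability proportional to their current importance; the abstract does not spell out the exact sampling rule, and the function name `importance_sample` and the example weights are hypothetical.

```python
import random


def importance_sample(importances, k, rng):
    """Draw k distinct feature indices, each draw weighted by importance.

    Proportional weighting is an assumption of this sketch, not a rule
    taken from the fisaRF paper.
    """
    if k > len(importances):
        raise ValueError("k cannot exceed the number of features")
    remaining = list(range(len(importances)))
    # A tiny offset keeps zero-importance features drawable once the rest are used up.
    weights = [max(importances[i], 0.0) + 1e-12 for i in remaining]
    chosen = []
    for _ in range(k):
        # random.choices draws with replacement, so pick one position at a time
        # and remove it to keep the subset free of duplicates.
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen


rng = random.Random(0)
subset = importance_sample([0.50, 0.30, 0.15, 0.05, 0.0], k=3, rng=rng)
print(subset)
```

In an iterative scheme, the importances estimated by one forest would serve as the weights for the next, which is one way to read the "adaptive" part of the method.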
-
ABSTRACT: Data from high-throughput metabolomics experiments have become increasingly large and complex, which brings a number of challenges to existing statistical modeling. Thus there is a need to develop a statistically efficient approach for mining the underlying metabolite information contained in the metabolomics data under investigation. In this work, we provide a new strategy based on Monte Carlo cross validation coupled with the classification tree algorithm, which is termed the MCTree approach. The MCTree approach inherently provides a feasible way to uncover the predictive structure of metabolomics data by establishing many cross-predictive models. The sample proximity matrix thus obtained appears to give some interesting insights into metabolomics data. At the same time, informative metabolites or potential biomarkers can be discovered by means of variable importance ranking in the MCTree approach. Two real metabolomics datasets are used to demonstrate the performance of the proposed approach.
The Analyst 03/2011; 136(5):947-54. · 4.23 Impact Factor
-
ABSTRACT: Selecting a small subset of informative genes plays an important role in accurate prediction of clinical tumor samples. Based on model population analysis, a novel variable selection method, called noise incorporated subwindow permutation analysis (NISPA), is proposed in this study to work with support vector machines (SVMs). The essence of NISPA is that one noise variable is added into each sampled sub-dataset, so that the distribution of the variable importance of the added noise can be computed and serves as the common reference for evaluating the experimental variables. Further, by using the non-parametric Mann-Whitney U test, a P value can be assigned to each variable, describing to what extent the distributions of the gene variable and the noise variable differ. According to the computed P values, all the variables can be ranked and a small subset of informative variables can then be determined to build the model. Moreover, with NISPA, we are the first to classify the variables in more detail as informative, uninformative (noise) and interfering variables, in comparison with other methods. In this study, two microarray datasets are employed to evaluate the performance of NISPA. The results show that the prediction errors of SVM classifiers can be significantly reduced by variable selection using NISPA. It is concluded that NISPA is a good alternative variable selection algorithm.
The Analyst 02/2011; 136(7):1456-63. · 4.23 Impact Factor
-
ABSTRACT: Selecting a small number of informative genes for microarray-based tumor classification is central to cancer prediction and treatment. Based on model population analysis, here we present a new approach, called Margin Influence Analysis (MIA), designed to work with support vector machines (SVM) for selecting informative genes. The rationale for performing margin influence analysis lies in the fact that the margin of support vector machines is an important factor which underlies the generalization performance of SVM models. Briefly, MIA could reveal genes which have statistically significant influence on the margin by using Mann-Whitney U test. The reason for using the Mann-Whitney U test rather than two-sample t test is that Mann-Whitney U test is a nonparametric test method without any distribution-related assumptions and is also a robust method. Using two publicly available cancerous microarray data sets, it is demonstrated that MIA could typically select a small number of margin-influencing genes and further achieves comparable classification accuracy compared to those reported in the literature. The distinguished features and outstanding performance may make MIA a good alternative for gene selection of high dimensional microarray data. (The source code in MATLAB with GNU General Public License Version 2.0 is freely available at http://code.google.com/p/mia2009/).
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 02/2011; 8(6):1633-41. · 2.25 Impact Factor
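The Mann-Whitney U statistic that the abstract above relies on can be computed directly from its pairwise definition. The sketch below is generic: the function name `mann_whitney_u` and the two groups of margin values are made up for illustration, and the P-value step is left out (in practice a library routine such as `scipy.stats.mannwhitneyu` would supply it).

```python
def mann_whitney_u(x, y):
    """Return the U statistic of sample x versus sample y.

    U counts the pairs (xi, yj) with xi > yj; a tie counts as one half.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical margins of sub-models, split into two groups for one gene.
margins_a = [1.9, 2.1, 2.4, 2.2]
margins_b = [1.0, 1.2, 1.1, 2.0]
print(mann_whitney_u(margins_a, margins_b))  # 15.0 out of a possible 16
```

Because only the ordering of the values enters the statistic, no distributional assumption is needed, which is the property the abstract cites for preferring this test over the two-sample t test.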
-
ABSTRACT: In this paper, a two-step nonlinear classification algorithm is proposed to model the structure–activity relationship (SAR) between bioactivities and molecular descriptors of compounds, which consists of kernel principal component analysis (KPCA) and linear support vector machines (KPCA + LSVM). KPCA is used to remove some uninformative gradients such as noises and then exactly capture the latent structure of the training dataset using some new variables called the principal components in the kernel-defined feature space. LSVM makes full use of the maximal margin hyperplane to give the best generalization performance in the KPCA-transformed space. The combination of KPCA and LSVM can effectively improve the prediction performance compared with the linear SVM as well as two nonlinear methods. Three datasets related to different categorical bioactivities of compounds are used to evaluate the performance of KPCA + LSVM. The results show that our algorithm is competitive. Copyright © 2011 John Wiley & Sons, Ltd.
Journal of Chemometrics 01/2011; 25(2):92 - 99. · 1.95 Impact Factor
-
ABSTRACT: Aqueous solubility of drug compounds plays a very important role in drug research and development. In this study, we have collected 225 diverse druglike molecules with accurate aqueous solubility. Three commonly used methods, namely partial least squares (PLS), back-propagation network (BPN) and support vector regression (SVR), were employed to model the quantitative structure–property relationship (QSPR) for the aqueous solubility of 180 druglike compounds. Twenty-eight molecular descriptors were used to describe the drug aqueous solubility. In order to obtain a reliable and robust aqueous solubility prediction, a novel outlier detection method was employed to simultaneously detect all outliers in the established models. According to the Organization for Economic Co-operation and Development (OECD) principles, the QSPR models were checked by both internal and external statistical validation to ensure both reliability and predictive ability. The results indicate that the three models can provide good predictive ability for drug aqueous solubility. Furthermore, it was found that the predictive ability of SVR is superior to those of PLS and BPN, and that the 28 selected molecular descriptors could give a reliable and direct interpretation of the aqueous solubility. Copyright © 2010 John Wiley & Sons, Ltd.
Journal of Chemometrics 08/2010; 24(9):584 - 595. · 1.95 Impact Factor
-
ABSTRACT: To build a credible model for given chemical, biological or clinical data, it may be helpful to first gain better insight into the data itself before modeling, and then to present statistically stable results derived from a large number of sub-models established on a single dataset with the aid of Monte Carlo Sampling (MCS). In the present work, a concept called model population analysis (MPA) is developed. Briefly, MPA can be considered a general framework for developing new methods by statistically analyzing some parameters of interest (regression coefficients, prediction errors, etc.) of a number of sub-models. New methods are expected to be developed by making full use of the parameter of interest in a novel manner. In this work, the elements of MPA are first considered and described. Then, applications of MPA to variable selection and model assessment are emphasized. Copyright © 2010 John Wiley & Sons, Ltd.
Journal of Chemometrics 06/2010; 24(7‐8):418 - 423. · 1.95 Impact Factor
-
ABSTRACT: A crucial step in building a high-performance QSAR/QSPR model is the detection of outliers in the model. Detecting outliers in a multivariate point cloud is not trivial, especially when several outliers coexist in the model. The classical identification methods do not always identify them, because they are based on the sample mean and covariance matrix, which are themselves influenced by the outliers. Moreover, existing methods focus only on some types of outliers rather than all of them. To avoid these problems and detect all kinds of outliers simultaneously, we provide a new strategy based on Monte-Carlo cross-validation, termed the MC method. The MC method inherently provides a feasible way to detect different kinds of outliers by establishing many cross-predictive models. The distribution of predictive residuals thus obtained appears able to reduce the risk caused by the masking effect. In addition, a new display is proposed, in which the absolute values of the mean predictive residuals are plotted versus the standard deviations of the predictive residuals. The plot divides the data into normal samples, y direction outliers and X direction outliers. Several examples are used to demonstrate the detection ability of the MC method through comparison with different diagnostic methods.
Journal of Computational Chemistry 07/2009; 31(3):592-602. · 4.58 Impact Factor
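The residual statistics behind the display described in the abstract above can be sketched on toy data. Everything specific here is an assumption made for illustration: the univariate least-squares model, the 70% training fraction, the 500 Monte Carlo runs, the synthetic data, and the names `fit_line` and `mccv_residuals`. The paper's own models and settings are not reproduced.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def mccv_residuals(xs, ys, n_runs=500, train_fraction=0.7, seed=0):
    """Return {sample index: (|mean residual|, std of residuals)}."""
    rng = random.Random(seed)
    n = len(xs)
    n_train = int(train_fraction * n)
    residuals = {i: [] for i in range(n)}
    for _ in range(n_runs):
        train = rng.sample(range(n), n_train)
        in_train = set(train)
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Only left-out samples are predicted, so every residual is a prediction error.
        for i in range(n):
            if i not in in_train:
                residuals[i].append(ys[i] - (a + b * xs[i]))
    return {
        i: (abs(statistics.fmean(r)), statistics.pstdev(r))
        for i, r in residuals.items()
        if r
    }


# Toy data on an exact line, with sample 5 shifted in y to act as an outlier.
xs = [float(i) for i in range(12)]
ys = [1.0 + 2.0 * x for x in xs]
ys[5] += 10.0

stats = mccv_residuals(xs, ys)
suspect = max(stats, key=lambda i: stats[i][0])
print(suspect, stats[suspect])
```

Plotting the first value of each pair against the second would give a display of the kind the abstract describes; where the boundaries between normal samples and the two outlier types fall is not specified in the abstract and is not attempted here.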
-
TrAC Trends in Analytical Chemistry · 6.27 Impact Factor
-
Chemometrics and Intelligent Laboratory Systems 100(1):1-11. · 1.92 Impact Factor
-
Chemometrics and Intelligent Laboratory Systems 103(2):129-136. · 1.92 Impact Factor
-
ABSTRACT: This contribution introduces Elastic Component Regression (ECR) as an explorative data analysis method that utilizes a tuning parameter α ∈ [0,1] to supervise the X-matrix decomposition. It is demonstrated theoretically that the elastic component resulting from ECR coincides with principal components of PCA when α = 0 and also coincides with PLS components when α = 1. In this context, PCR and PLS occupy the two ends of ECR and α ∈ (0,1) will lead to an infinite number of transitional models which collectively uncover the model path from PCR to PLS. Therefore, the framework of ECR shows a natural progression from PCR to PLS and may help add some insight into their relationships in theory. The performance of ECR is investigated on a series of simulated datasets together with a real world near infrared dataset. (The source codes implementing ECR in MATLAB are freely available at http://code.google.com/p/ecr/.)
Chemometrics and Intelligent Laboratory Systems.