Determination of protein fold class from Raman or Raman optical activity spectra using random forests.
ABSTRACT Knowledge of the fold class of a protein is valuable because fold class gives an indication of protein function and evolution. Fold class can be accurately determined from a crystal structure or NMR structure, though these methods are expensive, time-consuming, and inapplicable to all proteins. In contrast, vibrational spectra [infra-red, Raman, or Raman optical activity (ROA)] are rapidly obtained for proteins under wide range of biological molecules under diverse experimental and physiological conditions. Here, we show that the fold class of a protein can be determined from Raman or ROA spectra by converting a spectrum into data of 10 cm(-1) bin widths and applying the random forest machine learning algorithm. Spectral data from 605 and 1785 cm(-1) were analyzed, as well as the amide I, II, and III regions in isolation and in combination. ROA amide II and III data gave the best performance, with 33 of 44 proteins assigned to one of the correct four top-level structural classification of proteins (SCOP) fold class (all α, all β, α and β, and disordered). The method also shows which spectral regions are most valuable in assigning fold class.
[show abstract] [hide abstract]
ABSTRACT: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.BMC Bioinformatics 02/2007; 8:25. · 2.75 Impact Factor