
Meta-Statistics for Variable Selection: The R Package BioMark

Abstract

Biomarker identification is an ever more important topic in the life sciences. With the advent of measurement methodologies based on microarrays and mass spectrometry, thousands of variables are routinely being measured on complex biological samples. Often, the question is what makes two groups of samples different. Classical hypothesis testing suffers from the multiple testing problem; however, correcting for this often leads to a lack of power. In addition, choosing α cutoff levels remains somewhat arbitrary. Also in a regression context, a model depending on few but relevant variables will be more accurate and precise, and easier to interpret biologically. We propose an R package, BioMark, implementing two meta-statistics for variable selection. The first, higher criticism, presents a data-dependent selection threshold for significance, instead of a cookbook value of α = 0.05. It is applicable in all cases where two groups are compared. The second, stability selection, is more general, and can also be applied in a regression context. This approach uses repeated subsampling of the data in order to assess the variability of the model coefficients and selects those that remain consistently important. It is shown using experimental spike-in data from the field of metabolomics that both approaches work well with real data. BioMark also contains functionality for simulating data with specific characteristics for algorithm development and testing.
... The R package BioMark [5,6] includes these popular variable selection methods: Student's t-test, Variable Importance in Projection (VIP) scores [7,8] from Partial Least Squares Regression (PLS-DA) models, Least Absolute Shrinkage and Selection Operator (LASSO) [9], and Elastic Net [10,11]. Each method has different strengths and weaknesses for identifying the significant variables often found in biological data, such as metabolomics data, and possibly for modelling them. ...
... Those variables appearing by chance in a few perturbations will not be a consistent indicator of class differences when the results are averaged, so are not selected. Thus, stability-based selection, like the jackknife [17] approach, perturbs the data to identify those variables that are consistently selected as group difference indicators [5,12,13] to improve prediction. ...
... Finally, the findings from this study provide new results for variable selection methods, especially for the stability-based variable selection approach, which has not previously been extensively evaluated for LASSO and Elastic Net [5,12]. The stability-based Student's t-test and VIP scores outperformed Elastic Net and LASSO in all parameter configurations except those in which the effect size and the number of variables were large. ...
Article
Variable selection is frequently carried out during the analysis of many types of high dimensional data, including metabolomics data. This study compared the predictive performance of four variable selection methods using stability-based selection, a new secondary selection method that is implemented in the R package BioMark. Two of these methods were evaluated using the more well-known False Discovery Rate (FDR) as well. Findings: Simulation studies varied factors relevant to biological data studies with results based on the median values of 200 partial area under the Receiver Operating Characteristic curves (pAUC). There was no single top performing method across all factor settings, but the Student t-test method based on stability selection or FDR adjustment and the Variable Importance in Projection (VIP) scores from partial least squares regression models obtained using a stability-based approach tended to perform well in most settings. Similar results were found with a real spiked-in metabolomics dataset. Group sample size, group effect size, number of significant variables and correlation structure were the most important factors whereas the percentage of significant variables was the least important. Conclusions: Researchers can improve prediction scores for their study data by choosing VIP scores based on stability variable selection over the other approaches when the number of features is small to modest and by increasing the number of samples even moderately. When the number of features is high and there is block correlation amongst the true biomarkers, the Student t-test with FDR adjustment performed best. The BioMark R package is an easy-to-use open-source program for variable selection that had excellent performance characteristics for the purposes of this study.
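The abstract above compares stability-based selection with FDR adjustment but does not say which FDR procedure was used. As a minimal sketch, assuming the widely used Benjamini-Hochberg step-up adjustment (an assumption, not something the abstract confirms), adjusted p-values could be computed as follows. The function name `bh_adjust` and the example p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: the study's "FDR adjustment" is taken to be BH here;
    the abstract does not name the procedure.
    """
    n = len(pvalues)
    # Indices of the p-values, smallest p-value first.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, e.g. from per-variable Student's t-tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = bh_adjust(pvals)
# Variables whose adjusted p-value is at or below a 5% FDR level.
selected = [i for i, q in enumerate(adj) if q <= 0.05]
print(adj, selected)
```

Note that the third and fourth variables have raw p-values below 0.05 but are not selected after adjustment, which illustrates the loss of power that motivates the alternative selection approaches discussed here.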
... We performed Stability Selection [22] (variable selection based on subsampling in combination with least absolute shrinkage and selection operator [LASSO] [23]) on Discovery Cohort data (96 PD, 45 NC). To rank candidate biomarkers, the R BioMark package was employed across 100,000 jackknifed iterations [24]. At each iteration, 30% of the proteins and 10% of the samples were left out of the bag, and LASSO was used to select features from the remaining data. ...
... Unsupervised clustering revealed collinearity among subsets of these proteins, suggesting redundancies and possible shared relationships among many candidate biomarkers (Fig 2A). We thus employed Stability Selection [22], a meta-statistical tool that identifies consistently important features by repeated subsampling of the data, in order to identify the most robust, stable, and sparse set of discriminatory proteins; we ranked candidate biomarkers using the LASSO method across 100,000 jackknifed iterations [23,24]. The top 10 proteins from the Discovery Cohort ranked by Stability Selection, shown in Fig 2B and 2C, were advanced for replication. ...
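The subsampling scheme described in these excerpts (leave out 30% of the variables and 10% of the samples at each iteration, then count how often each variable is selected) can be sketched in a few lines. This is an illustration under stated assumptions, not the BioMark implementation: a simple top-k ranking by absolute Welch t-statistic is swapped in for LASSO, far fewer iterations are used, and all names (`stability_selection`, `top_k`, `var_frac`, `samp_frac`) and the simulated data are hypothetical.

```python
import random
import statistics


def t_stat(a, b):
    """Welch t-statistic between two lists of values."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def stability_selection(X, y, n_iter=500, top_k=2, var_frac=0.7, samp_frac=0.9, seed=1):
    """Selection frequency per variable over repeated subsamples.

    Assumption: the base selector is "top_k variables by |t|", standing in
    for LASSO. The frequency is counted only over the iterations in which
    the variable was actually present in the subsample.
    """
    rng = random.Random(seed)
    p = len(X[0])
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    hits, seen = [0] * p, [0] * p
    for _ in range(n_iter):
        # Subsample within each group so both groups stay represented.
        s0 = rng.sample(idx0, max(2, round(samp_frac * len(idx0))))
        s1 = rng.sample(idx1, max(2, round(samp_frac * len(idx1))))
        cols = rng.sample(range(p), max(1, round(var_frac * p)))
        scores = {j: abs(t_stat([X[i][j] for i in s0], [X[i][j] for i in s1]))
                  for j in cols}
        top = sorted(cols, key=lambda j: scores[j], reverse=True)[:top_k]
        for j in cols:
            seen[j] += 1
        for j in top:
            hits[j] += 1
    return [hits[j] / seen[j] if seen[j] else 0.0 for j in range(p)]


# Simulated data: 10 samples per group, 10 variables, variables 0 and 1 shifted.
data_rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = [[data_rng.gauss(0, 1) + (5.0 if (y[i] == 1 and j < 2) else 0.0)
      for j in range(10)] for i in range(20)]

freq = stability_selection(X, y)
# Keep variables selected in at least 80% of the subsamples that contained them.
selected = [j for j, f in enumerate(freq) if f >= 0.8]
print([round(f, 2) for f in freq], selected)
```

Variables that rank highly only by chance in a few subsamples end up with low selection frequencies, so the consistency threshold (0.8 here, chosen arbitrarily) filters them out.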
Article
Background Parkinson’s disease (PD) is a progressive neurodegenerative disease affecting about 5 million people worldwide with no disease-modifying therapies. We sought blood-based biomarkers in order to provide molecular characterization of individuals with PD for diagnostic confirmation and prediction of progression. Methods and findings In 141 plasma samples (96 PD, 45 neurologically normal control [NC] individuals; 45.4% female, mean age 70.0 years) from a longitudinally followed Discovery Cohort based at the University of Pennsylvania (UPenn), we measured levels of 1,129 proteins using an aptamer-based platform. We modeled protein plasma concentration (log10 of relative fluorescence units [RFUs]) as the effect of treatment group (PD versus NC), age at plasma collection, sex, and the levodopa equivalent daily dose (LEDD), deriving first-pass candidate protein biomarkers based on p-value for PD versus NC. These candidate proteins were then ranked by Stability Selection. We confirmed findings from our Discovery Cohort in a Replication Cohort of 317 individuals (215 PD, 102 NC; 47.9% female, mean age 66.7 years) from the multisite, longitudinally followed National Institute of Neurological Disorders and Stroke Parkinson’s Disease Biomarker Program (PDBP) Cohort. Analytical approach in the Replication Cohort mirrored the approach in the Discovery Cohort: each protein plasma concentration (log10 of RFU) was modeled as the effect of group (PD versus NC), age at plasma collection, sex, clinical site, and batch. Of the top 10 proteins from the Discovery Cohort ranked by Stability Selection, four associations were replicated in the Replication Cohort. 
These blood-based biomarkers were bone sialoprotein (BSP, Discovery false discovery rate [FDR]-corrected p = 2.82 × 10⁻², Replication FDR-corrected p = 1.03 × 10⁻⁴), osteomodulin (OMD, Discovery FDR-corrected p = 2.14 × 10⁻², Replication FDR-corrected p = 9.14 × 10⁻⁵), aminoacylase-1 (ACY1, Discovery FDR-corrected p = 1.86 × 10⁻³, Replication FDR-corrected p = 2.18 × 10⁻²), and growth hormone receptor (GHR, Discovery FDR-corrected p = 3.49 × 10⁻⁴, Replication FDR-corrected p = 2.97 × 10⁻³). Measures of these proteins were not significantly affected by differences in sample handling, and they did not change comparing plasma samples from 10 PD participants sampled both on versus off dopaminergic medication. Plasma measures of OMD, ACY1, and GHR differed in PD versus NC but did not differ between individuals with amyotrophic lateral sclerosis (ALS, n = 59) versus NC. In the Discovery Cohort, individuals with baseline levels of GHR and ACY1 in the lowest tertile were more likely to progress to mild cognitive impairment (MCI) or dementia in Cox proportional hazards analyses adjusting for age, sex, and disease duration (hazard ratio [HR] 2.27 [95% CI 1.04–5.0, p = 0.04] for GHR, and HR 3.0 [95% CI 1.24–7.0, p = 0.014] for ACY1). GHR’s association with cognitive decline was confirmed in the Replication Cohort (HR 3.6 [95% CI 1.20–11.1, p = 0.02]). The main limitations of this study were its reliance on the aptamer-based platform for protein measurement and limited follow-up time available for some cohorts. Conclusions In this study, we found that the blood-based biomarkers BSP, OMD, ACY1, and GHR robustly associated with PD across multiple clinical sites. Our findings suggest that biomarkers based on a peripheral blood sample may be developed for both disease characterization and prediction of future disease progression in PD.
... The aqueous extract was the closest to the NO inhibitory activities, demonstrating that this extract might contain the pertinent bioactive constituents. Hence, plant metabolites with variable importance in the projection (VIP) values greater than 0.7, which made an influential contribution to the clustering in the PLS model, were selected [34]. ...
... In general, the closer the R2 values are to 1, the better the performance of a model in terms of its goodness of fit and the predictive quality of the regression model. To stipulate the Y-axis intercepts, the PLS biplot validation has to be further confirmed by a random permutation test with 100 permutations [34][35]. According to Eriksson et al. [27], for a model to be considered validated, the Y-axis intercepts should be within the limits of R2 < 0.3 and Q2 < 0.05, and the R2 line should be far from horizontal. ...
Article
The metabolomics approach successfully explained the possible neuroprotective effect of Clinacanthus nutans (Burm. f.) Lindau (CN) leaf extracts. Forty-four metabolites were putatively identified via proton Nuclear Magnetic Resonance (1H NMR and J-resolved NMR) metabolic profiling of CN leaf extracts in three types of solvents, namely water, 50% ethanol, and ethanol. Metabolite fingerprinting effectively differentiated the aqueous extract from the other two extracts. The variable importance in projection (VIP) showed that 30 metabolites were responsible for the discrimination of the extracts by component 1 in the Partial Least Squares (PLS) score plot. In lipopolysaccharide (LPS)-induced murine microglial BV2 cells, the PLS biplot showed the aqueous CN extract to be the extract most closely related to nitric oxide (NO) inhibitory activity, with an IC50 value of 336.2 ± 4.7 µg/mL in the Griess assay. The cytotoxicity assay also indicated that all CN extracts were non-toxic. Schaftoside, acetate, propionate, alanine, and clinacoside C were identified as the most promising potential biomarkers in the anti-inflammatory assay. Hence, the aqueous CN extract could be further investigated, particularly in relation to anti-neuroinflammation.
... The R package BioMark [21] was used to identify the significant biomarkers from metabolite concentrations obtained by rDolphin, BATMAN, and from the expert profiler. The BioMark package implements several classification methods, including principal component regression (PCR) and partial least squares-discriminant analysis (PLS-DA), as well as common selection methods, such as variable importance in projection (VIP) scores from PLS-DA models, Student's t-tests, and the LASSO [22][23][24][25]. ...
Article
Automated programs that carry out targeted metabolite identification and quantification using proton nuclear magnetic resonance spectra can overcome time and cost barriers that limit metabolomics use. However, their performance needs to be comparable to that of an experienced spectroscopist. A previously analyzed pediatric sepsis data set of serum samples was used to compare results generated by the automated programs rDolphin and BATMAN with the results obtained by manual profiling for 58 identified metabolites. Metabolites were selected using Student’s t-tests and evaluated with several performance metrics. The manual profiling results had the highest performance metrics values, especially for sensitivity (76.9%), area under the receiver operating characteristic curve (0.90), precision (62.5%), and testing accuracy based on a neural net (88.6%). All three approaches had high specificity values (77.7–86.7%). Manual profiling by an expert spectroscopist outperformed two open-source automated programs, indicating that further development is needed to achieve acceptable performance levels.
... Data set No. 27 was obtained from the RAST database [16] (www.mg-rast.org). Data sets without study ID were derived either from the original publications or from R packages within which they were distributed: BioMark [17], kodama [18], MixOmics [19], and pgmm [20]. Abbreviations: CD, Crohn's disease; CFS, Chronic fatigue syndrome; E, Estrogen; E+P, Estrogen + Progesterone; ES, Ewing sarcoma; IBD, Inflammatory bowel disease; MA, microarray; RMS, Rhabdomyosarcoma; UC, Ulcerative colitis. ...
Article
Metabolite differential connectivity analysis has been successful in investigating potential molecular mechanisms underlying different conditions in biological systems. Correlation and Mutual Information (MI) are two of the most common measures used to quantify association, to build metabolite-metabolite association networks, and to calculate differential connectivity. In this study, we investigated the performance of correlation and MI in identifying significantly differentially connected metabolites. These association measures were compared on (i) 23 publicly available metabolomic data sets and 7 data sets from other fields, (ii) simulated data with known correlation structures, and (iii) data generated using a dynamic metabolic model to simulate real-life observed metabolite concentration profiles. In all cases, we found more differentially connected metabolites when using correlation indices as a measure for association than when using MI. We also observed that different MI estimation algorithms differed in performance when applied to data generated using a dynamic model. We concluded that there is no significant benefit in using MI as a replacement for standard Pearson’s or Spearman’s correlation when the application is to quantify and detect differentially connected metabolites.
... Several strategies have been described for feature selection [174,175] (e.g., wrapper approaches such as Recursive Feature Elimination, Genetic Algorithms, or sparse models such as Lasso, Elastic Net, or sparse PLS). Such techniques are implemented in R packages, which also provide detailed comparisons on real datasets in terms of the stability and the size of the selected signature, the prediction performance of the final model, and the computation time [176][177][178][179]. ...
Article
Metabolomics aims to measure and characterise the complex composition of metabolites in a biological system. Metabolomics studies involve sophisticated analytical techniques such as mass spectrometry and nuclear magnetic resonance spectroscopy, and generate large amounts of high-dimensional and complex experimental data. Open source processing and analysis tools are of major interest in light of innovative, open and reproducible science. The scientific community has developed a wide range of open source software, providing freely available advanced processing and analysis approaches. The programming and statistics environment R has emerged as one of the most popular environments to process and analyse Metabolomics datasets. A major benefit of such an environment is the possibility of connecting different tools into more complex workflows. Combining reusable data processing R scripts with the experimental data thus allows for open, reproducible research. This review provides an extensive overview of existing packages in R for different steps in a typical computational metabolomics workflow, including data processing, biostatistics, metabolite annotation and identification, and biochemical network and pathway analysis. Multifunctional workflows, possible user interfaces and integration into workflow management systems are also reviewed. In total, this review summarises more than two hundred metabolomics specific packages primarily available on CRAN, Bioconductor and GitHub.
Article
Biomedical applications such as genome-wide association studies screen large databases with high-dimensional features to identify rare, weakly expressed, and important continuous-valued features for subsequent detailed analysis. We describe an exact, rapid Bayesian screening approach with attractive diagnostic properties using a Gaussian random mixture model focusing on the missed discovery rate (the probability of failing to identify potentially informative features) rather than the false discovery rate ordinarily used with multiple hypothesis testing. The method provides the likelihood that a feature merits further investigation, as well as distributions of the effect magnitudes and the proportion of features with the same expected responses under alternative conditions. Important features include the dependence of the critical values on clinical and regulatory priorities and direct assessment of the diagnostic properties.
Article
Untargeted metabolomics using liquid chromatography coupled to mass spectrometry (LC-MS) allows the detection of thousands of metabolites in biological samples. However, LC-MS data annotation is still considered a major bottleneck in the metabolomics pipeline since only a small fraction of the metabolites present in the sample can be annotated with the required confidence level. Here, we introduce mWISE (metabolomics wise inference of speck entities), an R package for context-based annotation of LC-MS data. The algorithm consists of three main steps aimed at (i) matching mass-to-charge ratio values to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, (ii) clustering and filtering the potential KEGG candidates, and (iii) building a final prioritized list using diffusion in graphs. The algorithm performance is evaluated with three publicly available studies using both positive and negative ionization modes. We have also compared mWISE to other available annotation algorithms in terms of their performance and computation time. In particular, we explored four different configurations for mWISE, and all four of them outperform xMSannotator (a state-of-the-art annotator) in terms of both performance and computation time. Using a diffusion configuration that combines the biological network obtained from the FELLA R package and raw scores, mWISE shows a sensitivity mean (standard deviation) across data sets of 0.63 (0.07), while xMSannotator achieves a sensitivity of 0.55 (0.19). We have also shown that the chemical structures of the compounds proposed by mWISE are closer to the original compounds than those proposed by xMSannotator. Finally, we explore the diffusion prioritization separately, showing its key role in the annotation process. mWISE is freely available on GitHub (https://github.com/b2slab/mWISE) under a GPL license.
Article
Psoriasis is an inflammatory disease of the epidermis based on an immunological mechanism involving Langerhans cells and T lymphocytes that produce pro-inflammatory cytokines. Genetic factors, environmental factors, and improper nutrition are considered triggers of the disease. Numerous studies have reported that in a high number of patients, psoriasis is associated with obesity. Excess adipose tissue, typical of obesity, causes a systemic inflammatory status coming from the inflammatory active adipose tissue; therefore, weight reduction is a strategy to fight this pro-inflammatory state. This study aimed to evaluate how a nutritional regimen based on a ketogenic diet influenced the clinical parameters, metabolic profile, and inflammatory state of psoriasis patients. To this end, 30 psoriasis patients were subjected to a ketogenic nutritional regimen and monitored for 4 weeks by evaluating the clinical data, biochemical and clinical parameters, NMR metabolomic profile, and IL-2, IL-1β, TNF-α, IFN-γ, and IL-4 concentrations before and after the nutritional regimen. Our data show that a low-calorie ketogenic diet can be considered a successful strategy and therapeutic option to gain an improvement in psoriasis-related dysmetabolism, with significant correction of the full metabolic and inflammatory status.
Article
The sports community's interest in probiotic supplementation as a way to promote exercise and training performance, together with good health, has increased in recent years. This also applies to horses, with promising results. Here, for the first time, we tested a probiotic mix of several strains of live bacteria typically employed for humans to improve the training performance of Standardbred horses in athletic activity. To evaluate its effects on horse performance, we measured lactate concentration in blood, a translational outcome widely employed for the purpose, combined with the study of hematological and biochemical parameters, together with urine from a metabolomics perspective. The results showed that the probiotic supplementation significantly reduced postexercise blood lactate concentration. The hematological and biochemical parameters, together with the urine molecular profile, suggested that a likely mechanism underlying this positive effect was connected to a switch of energy source in muscle from carbohydrates to short-chain fatty acids. Three sulfur-containing molecules found at different concentrations in urine in connection with probiotic administration suggested that this switch was linked to sulfur metabolism. NEW & NOTEWORTHY Probiotic supplementation could reduce postexercise blood lactate concentration in Standardbred horses in athletic activity. Blood parameters, together with the urine molecular profile, suggest that the mechanism underlying this positive effect is connected to a switch of energy source in muscle from carbohydrates to short-chain fatty acids. Sulfur-containing molecules found in urine in connection with probiotic administration suggested that this switch was linked to sulfur metabolism.
Article
Scatterplot3d is an R package for the visualization of multivariate data in a three dimensional space. R is a “language for data analysis and graphics”. In this paper we discuss the features of the package. It is designed by exclusively making use of already existing functions of R and its graphics system and thus shows the extensibility of the R graphics system. Additionally some examples on generated and real world data are provided, as well as the source code and the help page of scatterplot3d.
Article
Metabolite profiling in biomarker discovery, enzyme substrate assignment, drug activity/specificity determination, and basic metabolic research requires new data preprocessing approaches to correlate specific metabolites to their biological origin. Here we introduce an LC/MS-based data analysis approach, XCMS, which incorporates novel nonlinear retention time alignment, matched filtration, peak detection, and peak matching. Without using internal standards, the method dynamically identifies hundreds of endogenous metabolites for use as standards, calculating a nonlinear retention time correction profile for each sample. Following retention time correction, the relative metabolite ion intensities are directly compared to identify changes in specific endogenous metabolites, such as potential biomarkers. The software is demonstrated using data sets from a previously reported enzyme knockout study and a large-scale study of plasma samples. XCMS is freely available under an open-source license at http://metlin.scripps.edu/download/.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Article
The development and the validation of innovative approaches for biomarker selection are of paramount importance in many -omics technologies. Unfortunately, the actual testing of new methods on real data is difficult, because in real data sets, one can never be sure about the “true” biomarkers. In this paper, we present a publicly available metabolomic ultra performance liquid chromatography–mass spectrometry spike-in data set for apples. The data set consists of 10 control samples and three spiked sets of the same size, where naturally occurring compounds are added in different concentrations. In this sense, the data set can serve as a test bed to assess the performance of new algorithms and compare them with previously published results. We illustrate some of the possibilities provided by this spike-in data set by comparing the performance of two popular biomarker-selection methods, the univariate t-test and the multivariate variable importance in projection. To promote a widespread use of the data, raw data files as well as preprocessed peak lists are made available.
Article
Estimation of structure, such as in variable selection, graphical modelling or cluster analysis is notoriously difficult, especially for high-dimensional data. We introduce stability selection. It is based on subsampling in combination with (high-dimensional) selection algorithms. As such, the method is extremely general and has a very wide range of applicability. Stability selection provides finite sample control for some error rates of false discoveries and hence a transparent principle to choose a proper amount of regularisation for structure estimation. Variable selection and structure estimation improve markedly for a range of selection methods if stability selection is applied. We prove for randomised Lasso that stability selection will be variable selection consistent even if the necessary conditions needed for consistency of the original Lasso method are violated. We demonstrate stability selection for variable selection and Gaussian graphical modelling, using real and simulated data.
Article
An important prerequisite for the development and benchmarking of novel analysis methods is a well-designed comprehensive LC-MS/MS data set. Here, we present our data set consisting of 59 LC-MS/MS analyses of 50 protein samples extracted individually from Escherichia coli K12 and spiked with different concentrations of bovine carbonic anhydrase II and/or chicken ovalbumin, according to a 2 × 3 full factorial design. Using the well-annotated and commonly used E. coli proteome as the sample background ensures that the complexity of the data is on a par with most current proteomic analyses. Data were acquired over a 2-month period using multiple reversed-phase columns and instrument calibrations to include real-life challenges faced when analyzing large proteomics data sets. Moreover, so-called "ground truth" data, consisting of LC-MS/MS measurements of the pure spikes, are included in the data set. The current manuscript elaborates this comprehensive benchmark data set for future development and evaluation of analysis methods and software.
Article
Biomarker selection is an important topic in the omics sciences, where holistic measurement methods routinely generate results for many variables simultaneously. Very often, only a small fraction of these variables are really associated with the phenomena of interest. Selection and identification of these biomarkers is essential for obtaining an understanding of the complex biological processes under study. Finding biomarkers, however, is a difficult task. Even if a relative order can be established, e.g., on the basis of p values, it is usually hard to determine where to stop including candidates in the final set. Higher Criticism is an approach for finding data-dependent cutoff values when comparing two distinct groups of samples. Here, we extend its use to multivariate data, providing a principled approach to compromise between not selecting too many variables and catching as many true positives as possible. The results show a marked improvement in biomarker selection, compared to the standard settings available for some methods. Interestingly, HC thresholds can differ considerably from what has been suggested in literature before, again showing that it is not possible to use the same cutoff value for all data sets. The data-specific cutoff values provided by HC also open the way to more fair comparisons between biomarker selection methods, not biased by unlucky or suboptimal threshold choices.
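The idea of a data-dependent cutoff can be illustrated with a short sketch. It assumes one commonly cited form of the Higher Criticism statistic for sorted p-values p_(1) ≤ ... ≤ p_(N), namely HC(i) = sqrt(N) · (i/N − p_(i)) / sqrt((i/N)(1 − i/N)), maximised over the smallest fraction `alpha0` of the p-values. The abstract does not state which variant is used, so the formula, the function name `hc_threshold`, the default `alpha0 = 0.1`, and the example p-values are all assumptions made for illustration.

```python
import math


def hc_threshold(pvalues, alpha0=0.1):
    """Return (p-value cutoff, HC score) at the maximum of the HC statistic.

    Assumption: HC(i) = sqrt(N) * (i/N - p_(i)) / sqrt((i/N) * (1 - i/N)),
    searched over the alpha0 fraction of smallest p-values.
    """
    n = len(pvalues)
    if n < 2:
        raise ValueError("need at least two p-values")
    sorted_p = sorted(pvalues)
    # Search range: at least 1 candidate, never i == n (zero denominator).
    limit = min(max(1, int(alpha0 * n)), n - 1)
    best_i, best_hc = 1, -math.inf
    for i in range(1, limit + 1):
        frac = i / n
        hc = math.sqrt(n) * (frac - sorted_p[i - 1]) / math.sqrt(frac * (1 - frac))
        if hc > best_hc:
            best_i, best_hc = i, hc
    return sorted_p[best_i - 1], best_hc


# Hypothetical p-values: 5 very small ones plus 95 evenly spread "null" values.
pvals = [k * 1e-6 for k in range(1, 6)] + [k / 96 for k in range(1, 96)]
threshold, score = hc_threshold(pvals)
# Variables at or below the data-dependent cutoff are the selected candidates.
selected = [i for i, p in enumerate(pvals) if p <= threshold]
print(threshold, round(score, 3), len(selected))
```

In this example the cutoff falls where the observed p-values stop being much smaller than a uniform distribution would predict, so the number of selected variables is determined by the data and not by a fixed α.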