added a

**research item**Updates

0 new

1

Recommendations

0 new

0

Followers

0 new

6

Reads

0 new

120

Ensemble methods of machine learning combine neural networks or other machine learning models in order to improve predictive performance. The proposed ensemble method is based on Occam’s razor idealized as adjusting hyperprior distributions over models according to a Rényi entropy of the data distribution that corresponds to each model. The entropy-based method is used to average a logistic regression model, a random forest, and a deep neural network. As expected, the deep leaning machine more accurately recognizes handwritten digits than the other two models. The combination of the three models performs even better than the neural network when they are combined according to the entropy-based method or according to methods that average the log odds of the classification probabilities reported by the models. Which of the best ensemble methods to choose for other applications may depend on the loss function that quantifies prediction performance and on a robustness consideration.

The probability distributions that statistical methods use to represent uncertainty fail to capture all of the uncertainty that may be relevant to decision making. A simple way to adjust probability distributions for the uncertainty not represented in their models is to average the distributions with a uniform distribution or another distribution of maximum uncertainty. A decision-theoretic framework leads to averaging the distributions by taking the means of the logit transforms of the probabilities. That method does not prevent convergence to the truth, as does taking the means of the probabilities themselves. The mean-logit approach to moderating distributions is applied to natural language processing performed by a deep neural network.

Bayesian models use posterior predictive distributions to quantify the uncertainty of their predictions. Similarly, the point predictions of neural networks and other machine learning algorithms may be converted to predictive distributions by various bootstrap methods. The predictive performance of each algorithm can then be assessed by quantifying the performance of its predictive distribution. Previous methods for assessing such performance are relative, indicating whether certain algorithms perform better than others. This paper proposes performance measures that are absolute in the sense that they indicate whether or not an algorithm performs adequately without requiring comparisons to other algorithms. The first proposed performance measure is a predictive p value that generalizes a prior predictive p value with the prior distribution equal to the posterior distribution of previous data. The other proposed performance measures use the generalized predictive p value for each prediction to estimate the proportion of target values that are compatible with the predictive distribution. The new performance measures are illustrated by using them to evaluate the predictive performance of deep neural networks when applied to the analysis of a large housing price data set that is used as a standard in machine learning.

The probability distributions that statistical methods use to represent uncertainty fail to capture all of the uncertainty that may be relevant to decision making. A simple way to adjust probability distributions for the uncertainty not represented in their models is to average the distributions with a uniform distribution or another distribution of maximum uncertainty. A decision theoretic framework leads to averaging the distributions by taking the means of the logit transforms of the probabilities. That method does not prevent convergence to the truth, as does taking the means of the probabilities themselves. The mean-logit approach to moderating distributions is applied to natural language processing performed by a deep neural network.

Ensemble methods of machine learning combine neural networks or other machine learning models in order to improve predictive performance. The proposed ensemble method is based on Occam's razor idealized as adjusting hyperprior distributions over models according to a Rényi entropy of the data distribution that corresponds to each model. The entropy-based method is used to average a logistic regression model, a random forest, and a deep neural network. As expected, the deep leaning machine more accurately recognizes handwritten digits than the other two models. The combination of the three models performs even better than the neural network when they are combined according to the entropy-based method or according to methods that average the log odds of the classification probabilities reported by the models. Which of the best ensemble methods to choose for other applications may depend on the loss function that quantifies prediction performance and on a robustness consideration.

Bayesian models use posterior predictive distributions to quantify the uncertainty of their predictions. Similarly, the point predictions of neural networks and other machine learning algorithms may be converted to predictive distributions by various bootstrap methods. The predictive performance of each algorithm can then be assessed by quantifying the performance of its predictive distribution. Previous methods for assessing such performance are relative, indicating whether certain algorithms perform better than others. This paper proposes performance measures that are absolute in the sense that they indicate whether or not an algorithm performs adequately without requiring comparisons to other algorithms. The first proposed performance measure is a predictive p value that generalizes a prior predictive p value with the prior distribution equal to the posterior distribution of previous data. The other proposed performance measures use the generalized predictive p value for each prediction to estimate the proportion of target values that are compatible with the predictive distribution. The new performance measures are illustrated by using them to evaluate the predictive performance of deep neural networks when applied to the analysis of a large housing price data set that is used as a standard in machine learning.

14C-Palmitate oxidation rates for the three palmitate conditions tested. Complete (14CO2) and incomplete (14C-ASP) oxidation were determined. Data were re-plotted from Seifert et al., 2008.
(0.13 MB PPT)

Abstract Histogram-based empirical Bayes methods developed for analyzing data for large numbers of genes, SNPs, or other biological features tend to have large biases when applied to data with a smaller number of features such as genes with expression measured conventionally, proteins, and metabolites. To analyze such small-scale and medium-scale data in an empirical Bayes framework, we introduce corrections of maximum likelihood estimators (MLEs) of the local false discovery rate (LFDR). In this context, the MLE estimates the LFDR, which is a posterior probability of null hypothesis truth, by estimating the prior distribution. The corrections lie in excluding each feature when estimating one or more parameters on which the prior depends. In addition, we propose the expected LFDR (ELFDR) in order to propagate the uncertainty involved in estimating the prior. We also introduce an optimally weighted combination of the best of the corrected MLEs with a previous estimator that, being based on a binomial distribution, does not require a parametric model of the data distribution across features. An application of the new estimators and previous estimators to protein abundance data illustrates the extent to which different estimators lead to different conclusions about which proteins are affected by cancer.A simulation study was conducted to approximate the bias of the new estimators relative to previous LFDR estimators. Data were simulated for two different numbers of features (N), two different noncentrality parameter values or detectability levels (dalt), and several proportions of unaffected features (p0). One of these previous estimators is a histogram-based estimator (HBE) designed for a large number of features. The simulations show that some of the corrected MLEs and the ELFDR that corrects the HBE reduce the negative bias relative to the MLE and the HBE, respectively.For every method, we defined the worst-case performance as the maximum of the absolute value of the bias over the two different dalt and over various p0. The best worst-case methods represent the safest methods to be used under given conditions. This analysis indicates that the binomial-based method has the lowest worst-case absolute bias for high p0 and for N = 3, 12. However, the corrected MLE that is based on the minimum description length (MDL) principle is the best worst-case method when the value of p0 is more uncertain since it has one of the lowest worst-case biases over all possible values of p0 and for N = 3, 12. Therefore, the safest estimator considered is the binomial-based method when a high proportion of unaffected features can be assumed and the MDL-based method otherwise.A second simulation study was conducted with additional values of N. We found that HBE requires N to be at least 6-12 features to perform as well as the estimators proposed here, with the precise minimum N depending on p0 and dalt.

Background:
Sustained research on the problem of determining which genes are differentially expressed on the basis of microarray data has yielded a plethora of statistical algorithms, each justified by theory, simulation, or ad hoc validation and yet differing in practical results from equally justified algorithms. Recently, a concordance method that measures agreement among gene lists have been introduced to assess various aspects of differential gene expression detection. This method has the advantage of basing its assessment solely on the results of real data analyses, but as it requires examining gene lists of given sizes, it may be unstable.
Results:
Two methodologies for assessing predictive error are described: a cross-validation method and a posterior predictive method. As a nonparametric method of estimating prediction error from observed expression levels, cross validation provides an empirical approach to assessing algorithms for detecting differential gene expression that is fully justified for large numbers of biological replicates. Because it leverages the knowledge that only a small portion of genes are differentially expressed, the posterior predictive method is expected to provide more reliable estimates of algorithm performance, allaying concerns about limited biological replication. In practice, the posterior predictive method can assess when its approximations are valid and when they are inaccurate. Under conditions in which its approximations are valid, it corroborates the results of cross validation. Both comparison methodologies are applicable to both single-channel and dual-channel microarrays. For the data sets considered, estimating prediction error by cross validation demonstrates that empirical Bayes methods based on hierarchical models tend to outperform algorithms based on selecting genes by their fold changes or by non-hierarchical model-selection criteria. (The latter two approaches have comparable performance.) The posterior predictive assessment corroborates these findings.
Conclusions:
Algorithms for detecting differential gene expression may be compared by estimating each algorithm's error in predicting expression ratios, whether such ratios are defined across microarray channels or between two independent groups.According to two distinct estimators of prediction error, algorithms using hierarchical models outperform the other algorithms of the study. The fact that fold-change shrinkage performed as well as conventional model selection criteria calls for investigating algorithms that combine the strengths of significance testing and fold-change estimation.