# Leo Breiman's research while affiliated with University of California, Berkeley and other places

## Publications (50)

Article
The Π method for estimating an underlying smooth function of M variables, $(x_1, \ldots, x_M)$, using noisy data is based on approximating it by a sum of products of the form $\prod_m \phi_m(x_m)$. The problem is then reduced to estimating the univariate functions in the products. A convergent algorithm is described. The method keeps tight control on the degrees of...
Article
Tree ensembles are looked at in distribution space, that is, the limit case of "infinite" sample size. It is shown that the simplest kind of trees is complete in D-dimensional $L_2(P)$ space if the number of terminal nodes T is greater than D. For such trees we show that the AdaBoost algorithm gives an ensemble converging to the Bayes risk.
Article
In this paper we propose two ways to deal with the imbalanced data classification problem using random forest. One is based on cost sensitive learning, and the other is based on a sampling technique. Performance metrics such as precision and recall, false positive rate and false negative rate, F-measure and weighted accuracy are computed. Both meth...
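The metrics named in this abstract are all simple functions of the binary confusion matrix. A small sketch (the function name and toy labels are illustrative, not from the paper):

```python
def confusion_metrics(y_true, y_pred):
    """Performance metrics for a binary problem (1 = minority/positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # recall = 1 - false negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)                 # false positive rate
    return precision, recall, f_measure, fpr

# eight cases, three of them positive; one is missed, one is a false alarm
p, r, f, fpr = confusion_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                                 [1, 1, 0, 1, 0, 0, 0, 0])
```

On imbalanced data these quantities, unlike raw accuracy, expose how the minority class is treated, which is why the paper reports them.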
Conference Paper
Two-eyed algorithms are complex prediction algorithms that give accurate predictions and also give important insights into the structure of the data the algorithm is processing. The main example I discuss is RF/tools, a collection of algorithms for classification, regression and multiple dependent outputs. The last algorithm is a preliminary versio...
Article
Breiman (Machine Learning, 26(2), 123–140) showed that bagging could effectively reduce the variance of regression predictors, while leaving the bias relatively unchanged. A new form of bagging we call iterated bagging is effective in reducing both bias and variance. The procedure works in stages—the first stage is bagging. Based on the outcomes of...
Article
In this paper, we discuss an example in which we classify objects as quasars or non-quasars using the combined results of a radio survey and an optical survey. Such classification helps guide the choice of which objects to follow up with relatively expensive spectroscopic measurements.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of...
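A drastically simplified sketch of the randomization scheme described here: one-split stumps rather than full trees, and a single randomly drawn feature per tree rather than Breiman's actual Forest-RI procedure. All names and the toy data are illustrative assumptions:

```python
import random

random.seed(1)

def fit_stump(sample, f):
    # one-split "tree" on feature f: pick the threshold and side labels
    # minimizing misclassifications on the bootstrap sample
    best = None
    for t in sorted({x[f] for x, _ in sample}):
        for left_label in (0, 1):
            err = sum((y != left_label) if x[f] <= t else (y != 1 - left_label)
                      for x, y in sample)
            if best is None or err < best[0]:
                best = (err, t, left_label)
    _, t, left_label = best
    return lambda x: left_label if x[f] <= t else 1 - left_label

def random_forest(train, n_features, n_trees=25):
    # each tree depends on an i.i.d. random vector: an independent
    # bootstrap sample plus an independently drawn feature index
    n = len(train)
    trees = []
    for _ in range(n_trees):
        boot = [train[random.randrange(n)] for _ in range(n)]
        trees.append(fit_stump(boot, random.randrange(n_features)))
    return lambda x: round(sum(tree(x) for tree in trees) / n_trees)

# toy data in which both features carry the signal, so any random feature works
data = [((v, v), int(v > 0))
        for v in (-1.0, -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 1.0)]
forest = random_forest(data, 2)
```

Because the per-tree randomness is drawn independently with the same distribution, the majority vote stabilizes as the number of trees grows, which is the mechanism behind the a.s. convergence of the generalization error.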
Article
Introduction. In recent research on combining predictors, it has been recognized that the key to success in combining low-bias predictors such as trees and neural nets lies in methods that reduce the variability in the predictor due to training-set variability. Assume that the training set consists of N independent draws from the...
Article
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has l...
Article
Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using...
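For a numerical outcome the procedure can be sketched in a few lines (the 1-nearest-neighbour base learner, chosen because it is deliberately unstable, and the quadratic toy data are illustrative assumptions, not from the paper):

```python
import random

random.seed(0)

def fit_nearest(data):
    # base predictor: 1-nearest-neighbour regression on (x, y) pairs
    def predict(x):
        return min(data, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagged_predictor(train, n_boot=50):
    # one version of the predictor per bootstrap replicate of the learning set
    n = len(train)
    versions = [fit_nearest([train[random.randrange(n)] for _ in range(n)])
                for _ in range(n_boot)]
    # numerical outcome: the aggregation averages over the versions
    return lambda x: sum(v(x) for v in versions) / n_boot

train = [(i / 10, (i / 10) ** 2) for i in range(11)]  # y = x^2 on a grid
bag = bagged_predictor(train)
```

For classification the same construction applies with the average replaced by a plurality vote over the versions.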
Article
Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. To study this, the concepts of bias and variance of a classifier are defined. Unstable classifiers can have universally low bias. Their problem is high variance. Combining multiple versions is a variance red...
Article
The sizes of many databases have grown to the point where they cannot fit into the fast memory of even large-memory machines, to say nothing of current workstations. If what we want to do is to use these databases to construct predictions of various characteristics, then since the usual methods require that all data be held in fast memory, various...
Article
The theory behind the success of adaptive reweighting and combining algorithms (arcing) such as Adaboost (Freund & Schapire, 1996a, 1997) and others in reducing generalization error has not been well understood. By formulating prediction as a game where one player makes a selection from instances in the training set and the other a convex linear co...
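For reference, the reweighting step of Adaboost that the game-theoretic formulation analyzes looks like the following textbook sketch (not code from the paper; assumes the weighted error lies strictly between 0 and 1):

```python
import math

def arc_reweight(w, labels, preds):
    """One arcing/Adaboost round: compute the weighted error of the current
    classifier, then scale up the weights of the misclassified cases and
    renormalize. Assumes 0 < err < 1."""
    err = sum(wi for wi, y, p in zip(w, labels, preds) if y != p)
    alpha = 0.5 * math.log((1 - err) / err)
    new_w = [wi * math.exp(alpha if y != p else -alpha)
             for wi, y, p in zip(w, labels, preds)]
    total = sum(new_w)
    return [wi / total for wi in new_w], alpha

# four cases with uniform weights; the classifier misses only the last one
w2, alpha = arc_reweight([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
```

After the update the misclassified cases carry exactly half of the total weight, so the next classifier is forced to attend to them — the "selection from instances in the training set" side of the game.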
Article
Many databases have grown to the point where they cannot fit into the fast memory of even large memory machines, to say nothing of current workstations. If what we want to do is to use these data bases to construct predictions of various characteristics, then since the usual methods require that all data be held in fast memory, various work-arounds...
Article
Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. One of the more effective methods is bagging (Breiman [1996a]). Here, modified training sets are formed by resampling from the original training set, classifiers are constructed using these training sets, and then combin...
Article
Breiman (1996) showed that bagging could effectively reduce the variance of regression predictors, while leaving the bias unchanged. A new form of bagging we call adaptive bagging is effective in reducing both bias and variance. The procedure works in stages: the first stage is bagging. Based on the outcomes of the first stage, the output values ar...
Article
Introduction. Half&half bagging is a method for producing combinations of classifiers having low generalization error. The basic idea is straightforward and intuitive: suppose k classifiers have been constructed to date. Each classifier was constructed using some weighted subset of the original training set. To construct the next training set, rando...
Article
Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. One of the more effective methods is bagging. Here, modified training sets are formed by resampling from the original training set, classifiers are constructed using these training sets, and then combined by voting. Y. F...
Article
In World War II, there was a saying, “there are no atheists in foxholes.” The implication was that on the front lines and under pressure, soldiers needed someone to pray to. The implication in my title is that when big, real, tough problems need to be solved, there are no Bayesians. For decades, the pages of various statistical journals have been l...
Article
Recent work has shown that adaptively reweighting the training set, growing a classifier using the new weights, and combining the classifiers constructed to date can significantly decrease generalization error. Procedures of this type were called arcing by Breiman [1996]. The first successful arcing procedure was introduced by Freund and Schapire [19...
Article
We look at the problem of predicting several response variables from the same set of explanatory variables. The question is how to take advantage of correlations between the response variables to improve predictive accuracy compared with the usual procedure of doing individual regressions of each response variable on the common set of predictor var...
Article
In bagging, predictors are constructed using bootstrap samples from the training set and then aggregated to form a bagged predictor. Each bootstrap sample leaves out about 37% of the examples. These left-out examples can be used to form accurate estimates of important quantities. For instance, they can be used to give much improved estimates of nod...
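The 37% figure is the expected fraction $(1 - 1/n)^n \approx e^{-1} \approx 0.368$ of cases that a bootstrap sample of size n misses; a quick simulation confirms it (names are illustrative):

```python
import random

random.seed(0)

def left_out_fraction(n, n_boot):
    # fraction of the training set that a bootstrap sample of size n misses,
    # averaged over n_boot replicates; theory gives (1 - 1/n)^n -> 1/e
    fracs = []
    for _ in range(n_boot):
        drawn = {random.randrange(n) for _ in range(n)}
        fracs.append(1 - len(drawn) / n)
    return sum(fracs) / n_boot

frac = left_out_fraction(1000, 200)
```

It is these left-out ("out-of-bag") cases that serve as a built-in test set for each bootstrap predictor.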
Article
In model selection, usually a "best" predictor is chosen from a collection $\{\hat{\mu}(\cdot, s)\}$ of predictors where $\hat{\mu}(\cdot, s)$ is the minimum least-squares predictor in a collection $\mathsf{U}_s$ of predictors. Here s is a complexity parameter; that is, the smaller s, the lower dimensional/smoother the models in $\mathsf{U}_s$. If...
Article
Stacking regressions is a method for forming linear combinations of different predictors to give improved prediction accuracy. The idea is to use cross-validation data and least squares under non-negativity constraints to determine the coefficients in the combination. Its effectiveness is demonstrated in stacking regression trees of different sizes...
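A minimal sketch of the combining step: least squares under non-negativity constraints, solved here by projected coordinate descent (an assumed stand-in for whichever constrained solver the paper uses; the toy predictors are illustrative):

```python
def stack_weights(preds, y, sweeps=500):
    """Non-negativity-constrained least squares for the combining weights.
    preds[k] is the cross-validation prediction vector of the k-th regressor."""
    k, n = len(preds), len(y)
    w = [0.0] * k
    for _ in range(sweeps):
        for j in range(k):
            # partial residual leaving predictor j out
            r = [y[i] - sum(w[m] * preds[m][i] for m in range(k) if m != j)
                 for i in range(n)]
            num = sum(r[i] * preds[j][i] for i in range(n))
            den = sum(p * p for p in preds[j])
            w[j] = max(0.0, num / den)   # project onto w_j >= 0
    return w

# two predictors whose 0.7 / 0.3 mixture is exactly the target
p1 = [1.0, 2.0, 3.0, 4.0]
p2 = [4.0, 3.0, 2.0, 1.0]
y = [0.7 * a + 0.3 * b for a, b in zip(p1, p2)]
w = stack_weights([p1, p2], y)
```

The non-negativity constraint is what keeps the combination interpretable and prevents the wild cancellation that unconstrained least squares can produce among correlated predictors.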
Article
Various criteria have been proposed for deciding which split is best at a given node of a binary classification tree. Consider the question: given a goodness-of-split criterion and the class populations of the instances at a node, what distribution of the instances between the two child nodes maximizes the goodness-of-split criterion? The answer...
Article
Classification trees are attractive in that they present a simple and easily understandable structure. But on many data sets their accuracy is far from optimal. Much of this lack of accuracy is due to their instability: small changes in the data can lead to large changes in the resulting tree. This instability is the reason that combining many tree...
Article
A new method, called the nonnegative (nn) garrote, is proposed for doing subset regression. It both shrinks and zeroes coefficients. In tests on real and simulated data, it produces lower prediction error than ordinary subset selection. It is also compared to ridge regression. If the regression equations generated by a procedure do not change drast...
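In the special case of an orthonormal design the garrote shrinkage factors have a simple closed form, $c_k = (1 - \lambda/\hat{\beta}_k^2)_+$, which shows the "shrinks and zeroes" behaviour directly: small OLS coefficients are zeroed (subset selection) while large ones are shrunk only slightly. A sketch of that special case only (names are illustrative):

```python
def garrote_factors(beta_ols, lam):
    """Non-negative garrote shrinkage factors, orthonormal-design case:
    c_k = max(0, 1 - lam / beta_k**2)."""
    return [max(0.0, 1.0 - lam / b ** 2) if b else 0.0 for b in beta_ols]

def garrote_coefficients(beta_ols, lam):
    # the garrote estimate rescales each OLS coefficient by its factor
    return [c * b for c, b in zip(garrote_factors(beta_ols, lam), beta_ols)]

# a large, a small, and a moderate OLS coefficient
c = garrote_factors([3.0, 0.5, -2.0], 1.0)
b = garrote_coefficients([3.0, 0.5, -2.0], 1.0)
```

For a general design the factors are found by constrained least squares rather than this closed form, but the qualitative behaviour is the same.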
Article
The question of whether to adjust the 1990 census using a capture-recapture model has been hotly argued in statistical journals and courtrooms. Most of the arguments to date concern methodological issues rather than data quality. Following the Post Enumeration Survey, which was designed to provide the basic data for adjustment, the Census Bureau ca...
Article
Archetypal analysis represents each individual in a data set as a mixture of individuals of pure type or archetypes. The archetypes themselves are restricted to being mixtures of the individuals in the data set. Archetypes are selected by minimizing the squared error in representing each individual as a mixture of archetypes. The usefulness of a...
Article
Quantitative competition immunoassays with appropriate combinations of antibodies give consistent dose‐response patterns which may be used to identify and estimate amounts of cross‐reacting compounds. Previously reported methods of analyzing cross‐reaction patterns include multiple regression, principal components analysis and minimum estimates of...
Article
A hinge function y = h(x) consists of two hyperplanes continuously joined together at a hinge. In regression (prediction), classification (pattern recognition), and noiseless function approximation, use of sums of hinge functions gives a powerful and efficient alternative to neural networks with computation times several orders of magnitude less...
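Concretely, a hinge function is the max (or min) of two affine functions, joined continuously along the hyperplane where they agree; with coefficient vectors (1, 0) and (−1, 0) acting on (x, 1), for example, the max-hinge is |x|. A toy sketch (the intercept is handled by appending a constant 1 to the input):

```python
def hinge(x, a, b, use_max=True):
    """A hinge function: two affine hyperplanes a.x and b.x joined along the
    set where they are equal. Sums of such functions are the building blocks
    of the approximation."""
    ha = sum(ai * xi for ai, xi in zip(a, x))
    hb = sum(bi * xi for bi, xi in zip(b, x))
    return max(ha, hb) if use_max else min(ha, hb)
```

Each hinge costs two inner products to evaluate, which is the source of the large speed advantage over a trained network of comparable flexibility.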
Article
A method is given for fitting additive models to multivariate regression type data. The emphasis is on diagnostics analogous to the concept of influence and on alternative models. These are illustrated in three case studies. Simulations show excellent predictive accuracy as compared with existing methods.
Article
We present an algorithm for finding the global maximum of a multimodal, multivariate function for which derivatives are available. The algorithm assumes a bound on the second derivatives of the function and uses this to construct an upper envelope. Successive function evaluations lower this envelope until the value of the global maximum is known to...
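The key device is the quadratic upper bound that each evaluation yields when $|f''| \le K$: $f(x) \le f(x_0) + f'(x_0)(x - x_0) + \tfrac{K}{2}(x - x_0)^2$, and the envelope is the pointwise minimum of these bounds over all evaluation points. A sketch of the bound alone, omitting the algorithm's search bookkeeping (names are illustrative):

```python
import math

def envelope_bound(x, evals, K):
    """Upper envelope at x from the evaluations made so far. Each entry of
    evals is (x0, f(x0), f'(x0)); K bounds |f''| everywhere."""
    return min(fx + dfx * (x - x0) + 0.5 * K * (x - x0) ** 2
               for x0, fx, dfx in evals)

# f = sin, so |f''| <= 1; two evaluations so far, at 0 and 2
evals = [(0.0, math.sin(0.0), math.cos(0.0)),
         (2.0, math.sin(2.0), math.cos(2.0))]
```

Because the envelope dominates f everywhere and is tight at each evaluation point, successive evaluations can only lower it, squeezing it down onto the global maximum.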
Article
Four univariate smoothing techniques are compared in an extensive simulation. The methods compared are smoothing splines, kernel smooths, a running linear smoother with an adaptable window size, and regression splines. These four were selected because (a) they have been reported on in the statistical literature, and (b) because they automatically a...
Article
When a regression problem contains many predictor variables, it is rarely wise to try to fit the data by means of a least squares regression on all of the predictor variables. Usually, a regression equation based on a few variables will be more accurate and certainly simpler. There are various methods for picking “good” subsets of variables, and pr...
Article
The Π method for estimating an underlying smooth function of M variables, $(x_1, \ldots, x_M)$, using noisy data is based on approximating it by a sum of products of the form $\prod_m \phi_m(x_m)$. The problem is then reduced to estimating the univariate functions in the products. A convergent algorithm is described. The method keeps tight control on the degree...
Article
Four related methods are discussed for obtaining robust confidence bounds for extreme upper quantiles of the unknown distribution of a positive random variable. These methods are designed to work when the upper tail of the distribution is neither too heavy nor too light in comparison to the exponential distribution. An extensive simulation study is...
Article
The general objectives of this report are to provide a summary of the state of the art in discriminant analysis and clustering and to identify key research and unsolved problems that need to be addressed in these two areas. It was prepared under the auspices of the Committee on Applied and Theoretical Statistics of the Board on Mathematical Science...
Article
The increasing use of computers in statistics has spawned a new generation of multivariate statistical techniques. Chief among these is a tree-structured approach to classification and regression analysis. The CART, or Classification and Regression Trees, program implements a recursive partitioning procedure based on an iterative search for best bi...
Article
In regression analysis the response variable Y and the predictor variables $X_1, \ldots, X_p$ are often replaced by functions $\theta(Y)$ and $\phi_1(X_1), \ldots, \phi_p(X_p)$. We discuss a procedure for estimating those functions $\theta$ and $\phi_1, \ldots, \phi_p$ that minimize $e = E\{[\theta(Y) - \sum_j \phi_j(X_j)]^2\}/\operatorname{var}[\theta(Y)]$, given only a sample $\{(y_k, x_{k1}, \ldots, x_{kp}),\ 1 \le k \le N\}$ and making minimal assumptions conc...
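With θ fixed at the identity, the inner loop of this procedure is the familiar backfitting/Gauss–Seidel iteration: alternately smooth the partial residuals against each predictor. A toy sketch for two predictors (the moving-average smoother, bandwidth, and test function are all illustrative assumptions; the full ACE algorithm also transforms Y):

```python
def smooth(xs, r, bw=0.3):
    # crude moving-average smoother: at each x, average the partial
    # residuals of all points within bw of it
    out = []
    for xi in xs:
        vals = [rj for xj, rj in zip(xs, r) if abs(xj - xi) <= bw]
        out.append(sum(vals) / len(vals))
    return out

def backfit(x1, x2, y, sweeps=20):
    """Backfitting sketch for an additive fit y ~ f1(x1) + f2(x2):
    alternately smooth the partial residuals against each predictor."""
    n = len(y)
    f1 = [0.0] * n
    f2 = [0.0] * n
    for _ in range(sweeps):
        f1 = smooth(x1, [y[i] - f2[i] for i in range(n)])
        f2 = smooth(x2, [y[i] - f1[i] for i in range(n)])
    return f1, f2

# y = x1^2 + x2 on a grid, with x2 roughly decorrelated from x1
n = 40
x1 = [i / (n - 1) for i in range(n)]
x2 = [((7 * i) % n) / (n - 1) for i in range(n)]
y = [a ** 2 + b for a, b in zip(x1, x2)]
f1, f2 = backfit(x1, x2, y)
rss = sum((yi - a - b) ** 2 for yi, a, b in zip(y, f1, f2))
```

Each sweep solves one coordinate block of the least-squares problem exactly, which is why the iteration is an instance of Gauss–Seidel.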
Chapter
The automatic classification of objects from catalogues or other sources of data is a common statistical problem in many astronomical surveys. We describe an effective method, Random Forests, in which votes for class membership are polled from a large random ensemble of tree classifiers. This procedure is illustrated by the problem of identifying q...
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The error of a forest of tree classifier...
Article
A prediction algorithm is consistent if given a large enough sample of instances from the underlying distribution, it can achieve nearly optimal generalization accuracy. In practice, the training set is finite and does not give an adequate representation of the underlying distribution. Our work is based on a simple method for generating additional...

## Citations

... 1A") were built independently on a proper training set (based on LLNA data). Instead of trying to choose a specific method, these methods were combined by the stacking methodology of Wolpert (1992) and Breiman (1996) to obtain a specific stacking meta-model for each tier. ...
... Aside from the dataset partitioning, the landslide inventory is partitioned probabilistically but without replacement, relying on a ratio known as the bag fraction (Youssef et al. 2016). In the BRT modeling, the divisor variable is initially utilized to split the target variable. The data is then divided into two groups in order to find the optimal divisor variable (Breiman 1998). The physical environment of the ith grid cell of the region is produced using , since the combination of the superimposed layers can reveal the relative landslide susceptibility of the grid cells. The causal variables define the location, and S is a bivariate parameter derived from the superimposed layers that equals 0 in stable areas and 1 in unstable areas. ...
... DTs are closely related to interpretability: not only are they often regarded as a particularly interpretable model (Freitas 2014), but also interpretability itself is regarded as the "biggest single advantage of the tree-structured approach" (Breiman and Friedman 1988). ...
... package (v1.3.3) in R (R Core Team 2015). The population connectivity was estimated by the ability of the otolith chemical signature of the otolith core and edge to discriminate among spawning grounds according to Random forest (RF) classification (Breiman, 2001). The main advantage of this method is that it makes no assumptions on variable distributions or on linear relations between variables (Mercier et al., 2011; Stekhoven and Bühlmann, 2012) and can accommodate continuous as well as ordinal or categorical variables. ...
... olding the others fixed. This was an application of the Gauss–Seidel algorithm of numerical linear algebra. A simpler version, taking θ as the identity, is the familiar "backfitting" algorithm [Hastie and Tibshirani (1986), Buja, Hastie and Tibshirani (1989)]. ACE was the first in a series of papers Breiman wrote on smoothing and additive models. Breiman and Peters (1992) compared four scatterplot smoothers using an extensive simulation. Building on the spline models used in Breiman and Peters (1992), Breiman's Π method [Breiman (1991)], with the colorful acronym "PIMPLE," fit additive models of products of (univariate) cubic splines. Hinging hyperplanes [Breiman (1993b)] fit an additive function of hy ...
... Beyond forecasting within the limitations I discussed earlier, having individual parameter estimates opens up the possibility of reviving an ancient custom of conjoint practitioners, which was the application of clustering techniques to preference estimates to obtain insights into preference regularities and segments in the sample. A related technique is archetypal analysis (Cutler and Breiman 1994), which identifies archetypes (or pure types) that form the basis for characterizing all other decision makers. From a market management and product development perspective, this is rich knowledge indeed, since the clusters and archetypes may form a strong basis for firm decision making and action. ...
... The correlation between individual trees and their robustness determines the generalization error (Selvathi and Emala, 2016). To improve classification accuracy, a random forest model employs a mix of trees, with each tree (algorithm) grouped into the most popular class (Breiman, 1999). Trees in forests grow until they reach an acceptable predictive model. ...
... It clearly shows that both models are able to predict at the higher frequency compared to the lower frequency of PM10 concentration level. The variable influence or relative variable importance (RVI) of the decision tree ensembles was based on the decision tree influences [27] and was then proposed in 2001 [18]. The decision tree influences were then implemented in the gbm package. ...
... Over fitting is a potential problem when the number of predictor variables relative to the number of subjects in the study (i.e., sample size) is large (Hosmer and Lemeshow, 1989). For better performance, a small number of predictor variables relative to the sample size should be used in model development (Anonymous, 1989). ...
... The Π model (Breiman 1991) also uses a stepwise procedure for selecting a linear combination of products of univariate spline functions to be included in the metamodel. For all of these regression spline methods, the authors assume that the set of data values {(x i , y i )} to be fit are given. ...