# Leo Breiman's research while affiliated with University of California, Berkeley and other places

**What is this page?**

This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.

It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.

If you're a ResearchGate member, you can follow this page to keep up with this author's work.

If you are this author, and you don't want us to display this page anymore, please let us know.

## Publications (50)

The Π method for estimating an underlying smooth function of M variables, $(x_1, \ldots, x_M)$, using noisy data is based on approximating it by a sum of products of the form $\prod_m \phi_m(x_m)$. The problem is then reduced to estimating the univariate functions in the products. A convergent algorithm is described. The method keeps tight control on the degrees of...

Tree ensembles are looked at in distribution space, that is, the limit case of "infinite" sample size. It is shown that the simplest kind of trees is complete in D-dimensional $L_2(P)$ space if the number of terminal nodes T is greater than D. For such trees we show that the AdaBoost algorithm gives an ensemble converging to the Bayes risk.

In this paper we propose two ways to deal with the imbalanced data classification problem using random forest. One is based on cost sensitive learning, and the other is based on a sampling technique. Performance metrics such as precision and recall, false positive rate and false negative rate, F-measure and weighted accuracy are computed. Both meth...

Two-eyed algorithms are complex prediction algorithms that give accurate predictions and also give important insights into the structure of the data the algorithm is processing. The main example I discuss is RF/tools, a collection of algorithms for classification, regression and multiple dependent outputs. The last algorithm is a preliminary versio...

Breiman (Machine Learning, 26(2), 123–140) showed that bagging could effectively reduce the variance of regression predictors, while leaving the bias relatively unchanged. A new form of bagging we call iterated bagging is effective in reducing both bias and variance. The procedure works in stages—the first stage is bagging. Based on the outcomes of...

In this paper, we discuss an example in which we classify objects as quasars or non-quasars using the combined results of a radio survey and an optical survey. Such classification helps guide the choice of which objects to follow up with relatively expensive spectroscopic measurements.

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of...

Introduction. In recent research in combining predictors, it has been recognized that the critical thing to success in combining low-bias predictors such as trees and neural nets has been through methods that reduce the variability in the predictor due to training set variability. Assume that the training set consists of N independent draws from the...

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has l...

Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using...
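The procedure in this abstract can be illustrated with a minimal sketch. It assumes a one-dimensional regression problem and uses a 1-nearest-neighbour rule as a stand-in for an unstable base predictor; the names `fit_nearest_neighbor` and `bagged_predictor`, the number of replicates, and the data are all hypothetical and are not taken from the paper. Only the averaging case for numerical outcomes is shown; the plurality vote for classes is omitted.

```python
import random
from statistics import mean


def fit_nearest_neighbor(sample):
    """Fit a 1-nearest-neighbour regressor (an illustrative, unstable base predictor)."""
    def predict(x):
        # Return the response of the training point closest to x.
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def bagged_predictor(learning_set, fit, n_replicates=50, seed=0):
    """Aggregate predictors fitted to bootstrap replicates of the learning set."""
    rng = random.Random(seed)
    n = len(learning_set)
    versions = []
    for _ in range(n_replicates):
        # A bootstrap replicate: n draws with replacement from the learning set.
        replicate = [learning_set[rng.randrange(n)] for _ in range(n)]
        versions.append(fit(replicate))

    def predict(x):
        # Numerical outcome: average the predictions of all versions.
        return mean(version(x) for version in versions)
    return predict


# Hypothetical noisy observations of y = x^2 on [0, 1].
noise = random.Random(1)
data = [(i / 20, (i / 20) ** 2 + noise.gauss(0, 0.1)) for i in range(21)]
bagged = bagged_predictor(data, fit_nearest_neighbor)
print(bagged(0.5))
```

Passing the fitting routine as an argument keeps the aggregation step independent of the base predictor, so the same wrapper could be applied to a tree or any other learner.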

Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. To study this, the concepts of bias and variance of a classifier are defined. Unstable classifiers can have universally low bias. Their problem is high variance. Combining multiple versions is a variance red...

The size of many data bases has grown to the point where they cannot fit into the fast memory of even large memory machines, to say nothing of current workstations. If what we want to do is to use these data bases to construct predictions of various characteristics, then since the usual methods require that all data be held in fast memory, various...

The theory behind the success of adaptive reweighting and combining algorithms (arcing) such as Adaboost (Freund & Schapire, 1996a, 1997) and others in reducing generalization error has not been well understood. By formulating prediction as a game where one player makes a selection from instances in the training set and the other a convex linear co...

Many databases have grown to the point where they cannot fit into the fast memory of even large memory machines, to say nothing of current workstations. If what we want to do is to use these data bases to construct predictions of various characteristics, then since the usual methods require that all data be held in fast memory, various work-arounds...

Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. One of the more effective is bagging (Breiman [1996a]). Here, modified training sets are formed by resampling from the original training set, classifiers constructed using these training sets and then combin...

Breiman (1996) showed that bagging could effectively reduce the variance of regression predictors, while leaving the bias unchanged. A new form of bagging we call adaptive bagging is effective in reducing both bias and variance. The procedure works in stages: the first stage is bagging. Based on the outcomes of the first stage, the output values ar...

Introduction. Half&half bagging is a method for producing combinations of classifiers having low generalization error. The basic idea is straightforward and intuitive: suppose k classifiers have been constructed to date. Each classifier was constructed using some weighted subset of the original training set. To construct the next training set, rando...

Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. One of the more effective is bagging. Here, modified training sets are formed by resampling from the original training set, classifiers constructed using these training sets and then combined by voting. Y. F...

In World War II, there was a saying, “there are no atheists in foxholes.” The implication was that on the front lines and under pressure, soldiers needed someone to pray to. The implication in my title is that when big, real, tough problems need to be solved, there are no Bayesians. For decades, the pages of various statistical journals have been l...

Recent work has shown that adaptively reweighting the training set, growing a classifier using the new weights, and combining the classifiers constructed to date can significantly decrease generalization error. Procedures of this type were called arcing by Breiman[1996]. The first successful arcing procedure was introduced by Freund and Schapire[19...

We look at the problem of predicting several response variables from the same set of explanatory variables. The question is how to take advantage of correlations between the response variables to improve predictive accuracy compared with the usual procedure of doing individual regressions of each response variable on the common set of predictor var...

In bagging, predictors are constructed using bootstrap samples from the training set and then aggregated to form a bagged predictor. Each bootstrap sample leaves out about 37% of the examples. These left-out examples can be used to form accurate estimates of important quantities. For instance, they can be used to give much improved estimates of nod...
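The "about 37%" figure follows from a short calculation: a given example is missed by all N draws of a bootstrap sample with probability (1 - 1/N)^N, which tends to 1/e ≈ 0.368 as N grows. The sketch below checks this numerically and by simulation; the function names are illustrative and not taken from the paper.

```python
import math
import random


def prob_left_out(n):
    """Probability that a given example is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def out_of_bag_indices(n, rng):
    """Indices never drawn when sampling n items with replacement from range(n)."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return [i for i in range(n) if i not in drawn]


n = 10_000
oob = out_of_bag_indices(n, random.Random(0))
print(prob_left_out(n), math.exp(-1), len(oob) / n)
```

The indices returned by `out_of_bag_indices` are the left-out examples that the abstract proposes to use for estimation, since the predictor built on that bootstrap sample never saw them.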

In model selection, usually a "best" predictor is chosen from a collection $\{\hat{\mu}(\cdot, s)\}$ of predictors where $\hat{\mu}(\cdot, s)$ is the minimum least-squares predictor in a collection $\mathsf{U}_s$ of predictors. Here s is a complexity parameter; that is, the smaller s, the lower dimensional/smoother the models in $\mathsf{U}_s$. If...

Stacking regressions is a method for forming linear combinations of different predictors to give improved prediction accuracy. The idea is to use cross-validation data and least squares under non negativity constraints to determine the coefficients in the combination. Its effectiveness is demonstrated in stacking regression trees of different sizes...
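The coefficient-fitting step described here, least squares under non-negativity constraints, can be sketched as follows. This is a minimal illustration only: it assumes the cross-validated predictions of each component predictor are already available as columns, and it uses a simple coordinate-descent solver chosen for brevity, not the algorithm used in the paper. The function name and the toy numbers are hypothetical.

```python
def nonnegative_least_squares(columns, y, n_sweeps=200):
    """Minimise ||y - sum_k a_k * columns[k]||^2 subject to a_k >= 0.

    Coordinate descent: each coefficient in turn is set to its best value
    with the others held fixed, then clipped at zero.
    """
    coeffs = [0.0] * len(columns)
    residual = list(y)  # Equals y - sum_k coeffs[k] * columns[k].
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            norm = sum(c * c for c in col)
            if norm == 0.0:
                continue  # An all-zero column cannot reduce the error.
            step = sum(r * c for r, c in zip(residual, col)) / norm
            new_value = max(0.0, coeffs[j] + step)
            delta = new_value - coeffs[j]
            residual = [r - delta * c for r, c in zip(residual, col)]
            coeffs[j] = new_value
    return coeffs


# Hypothetical cross-validated predictions from two predictors, and the targets.
predictions_a = [1.0, 2.0, 3.0]
predictions_b = [1.0, 1.0, 1.0]
targets = [1.0, 1.5, 2.0]
weights = nonnegative_least_squares([predictions_a, predictions_b], targets)
print(weights)
```

The stacked predictor is then the weighted sum of the component predictors using these coefficients.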

Various criteria have been proposed for deciding which split is best at a given node of a binary classification tree. Consider the question: given a goodness-of-split criterion and the class populations of the instances at a node, what distribution of the instances between the two children nodes maximizes the goodness-of-split criterion? The answer...

Classification trees are attractive in that they present a simple and easily understandable structure. But on many data sets their accuracy is far from optimal. Much of this lack of accuracy is due to their instability--small changes in the data can lead to large changes in the resulting tree. This instability is the reason that combining many tree...

A new method, called the nonnegative (nn) garrote, is proposed for doing subset regression. It both shrinks and zeroes coefficients. In tests on real and simulated data, it produces lower prediction error than ordinary subset selection. It is also compared to ridge regression. If the regression equations generated by a procedure do not change drast...

The question of whether to adjust the 1990 census using a capture-recapture model has been hotly argued in statistical journals and courtrooms. Most of the arguments to date concern methodological issues rather than data quality. Following the Post Enumeration Survey, which was designed to provide the basic data for adjustment, the Census Bureau ca...

Archetypal analysis represents each individual in a data set as a mixture of individuals of pure type or archetypes. The archetypes themselves are restricted to being mixtures of the individuals in the data set. Archetypes are selected by minimizing the squared error in representing each individual as a mixture of archetypes. The usefulness of a...

Archetypal analysis represents each individual in a data set as a mixture of individuals of pure type or archetypes. The archetypes themselves are restricted to being mixtures of the individuals in the data set. Archetypes are selected by minimizing the squared error in representing each individual as a mixture of archetypes. The usefulness of arch...

Quantitative competition immunoassays with appropriate combinations of antibodies give consistent dose‐response patterns which may be used to identify and estimate amounts of cross‐reacting compounds. Previously reported methods of analyzing cross‐reaction patterns include multiple regression, principal components analysis and minimum estimates of...

A hinge function $y = h(x)$ consists of two hyperplanes continuously joined together at a hinge. In regression (prediction), classification (pattern recognition), and noiseless function approximation, use of sums of hinge functions gives a powerful and efficient alternative to neural networks with computation times several orders of magnitude less...
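Two hyperplanes joined continuously along their intersection can be written as the pointwise maximum or minimum of the two. The sketch below illustrates that form only; the helper names `hyperplane` and `hinge` and the example coefficients are assumptions for illustration, and the fitting procedure from the paper is not shown.

```python
def hyperplane(coeffs, intercept=0.0):
    """Return f(x) = intercept + sum_i coeffs[i] * x[i]."""
    def f(x):
        return intercept + sum(c * xi for c, xi in zip(coeffs, x))
    return f


def hinge(plane_a, plane_b, kind="max"):
    """Join two hyperplanes continuously by taking their pointwise max or min."""
    join = max if kind == "max" else min

    def h(x):
        return join(plane_a(x), plane_b(x))
    return h


# In one dimension, max(x, -x) is the absolute value, hinged at x = 0.
h = hinge(hyperplane([1.0]), hyperplane([-1.0]))
print(h([2.0]), h([-3.0]), h([0.0]))
```

A sum of several such functions gives a continuous piecewise-linear approximation, which is the role the abstract assigns to sums of hinge functions.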

A method is given for fitting additive models to multivariate regression type data. The emphasis is on diagnostics analogous to the concept of influence and on alternative models. These are illustrated in three case studies. Simulations show excellent predictive accuracy as compared with existing methods.

We present an algorithm for finding the global maximum of a multimodal, multivariate function for which derivatives are available. The algorithm assumes a bound on the second derivatives of the function and uses this to construct an upper envelope. Successive function evaluations lower this envelope until the value of the global maximum is known to...

Four univariate smoothing techniques are compared in an extensive simulation. The methods compared are smoothing splines, kernel smooths, a running linear smoother with an adaptable window size, and regression splines. These four were selected because (a) they have been reported on in the statistical literature, and (b) because they automatically a...

When a regression problem contains many predictor variables, it is rarely wise to try to fit the data by means of a least squares regression on all of the predictor variables. Usually, a regression equation based on a few variables will be more accurate and certainly simpler. There are various methods for picking “good” subsets of variables, and pr...

The Π method for estimating an underlying smooth function of M variables, $(x_1, \ldots, x_M)$, using noisy data is based on approximating it by a sum of products of the form $\prod_m \phi_m(x_m)$. The problem is then reduced to estimating the univariate functions in the products. A convergent algorithm is described. The method keeps tight control on the degree...

Four related methods are discussed for obtaining robust confidence bounds for extreme upper quantiles of the unknown distribution of a positive random variable. These methods are designed to work when the upper tail of the distribution is neither too heavy nor too light in comparison to the exponential distribution. An extensive simulated study is...

The general objectives of this report are to provide a summary of the state of the art in discriminant analysis and clustering and to identify key research and unsolved problems that need to be addressed in these two areas. It was prepared under the auspices of the Committee on Applied and Theoretical Statistics of the Board on Mathematical Science...

The increasing use of computers in statistics has spawned a new generation of multivariate statistical techniques. Chief among these is a tree-structured approach to classification and regression analysis. The CART, or Classification and Regression Trees, program implements a recursive partitioning procedure based on an iterative search for best bi...

In regression analysis the response variable $Y$ and the predictor variables $X_1, \ldots, X_p$ are often replaced by functions $\theta(Y)$ and $\phi_1(X_1), \ldots, \phi_p(X_p)$. We discuss a procedure for estimating those functions $\theta$ and $\phi_1, \ldots, \phi_p$ that minimize $e^2 = E\{[\theta(Y) - \sum_{j=1}^{p} \phi_j(X_j)]^2\}/\operatorname{var}[\theta(Y)]$, given only a sample $\{(y_k, x_{k1}, \ldots, x_{kp}),\ 1 \le k \le N\}$ and making minimal assumptions conc...

The automatic classification of objects from catalogues or other sources of data is a common statistical problem in many astronomical surveys. We describe an effective method, Random Forests, in which votes for class membership are polled from a large random ensemble of tree classifiers. This procedure is illustrated by the problem of identifying q...

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The error of a forest of tree classifier...

A prediction algorithm is consistent if given a large enough sample of instances from the underlying distribution, it can achieve nearly optimal generalization accuracy. In practice, the training set is finite and does not give an adequate representation of the underlying distribution. Our work is based on a simple method for generating additional...

## Citations

... 1A") were built independently on a proper training set (based on LLNA data). Instead of trying to choose a specific method, these methods were combined by the stacking methodology of Wolpert (1992) and Breiman (1996) to obtain a specific stacking meta-model for each tier. ...

... Aside from the dataset partitioning, the landslide inventory is partitioned probabilistically but without replacement relying on a ratio known as the bag fraction (Youssef et al. 2016). In the BRT modeling, the divisor variable is initially utilized to split the target variable. The data is then divided into two groups in order to find the optimal divisor variable (Breiman 1998). The physical environment of the ith grid cell of the region is produced using , since the combination of the superimposed layers can reveal the relative landslide susceptibility of the grid cells. The causal variables define the location, and S is a bivariate parameter derived from the superimposed layers that equals 0 in stable areas and 1 in unstable areas. ...

... DTs are closely related to interpretability: not only are they often regarded as a particularly interpretable model (Freitas 2014), but also interpretability itself is regarded as the "biggest single advantage of the tree-structured approach" (Breiman and Friedman 1988). ...

... package (v1.3.3) in R (R Core Team 2015). The population connectivity was estimated by the ability of the otolith chemical signature of the otolith core and edge to discriminate among spawning grounds according to Random forest (RF) classification (Breiman, 2001). The main advantage of this method is that it makes no assumptions on variable distributions or on linear relations between variables (Mercier et al., 2011; Stekhoven and Bühlmann, 2012) and can accommodate continuous as well as ordinal or categorical variables. ...

... olding the others fixed. This was an application of the Gauss–Seidel algorithm of numerical linear algebra. A simpler version, taking θ as the identity, is the familiar "backfitting" algorithm [Hastie and Tibshirani (1986), Buja, Hastie and Tibshirani (1989)]. ACE was the first in a series of papers Breiman wrote on smoothing and additive models. Breiman and Peters (1992) compared four scatterplot smoothers using an extensive simulation. Building on the spline models used in Breiman and Peters (1992), Breiman's Π method [Breiman (1991)], with the colorful acronym "PIMPLE," fit additive models of products of (univariate) cubic splines. Hinging hyperplanes [Breiman (1993b)] fit an additive function of hy ...

Reference: Remembering Leo Breiman

... Beyond forecasting within the limitations I discussed earlier, having individual parameter estimates opens up the possibility of reviving an ancient custom of conjoint practitioners, which was the application of clustering techniques to preference estimates to obtain insights into preference regularities and segments in the sample. A related technique is archetypal analysis (Cutler and Breiman 1994), which identifies archetypes (or pure types) that form the basis for characterizing all other decision makers. From a market management and product development perspective, this is rich knowledge indeed, since the clusters and archetypes may form a strong basis for firm decision making and action. ...

... The correlation between individual trees and their robustness determines the generalization error (Selvathi and Emala, 2016). To improve classification accuracy, a random forest model employs a mix of trees, with each tree (algorithm) grouped into the most popular class (Breiman, 1999). Trees in forests grow until they reach an acceptable predictive model. ...

... It is clearly showing both models are able to predict at higher frequency compared to the lower frequency of PM10 concentration level. The variable influence or relative variable importance (RVI) of the decision tree ensembles was based on the decision tree influences [27] and was then proposed at 2001 [18]. The decision tree influences were then implemented in the gbm package. ...

... Over fitting is a potential problem when the number of predictor variables relative to the number of subjects in the study (i.e., sample size) is large (Hosmer and Lemeshow, 1989). For better performance, a small number of predictor variables relative to the sample size should be used in model development (Anonymous, 1989). ...

... The Π model (Breiman 1991) also uses a stepwise procedure for selecting a linear combination of products of univariate spline functions to be included in the metamodel. For all of these regression spline methods, the authors assume that the set of data values $\{(x_i, y_i)\}$ to be fit are given. ...

Reference: Simulation Metamodels.