# Jerome H. Friedman

Stanford University · Department of Statistics

## About

246

Publications

150,426

Reads

210,282

Citations

## Publications

Publications (246)

The estimation of nested functions (i.e. functions of functions) is one of the central reasons for the success and popularity of machine learning. Today, artificial neural networks are the predominant class of algorithms in this area, known as representational learning. Here, we introduce Representational Gradient Boosting (RGB), a nonparametric al...

Many regression and classification procedures fit a parameterized function $f(x;w)$ of predictor variables $x$ to data $\{x_{i},y_{i}\}_1^N$ based on some loss criterion $L(y,f)$. Often, regularization is applied to improve accuracy by placing a constraint $P(w)\leq t$ on the values of the parameters $w$. Although efficient methods exist for findin...

We propose a new method for supervised learning, the “principal components lasso” (“pcLasso”). It combines the lasso (ℓ1) penalty with a quadratic penalty that shrinks the coefficient vector toward the feature matrix's leading principal components (PCs). pcLasso can be especially powerful if the features are preassigned to groups. In that case,...

Professor Efron has presented us with a thought‐provoking paper on the relationship between prediction, estimation, and attribution in the modern era of data science. While we appreciate many of his arguments, we see more of a continuum between the old and new methodology, and the opportunity for both to improve through their synergy.

Significance
Often machine learning methods are applied and results reported in cases where there is little to no information concerning accuracy of the output. Simply because a computer program returns a result does not ensure its validity. If decisions are to be made based on such results it is important to have some notion of their veracity. Con...

Machine learning is proving invaluable across disciplines. However, its success is often limited by the quality and quantity of available data, while its adoption is limited by the level of trust afforded by given models. Human vs. machine performance is commonly compared empirically to decide whether a certain task should be performed by a compute...

The goal of regression analysis is to predict the value of a numeric outcome variable y given a vector of joint values of other (predictor) variables x. Usually a particular x-vector does not specify a repeatable value for y, but rather a probability distribution of possible y-values, p(y|x). This distribution has a location, scale and shape, all...

Often machine learning methods are applied and results reported in cases where there is little to no information concerning accuracy of the output. Simply because a computer program returns a result does not ensure its validity. If decisions are to be made based on such results it is important to have some notion of their veracity. Contrast trees r...

Significance
As machine learning applications expand to high-stakes areas such as criminal justice, finance, and medicine, legitimate concerns emerge about high-impact effects of individual mispredictions on people’s lives. As a result, there has been increasing interest in understanding general machine learning models to overcome possible serious...

Machine learning is proving invaluable across disciplines. However, its success is often limited by the quality and quantity of available data, while its adoption is limited by the level of trust that models afford users. Human vs. machine performance is commonly compared empirically to decide whether a certain task should be performed by a computer or an expe...

We propose a new method for supervised learning, especially suited to wide data where the number of features is much greater than the number of observations. The method combines the lasso ($\ell_1$) sparsity penalty with a quadratic penalty that shrinks the coefficient vector toward the leading principal components of the feature matrix. We call th...

We propose a generalization of the lasso that allows the model coefficients to vary as a function of a set of pre-specified modifying variables. These modifiers might be variables such as gender, age or time. The paradigm is quite general, with each lasso coefficient modified by a sparse linear function of the modifying variables $Z$....

The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classificat...

`rCOSA` is a software package interfaced to the R language. It implements statistical techniques for clustering objects on subsets of attributes in multivariate data. The main output of COSA is a dissimilarity matrix that one can subsequently analyze with a variety of proximity analysis methods. Our package extends the original COSA software...

Two-sample tests for multivariate data and especially for non-Euclidean data are not well explored. This paper presents a novel test statistic based on a similarity graph constructed on the pooled observations from the two samples. It can be applied to multivariate data and non-Euclidean data as long as a dissimilarity measure on the sample space c...

A new data science tool named wavelet-based gradient boosting is proposed and tested. The approach is a special case of componentwise linear least squares gradient boosting, and involves wavelet functions of the original predictors. Wavelet-based gradient boosting takes advantage of the approximate \(\ell _1\) penalization induced by gradient boosti...

In this paper we propose a blockwise descent algorithm for group-penalized multiresponse regression. Using a quasi-Newton framework we extend this to group-penalized multinomial regression. We give a publicly available implementation for these in R, and compare the speed of this algorithm to a competing algorithm -- we show that our implementation...

Variance estimation in the linear model when $p > n$ is a difficult problem. Standard least squares estimation techniques do not apply. Several variance estimators have been proposed in the literature, all with accompanying asymptotic results proving consistency and asymptotic normality under a variety of assumptions. It is found, however, that mos...

For high-dimensional supervised learning problems, often using problem-specific assumptions can lead to greater accuracy. For problems with grouped covariates, which are believed to have sparse effects both on a group and within group level, we introduce a regularized model for linear regression with l1 and l2 penalties. We discuss the sparsity and...

Many present day applications of statistical learning involve large numbers of predictor variables. Often, that number is much larger than the number of cases or observations available for training the learning algorithm. In such situations, traditional methods fail. Recently, new techniques have been developed, based on regularization, which can o...

I thank the discussants for their comments. Space limitations do not allow me to comment on all of the interesting points they raised, with most of which I agree. All equation, figure and section numbers refer to the main paper.

We consider the graphical lasso formulation for estimating a Gaussian graphical model in the high-dimensional setting. This approach entails estimating the inverse covariance matrix under a multivariate normal model by maximizing the ℓ 1 -penalized log-likelihood. We present a very simple necessary and sufficient condition that can be used to ident...
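As a concrete sketch of the formulation described here (with $S$ the sample covariance matrix, $\Theta = \Sigma^{-1}$ the inverse covariance, and the elementwise $\ell_1$ norm, one standard convention), the graphical lasso estimate maximizes the penalized log-likelihood

$$
\hat{\Theta} \;=\; \underset{\Theta \succ 0}{\arg\max}\;\; \log\det\Theta \;-\; \operatorname{tr}(S\Theta) \;-\; \lambda \lVert\Theta\rVert_1 .
$$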

We introduce a pathwise algorithm for the Cox proportional hazards model, regularized by convex combinations of l1 and l2 penalties (elastic net). Our algorithm fits via cyclical coordinate descent, and employs warm starts to find a solution along a regularization path. We demonstrate the efficacy of our algorithm on real and simulated data sets, a...
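The elastic net penalty used here blends the lasso and ridge penalties; in one common parameterization (a sketch of the standard form, with $\alpha \in [0,1]$ controlling the mix), the term added to the negative log partial likelihood of the Cox model is

$$
\lambda \sum_{j=1}^{p}\Bigl[\alpha\,\lvert\beta_j\rvert \;+\; \tfrac{1}{2}(1-\alpha)\,\beta_j^{2}\Bigr],
$$

which reduces to the lasso at $\alpha = 1$ and to ridge regression at $\alpha = 0$.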

Leo Breiman was a unique character. There will not be another like him. I consider it one of my great fortunes in life to have known and worked with him. Along with John Tukey, Leo had the greatest influence on shaping my approach to statistical problems. I did some of my best work collaborating with Leo, but more importantly, we both had great fun...

We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui et al (2010) propose "SAFE" rules that guarantee that a coefficient will be zero in the solution, based on the inner products of each predictor with the outcome. In this paper we propose strong rules that are not foolproof b...
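The sequential strong rule can be written down in a few lines. This is an illustrative pure-Python sketch (the function name, data layout, and the assumption of standardized predictors are mine, not the paper's): at a new path value λ < λ_prev, predictor j is discarded when |x_j'r| < 2λ − λ_prev, where r is the residual at the previous solution.

```python
# Illustrative sketch of the sequential strong rule for lasso screening.
# Assumes standardized predictors; all names here are hypothetical.

def strong_rule_keep(X, y, beta_prev, lam, lam_prev):
    """Indices of predictors surviving the strong rule at penalty lam.

    Predictor j is discarded when |x_j' r| < 2*lam - lam_prev,
    where r = y - X beta_prev is the residual at the previous
    solution on the path (lam < lam_prev).
    """
    n, p = len(X), len(X[0])
    # residual at the previous path point
    r = [y[i] - sum(X[i][j] * beta_prev[j] for j in range(p)) for i in range(n)]
    keep = []
    for j in range(p):
        score = abs(sum(X[i][j] * r[i] for i in range(n)))
        if score >= 2 * lam - lam_prev:
            keep.append(j)
    return keep

# Two orthogonal predictors; only the first is strongly correlated with y.
print(strong_rule_keep([[1, 0], [0, 1]], [3.0, 0.1], [0.0, 0.0], 2.0, 3.0))  # prints [0]
```

Because the rule is not foolproof, as the abstract notes, implementations check the KKT conditions on the discarded predictors after fitting and restore any violators.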

We propose several methods for estimating edge-sparse and node-sparse graphical models based on lasso and grouped lasso penalties. We develop efficient algorithms for fitting these models when the numbers of nodes and potential edges are large. We compare them to competing methods including the graphical lasso and SPACE (Peng, Wang, Zhou & Zhu 200...

We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include $\ell_1$ (the lasso), $\ell_2$ (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coor...

We consider the group lasso penalty for the linear model. We note that the standard algorithm for solving the problem assumes that the model matrices in each group are orthonormal. Here we consider a more general penalty that blends the lasso (L1) with the group lasso ("two-norm"). This penalty yields solutions that are sparse at both the group and...
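In one common notation (a sketch; the group weights vary across papers), the blended penalty for groups $g = 1,\dots,G$ is

$$
\lambda_1 \sum_{g=1}^{G} \lVert \beta_g \rVert_2 \;+\; \lambda_2 \lVert \beta \rVert_1 ,
$$

where the un-squared two-norm zeroes out whole groups and the $\ell_1$ term induces sparsity within the surviving groups.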

We address the problem of sparse selection in linear models. A number of non-convex penalties have been proposed for this purpose, along with a variety of convex-relaxation algorithms for finding good solutions. In this paper we pursue the coordinate-descent approach for optimization, and study its convergence properties. We characterize the proper...

In this chapter we revisit the classification problem and focus on linear methods for classification. Since our predictor G(x) takes values in a discrete set G, we can always divide the input space into a collection of regions labeled according to the classification. We saw in Chapter 2 that the boundaries of these regions can be rough or smooth, de...

The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance
is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the
quality of the ultimately chosen model.

For most of this book, the fitting (learning) of models has been achieved by minimizing a sum of squares for regression, or
by minimizing cross-entropy for classification. In fact, both of these minimizations are instances of the maximum likelihood
approach to fitting.
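The connection asserted here can be made concrete. Assuming the Gaussian regression model $Y = f(X) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$ and independent observations, the negative log-likelihood is

$$
-\log L \;=\; \frac{N}{2}\,\log(2\pi\sigma^2) \;+\; \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^{2},
$$

so for fixed $\sigma^2$, maximizing the likelihood is precisely minimizing the residual sum of squares; the multinomial likelihood yields the cross-entropy criterion in the same way.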

We have already made use of models linear in the input features, both for regression and classification. Linear regression,
linear discriminant analysis, logistic regression and separating hyperplanes all rely on a linear model. It is extremely unlikely
that the true function f(X) is actually linear in X. In regression problems, f(X) = E(Y |X) will...

In this chapter we discuss some simple and essentially model-free methods for classification and pattern recognition. Because
they are highly unstructured, they typically are not useful for understanding the nature of the relationship between the features
and class outcome. However, as black box prediction engines, they can be very effective, and a...

In this chapter we discuss prediction problems in which the number of features p is much larger than the number of observations N, often written p ≫ N. Such problems have become of increasing importance, especially in genomics and other areas of computational biology. We will see that high variance and overfitting are a major concern in this settin...

The previous chapters have been concerned with predicting the values of one or more outputs or response variables $Y = (Y_1, \ldots, Y_m)$ for a given set of input or predictor variables $X^T = (X_1, \ldots, X_p)$. Denote by $x_i^T = (x_{i1}, \ldots, x_{ip})$ the inputs for the ith training case, and let $y_i$ be a response measurement.

The first three examples described in Chapter 1 have several components in common. For each there is a set of variables that
might be denoted as inputs, which are measured or preset. These have some influence on one or more outputs. For each example the goal is to use the inputs to predict the values of the outputs. This exercise is called supervis...

In this chapter we describe generalizations of linear decision boundaries for classification. Optimal separating hyperplanes are introduced in Chapter 4 for the case when two classes are linearly separable. Here we cover extensions to the nonseparable case, where the classes overlap. These techniques are then generalized to what is known as the sup...

A linear regression model assumes that the regression function E(Y |X) is linear in the inputs X1, ..., Xp. Linear models were largely developed in the precomputer age of statistics, but even in today’s computer era there are still
good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how...

Boosting is one of the most powerful learning ideas introduced in the last twenty years. It was originally designed for classification
problems, but as will be seen in this chapter, it can profitably be extended to regression as well. The motivation for boosting
was a procedure that combines the outputs of many “weak” classifiers to produce a power...

Bagging or bootstrap aggregation (section 8.7) is a technique for reducing the variance of an estimated prediction function. Bagging seems to work especially well for high-variance, low-bias procedures, such as trees. For regression, we simply fit the same regression tree many times to bootstrap-sampled versions of the training data, and average the...
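The recipe in this passage (fit the same procedure to bootstrap samples and average) can be sketched in a few lines. A one-split decision stump stands in for the regression tree, and all names are illustrative:

```python
import random

# Illustrative sketch of bagging (bootstrap aggregation) for regression,
# using a decision stump as the high-variance base learner.

def fit_stump(x, y):
    """Fit the best single split on 1-D inputs by squared error."""
    best = None
    for s in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((yi - ml) ** 2 for yi in left) + sum((yi - mr) ** 2 for yi in right)
        if best is None or err < best[0]:
            best = (err, s, ml, mr)
    if best is None:  # degenerate sample: fall back to a constant predictor
        m = sum(y) / len(y)
        return lambda q: m
    _, s, ml, mr = best
    return lambda q: ml if q <= s else mr

def bagged_predict(x, y, query, n_boot=25, seed=0):
    """Average the predictions of stumps fit to bootstrap resamples."""
    rng = random.Random(seed)
    n = len(x)
    preds = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stump = fit_stump([x[i] for i in idx], [y[i] for i in idx])
        preds.append(stump(query))
    return sum(preds) / n_boot
```

Averaging leaves the (low) bias of each stump roughly unchanged while shrinking the variance contributed by any single bootstrap fit, which is the mechanism the passage describes.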

In this chapter we describe a class of regression techniques that achieve flexibility in estimating the regression function $f(X)$ over the domain $\mathbb{R}^p$ by fitting a different but simple model separately at each query point $x_0$.

During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as dat...

This paper considers the problem of intrusion detection in information systems as a classification problem. In particular the case of the masquerader is treated. This kind of intrusion is one of the more difficult to discover because it may attack already open user sessions. Moreover, this problem is complex because of the large variability of user mod...

In this chapter we describe a class of learning methods that was developed separately in different fields—statistics and artificial intelligence—based on essentially identical models. The central idea is to extract linear combinations of the inputs as derived features, and then model the target as a nonlinear function of these features. The result...

General regression and classification models are constructed as linear combinations of simple rules derived from the data. Each rule consists of a conjunction of a small number of simple statements concerning the values of individual input variables. These rule ensembles are shown to produce predictive accuracy comparable to the best methods. Howev...

In a statistical world faced with an explosion of data, regularization has become an important ingredient. In a wide variety of problems we have many more input features than observations, and the lasso penalty and its hybrids have become increasingly useful for both feature selection and regularization. This talk presents some effective algorithms...

We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate
descent procedure for the lasso, we develop a simple algorithm—the graphical lasso—that is remarkably fast: It solves a 1000-node problem (∼500000 parameters) in at most a minute and is 30–4000 times faster
than compet...

We consider “one-at-a-time” coordinate-wise descent algorithms for a class of convex optimization problems. An algorithm of this kind has been proposed for the $L_1$-penalized regression (lasso) in the literature, but it seems to have been largely ignored. Indeed, it seems that coordinate-wise algorithms are not often used in convex optimization....
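A minimal version of the one-at-a-time idea for the lasso is easy to write down. This sketch assumes columns standardized so that $\frac{1}{n}\sum_i x_{ij}^2 = 1$ (my simplification, not the paper's code):

```python
# Illustrative sketch of cyclical coordinate descent for the lasso,
# minimizing (1/(2n)) ||y - X beta||^2 + lam * ||beta||_1.
# Assumes standardized columns: (1/n) * sum_i x_ij^2 = 1.

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=100):
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    r = list(y)  # residual y - X beta, maintained incrementally
    for _ in range(n_sweeps):
        for j in range(p):
            # partial-residual inner product for coordinate j
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + beta[j]
            new_bj = soft_threshold(rho, lam)
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

# Orthogonal design: the solution is the soft-thresholded OLS fit.
print(lasso_cd([[1, 1], [1, -1]], [3.0, 1.0], 0.5))  # prints [1.5, 0.5]
```

Each coordinate update is a closed-form soft-thresholding step, which is what makes the one-at-a-time strategy so cheap per sweep.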

A novel clustering approach named Clustering Objects on Subsets of Attributes (COSA) has been proposed (Friedman and Meulman (2004). Clustering objects on subsets of attributes. J. R. Statist. Soc. B 66, 1–25) for unsupervised analysis of complex data sets. We demonstrate its usefulness in medical systems biology studies. Examples of metabolomics...

Comment on Classifier Technology and the Illusion of Progress [math.ST/0606441]

Machine learning has emerged as an important tool for separating signal events from associated background in high energy particle physics experiments. This paper describes a new machine learning method based on ensembles of rules. Each rule consists of a conjunction of a small number of simple statements ("cuts") concerning the values of individual i...

Prediction involves estimating the unknown value of an attribute of a system under study given the values of other measured attributes. In prediction (machine) learning the prediction rule is derived from data consisting of previously solved cases. Most methods for predictive learning were originated many years ago at the dawn of the computer age....

A new procedure is proposed for clustering attribute value data. When used in conjunction with conventional distance-based clustering algorithms this procedure encourages those algorithms to detect automatically subgroups of objects that preferentially cluster on subsets of the attribute variables rather than on all of them simultaneously. The re...

In the goodness-of-fit testing problem one is given a data set of N measured observations, each of which is presumed to be randomly drawn independently from some probability distribution with density p(x). The goal is to test the hypothesis that $p(x) = p_0(x)$, where $p_0(x)$ is some specified reference probability density. Ideall...

Discussions of: "Process consistency for AdaBoost" [Ann. Statist. 32 (2004), no. 1, 13-29] by W. Jiang; "On the Bayes-risk consistency of regularized boosting methods" [ibid., 30-55] by G. Lugosi and N. Vayatis; and "Statistical behavior and consistency of classification methods based on convex risk minimization" [ibid., 56-85] by T. Zhang. Include...

Regularization in linear regression and classification is viewed as a two-stage process. First a set of candidate models is defined by a path through the space of joint parameter values, and then a point on this path is chosen to be the final model. Various pathfinding strategies for the first stage of this process are examined,...

Regularization in linear modeling is viewed as a two-stage process. First a set of candidate models is defined by a path through the space of joint parameter values, and then a point on this path is chosen to be the final model. Various pathfinding strategies for the first stage of this process are examined, based on the notion of generalized gradien...

Learning a function of many arguments is viewed from the perspective of high-dimensional numerical quadrature. It is shown that many of the popular ensemble learning procedures can be cast in this framework. In particular randomized methods, including bagging and random forests, are seen to correspond to random Monte Carlo integration methods eac...

In target recognition applications of discriminant or classification analysis, each 'feature' is a result of a convolution of imagery with a filter, which may be derived from a feature vector. It is important to use relatively few features. We analyze an optimal reduced-rank classifier under the two-class situation. Assuming each population is G...

While Cherkassky and Ma (2003) raise some interesting issues in comparing techniques for model selection, their article appears to be written largely in protest of comparisons made in our book, Elements of Statistical Learning (2001). Cherkassky and Ma feel that we falsely represented the structural risk minimization (SRM) method, which they defend...

Predicting future outcomes based on knowledge obtained from past observational data is a common application in a wide variety of areas of scientific research. In the present paper, prediction will be focused on various grades of cervical preneoplasia and neoplasia. Statistical tools used for prediction should of course possess predictive accuracy,...

Three novel statistical approaches (Cluster Analysis by Regressive Partitioning [CARP], Patient Rule Induction Method [PRIM], and ModeMap) have been used to define compositional populations within a large database (n > 13,000) of Cr-pyrope garnets from the subcontinental lithospheric mantle (SCLM). The variables used are the major oxides and proton...

If there ever was a tool that could stimulate the imagination and profit from the intuition and creativity of John Tukey, it was computer graphics. John always saw graphics as being central to exploratory data analysis: "Since the aim of exploratory data analysis is to learn what seems to be, it should be no surprise that pictures play a vital role...

Multiple additive regression trees (MART) is a methodology for predictive data mining (regression and classification). This note illustrates the use of the R/MART interface. It is intended to be a tutorial introduction. Minimal knowledge concerning the technical details of the MART methodology or the use of the R statistical package is presumed.

Gradient boosting constructs additive regression models by sequentially fitting a simple parameterized function (base learner) to current “pseudo”-residuals by least squares at each iteration. The pseudo-residuals are the gradient of the loss functional being minimized, with respect to the model values at each training data point evaluated at the c...

Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion....
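The paradigm described above can be sketched for squared-error loss, where the negative gradient (the pseudo-residual) is just the ordinary residual $y - F(x)$. The stump base learner and all names below are illustrative, not the paper's implementation:

```python
# Illustrative sketch of gradient boosting for squared-error loss:
# repeatedly fit a simple base learner (here a regression stump)
# to the current residuals, and add a shrunken copy to the model.

def fit_mean_split(x, y):
    """Base learner: best single split on 1-D inputs by squared error."""
    m = sum(y) / len(y)
    best = (float("inf"), x[0], m, m)  # fallback: constant predictor
    for s in x:
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((yi - ml) ** 2 for yi in left) + sum((yi - mr) ** 2 for yi in right)
        if err < best[0]:
            best = (err, s, ml, mr)
    _, s, ml, mr = best
    return s, ml, mr

def gradient_boost(x, y, n_rounds=100, shrink=0.1):
    """Return a list of (split, left_value, right_value) shrunken stumps."""
    F = [0.0] * len(x)  # current model values at the training points
    model = []
    for _ in range(n_rounds):
        # pseudo-residuals: negative gradient of squared loss at F
        resid = [yi - Fi for yi, Fi in zip(y, F)]
        s, ml, mr = fit_mean_split(x, resid)
        model.append((s, shrink * ml, shrink * mr))
        F = [Fi + (shrink * ml if xi <= s else shrink * mr) for Fi, xi in zip(F, x)]
    return model

def boost_predict(model, q):
    return sum(ml if q <= s else mr for s, ml, mr in model)
```

For other loss criteria only the `resid` line changes (e.g. sign of the residual for absolute loss), which is exactly the generality the abstract claims for the gradient paradigm.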

The nature of data is rapidly changing. Data sets are becoming increasingly large and complex. Modern methodology for analyzing these new types of data is emerging from the fields of Database Management, Artificial Intelligence, Machine Learning, Pattern Recognition, and Data Visualization. So far Statistics as a field has played a minor role. Thi...