# Max Kuhn's scientific contributions

**What is this page?**

This page lists the scientific contributions of an author who either does not have a ResearchGate profile or has not yet added these contributions to their profile.

This page was automatically created by ResearchGate to provide a record of this author's body of work. We create such pages to advance our goal of building and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.

If you're a ResearchGate member, you can follow this page to keep up with this author's work.

If you are this author, and you don't want us to display this page anymore, please let us know.


## Publications (21)

What goes on inside the black-box algorithms that turn big data into something useful? The answer, say Max Kuhn and Kjell Johnson, is statistical – so statisticians should come to the big data party.

Classification trees fall within the family of tree-based models and, similar to regression trees (Chapter 8), consist of nested if-then statements. Classification trees and rules are basic partitioning models and are covered in Sections 14.1 and 14.2, respectively. Ensemble methods combine many trees (or rules) into one model and tend to have much...
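
The nested if-then structure described above can be illustrated with a single split (a "stump"), the building block a classification tree applies recursively. This is a minimal sketch in Python, not the book's R code; the data and function names are invented for illustration. The best split is the one minimizing the weighted Gini impurity of the two resulting partitions.

```python
# Minimal sketch: one level of a classification tree (a "stump").
# A full tree would apply best_split recursively within each partition.

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(x, y):
    """Return (threshold, weighted_gini) of the best if-then split x <= t."""
    best = (None, float("inf"))
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: classes separate cleanly at x = 3.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = ["a", "a", "a", "b", "b", "b"]
threshold, score = best_split(x, y)
```

On this toy data the split `x <= 3.0` yields two pure partitions, so the weighted impurity drops to zero.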

Determining which predictors should be included in a model is becoming one of the most critical questions as data are becoming increasingly high-dimensional. The chapter demonstrates the negative effect of extra predictors on a number of models (Section 19.1), as well as discussing typical approaches to supervised feature selection such as wrapper...
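
A wrapper method of the kind mentioned above can be sketched as a greedy forward search: at each step, add the predictor whose inclusion most improves the model's resampled performance. The sketch below (Python, synthetic data, names invented for illustration) scores each candidate set with leave-one-out error of a least-squares fit; it is one possible wrapper, not the book's specific procedure.

```python
# Hedged sketch of wrapper-style forward feature selection.
# Only column 0 of X carries signal; columns 1-2 are noise.
import numpy as np

rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)

def loocv_rss(Xm, y):
    """Leave-one-out residual sum of squares for a least-squares fit."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        coef, *_ = np.linalg.lstsq(Xm[mask], y[mask], rcond=None)
        errs.append((y[i] - Xm[i] @ coef) ** 2)
    return sum(errs)

selected, remaining = [], [0, 1, 2]
best_score = float("inf")
while remaining:
    scores = {}
    for j in remaining:
        cols = selected + [j]
        Xs = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        scores[j] = loocv_rss(Xs, y)
    j_best = min(scores, key=scores.get)
    if scores[j_best] >= best_score:
        break  # no candidate improves the wrapped model's score
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)
```

The informative predictor is picked first because the wrapper evaluates each candidate through the model's own performance, not through a predictor-only screen.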

Tree-based models consist of one or more nested if-then statements for the predictors that partition the data. Within these partitions, a model is used to predict the outcome. Regression trees and regression model trees are basic partitioning models and are covered in Sections 8.1 and 8.2, respectively. In Section 8.3, we present rule-based models,...

Several of the preceding chapters have focused on technical pitfalls of predictive models, such as over-fitting and class imbalances. Often, true success may depend on aspects of the problem that are not directly related to the model itself. This chapter discusses topics such as: Type III errors (answering the wrong question, Section 20.1), the eff...

When predicting a numeric outcome, some measure of accuracy is typically used to evaluate the model’s effectiveness. However, there are different ways to measure accuracy, each with its own nuance. In Section 5.1 we define common measures for evaluating quantitative performance. We also discuss the concept of variance-bias trade-off (Section 5.2),...
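
Two of the common measures referred to above, RMSE and R², can be written out directly. The numbers below are illustrative, not the book's data.

```python
# RMSE and R-squared for a small set of observed vs. predicted values.
import math

obs  = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
n = len(obs)

# Root mean squared error: typical size of a residual, in outcome units.
rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)

# R-squared: proportion of outcome variation explained by the model.
mean_obs = sum(obs) / n
ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
ss_tot = sum((o - mean_obs) ** 2 for o in obs)
r2 = 1.0 - ss_res / ss_tot
```

Here the residual sum of squares is 0.10 against a total sum of squares of 5.0, giving R² = 0.98 and RMSE ≈ 0.158.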

Chapter 6 discussed regression models that were intrinsically linear. In this chapter we present regression models that are inherently nonlinear in nature. When using these models, the exact form of the nonlinearity does not need to be known explicitly or specified prior to model training. These models include neural networks (Section 7.1), multiva...

Data preprocessing techniques generally refer to the addition, deletion, or transformation of the training set data. Preprocessing data is a crucial step prior to modeling since data preparation can make or break a model’s predictive ability. To illustrate general preprocessing techniques, we begin by introducing a cell segmentation data set (Secti...
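
One of the simplest transformations in this family, centering and scaling each predictor, can be sketched as follows (Python with an invented toy matrix; the book works in R):

```python
# Center each column to mean 0 and scale to unit standard deviation.
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

means = X.mean(axis=0)
stds = X.std(axis=0, ddof=1)   # sample standard deviation
X_scaled = (X - means) / stds
```

After the transformation both columns are on a common scale, which matters for models (e.g., those based on distances or penalties) that are sensitive to the units of the predictors.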

The data set used in Chapters 6-9 to illustrate the model building process was based on observational data: the samples were selected from a predefined population and the predictors and response were observed. The case study in the chapter is used to explain the model building process for data that emanate from a designed experiment. In a designed...

In Chapters 6-8, we developed a number of models to predict compounds’ solubility. In this chapter we compare and contrast the models’ performance and demonstrate how to select the optimal final model.

In this chapter we discuss several models, all of which are akin to linear regression in that each can directly or indirectly be written in the widely known multiple linear regression form. We begin this chapter by describing a chemistry case study data set (Section 6.1) which will be used to illustrate models throughout this chapter as well as for...

Applied Predictive Modeling covers the overall predictive modeling process, beginning with the crucial steps of data preprocessing, data splitting and foundations of model tuning. The text then provides intuitive explanations of numerous common and modern regression and classification techniques, always with an emphasis on illustrating and solving...

Many predictive models have built-in or intrinsic measurements of predictor importance and have been discussed in previous chapters. For example, multivariate adaptive regression splines and many tree-based models monitor the increase in performance that occurs when adding each predictor to the model. Others, such as linear regression or logistic r...
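
For models without a built-in importance score, a generic model-agnostic alternative is permutation importance: shuffle one predictor at a time and record how much performance degrades. The sketch below (Python, synthetic data, names invented) applies this idea to a least-squares fit; it illustrates the concept rather than any specific method from the chapter.

```python
# Hedged sketch of permutation-based predictor importance.
import numpy as np

rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)  # only column 0 matters

coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def r2(Xm, y, coef):
    resid = y - Xm @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

base = r2(X, y, coef)
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break the predictor-outcome link
    importance.append(base - r2(Xp, y, coef))
```

Shuffling the informative predictor destroys most of the fit, while shuffling the noise predictor leaves performance essentially unchanged.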

High-performance computing (HPC) environments are used by many technology and research organizations to facilitate large-scale computations. HPC systems typically use a job scheduling software which prioritizes jobs for submissions, manages the computational resources, and initiates submitted jobs to maximize efficiency. To assist the scheduler, da...

Chapter 12 discussed classification models that defined linear classification boundaries. In this chapter we present models that generate nonlinear boundaries. We begin with explaining several generalizations to the linear discriminant analysis framework such as quadratic discriminant analysis, regularized discriminant analysis, and mixture discrim...

In this chapter we discuss models that classify samples using linear classification boundaries. We begin this chapter by describing a grant applications case study data set (Section 12.1) which will be used to illustrate models throughout this chapter as well as for Chapters 13-15. As foundational models, we discuss logistic regression (Section 12....
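
Logistic regression, the foundational linear classifier named above, can be fit with plain gradient descent on the log-likelihood. This is a minimal one-predictor sketch in Python with invented data, not the estimation routine used in the book.

```python
# Hedged sketch: logistic regression fit by gradient descent.
import math

x = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 1, 1]

b0, b1 = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    g0 = g1 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # predicted probability
        g0 += (p - yi)          # gradient w.r.t. intercept
        g1 += (p - yi) * xi     # gradient w.r.t. slope
    b0 -= lr * g0 / len(x)
    b1 -= lr * g1 / len(x)

# Classify with the usual 0.5 probability cutoff.
pred = [1 if 1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) > 0.5 else 0 for xi in x]
```

The fitted boundary is linear in the predictor: the model predicts class 1 wherever b0 + b1·x crosses zero.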

When modeling discrete classes, the relative frequencies of the classes can have a significant impact on the effectiveness of the model. An imbalance occurs when one or more classes have very low proportions in the training data as compared to the other classes. Imbalance can be present in any data set or application, and hence, the practitioner sh...
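
One simple remedy for the imbalance described above is down-sampling the majority class before model fitting. The sketch below uses invented labels and counts purely for illustration.

```python
# Hedged sketch: down-sample the majority class to balance a training set.
import random

random.seed(1)
# Hypothetical imbalanced training set: 90 majority vs. 10 minority samples.
data = [("majority", i) for i in range(90)] + [("minority", i) for i in range(10)]

minority = [d for d in data if d[0] == "minority"]
majority = [d for d in data if d[0] == "majority"]

# Keep all minority samples; randomly subsample the majority to match.
balanced = minority + random.sample(majority, len(minority))
```

Down-sampling discards information, so alternatives such as up-sampling, synthetic sampling, or cost-sensitive fitting are often considered as well.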

To begin Part I of this work, we present a simple example that illustrates the broad concepts of model building. Section 2.1 provides an overview of a fuel economy data set for which the objective is to predict vehicles' fuel economy based on standard vehicle predictors such as engine displacement, number of cylinders, type of transmission, and man...

Chapters 12-14 used a variety of different philosophies and techniques to predict grant-funding success. In this chapter we compare and contrast the models' performance on a specific test set and demonstrate how to select the optimal final model.

Many modern classification and regression models are highly adaptable; they are capable of modeling complex relationships. Each model's adaptability is typically governed by a set of tuning parameters, which can allow each model to pinpoint predictive patterns and structures within the data. However, these tuning parameters can very easily identify predic...
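
Choosing a tuning parameter by resampling, the theme of this passage, can be sketched with ridge regression: evaluate each candidate penalty on held-out data and keep the best. The data, grid, and function names below are invented for illustration (Python rather than the book's R).

```python
# Hedged sketch: select a ridge penalty by leave-one-out cross-validation.
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def loo_error(lam):
    """Mean leave-one-out squared error for ridge with penalty lam."""
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        coef = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(5), Xtr.T @ ytr)
        errs.append((y[i] - X[i] @ coef) ** 2)
    return float(np.mean(errs))

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: loo_error(lam) for lam in grid}
best_lam = min(scores, key=scores.get)
```

Evaluating candidates on held-out samples rather than the training set is what guards against the over-fitting the chapter warns about: an over-large penalty (here, 100) is rejected because its held-out error is worse.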

When predicting a categorical outcome, some measure of classification accuracy is typically used to evaluate the model’s effectiveness. However, there are different ways to measure classification accuracy, depending on the modeler’s primary objectives. Most classification models can produce both a continuous and categorical prediction output. In Se...
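
Several of the standard measures follow directly from a two-class confusion matrix. The sketch below uses invented observed/predicted labels to compute accuracy, sensitivity, specificity, and Cohen's kappa.

```python
# Confusion-matrix-based performance measures for a two-class problem.
obs  = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
pred = ["yes", "yes", "no",  "no", "no", "no", "yes", "no"]

tp = sum(1 for o, p in zip(obs, pred) if o == "yes" and p == "yes")
tn = sum(1 for o, p in zip(obs, pred) if o == "no"  and p == "no")
fp = sum(1 for o, p in zip(obs, pred) if o == "no"  and p == "yes")
fn = sum(1 for o, p in zip(obs, pred) if o == "yes" and p == "no")

n = len(obs)
accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

# Cohen's kappa: agreement between obs and pred beyond chance agreement.
p_obs = accuracy
p_chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
kappa = (p_obs - p_chance) / (1 - p_chance)
```

With a 75% majority class here, raw accuracy of 0.75 looks good but kappa (≈ 0.47) tempers it by accounting for chance agreement, one reason multiple measures are worth reporting.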

## Citations

... The choice of this tool is based on the language's many packages that encapsulate classification models and advanced implementations of MLAs, with which each user must be familiar in order to explore, model, and prototype data for their own problem. The package used for our study is "caret" (Classification and Regression Training) (Kuhn and Johnson 2013). This package covers a large fraction of the practice of predictive analysis (classification and regression). ...

... The PLS, SVR, and ANN models are the most commonly used regression models [39]. The PLS model could be traced back to Herman Wold's nonlinear iterative partial least squares algorithm [42]. Comparing the metrics of these three models, the PLS models presented the weakest imitative effect, with the R 2 of the total sets not exceeding 0.5 and the RMSE not being less than 19 days. ...

... C5.0 is an improved version of the C4.5 ML algorithm of the DTs family. C5.0 offers improved computational and memory usage, generates a smaller number of DTs, and incorporates boosting and weighting techniques to improve the accuracy of the model [106]. The implemented C5.0 has three tuning hyperparameters: trials, model, and winnow. ...

... K-nearest neighbors (KNN) uses the K closest samples from the training set to predict a new sample. The K closest training set samples are determined via a distance metric such as Euclidean or Minkowski distance [28]. ...
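
The KNN prediction rule in this snippet is short enough to sketch in full (Python, invented toy data, k = 3, Euclidean distance):

```python
# Minimal K-nearest-neighbor classifier with Euclidean distance.
import math
from collections import Counter

# Toy training set: (features, label) pairs for two well-separated classes.
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]

def knn_predict(query, train, k=3):
    """Majority vote among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda p: math.dist(query, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

pred = knn_predict((0.15, 0.1), train)
```

Because the prediction depends entirely on distances, predictors are usually centered and scaled first so that no variable dominates the metric.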

... One challenge in performing this optimization in practice is that most a are very distinct semantically from a, so that the learning process is slowed by the vast number of "easy" predictions (e.g., where Pr(O_i = O_j | A_i, A_j) is clearly 0). To make matters worse, most match probabilities are 0, a kind of imbalance that can impair final model performance (Kuhn and Johnson, 2013). We adopt two measures to address these two problems. First, we implement a balanced sampling scheme: in the optimization, we ensure half of all training points have Pr ...

... Only variables with a VIF of ≤ 10 were selected for further analysis (Table A1). We followed the procedure described by Kuhn and Johnson [53] to perform the second step. In short, Spearman's rho correlation matrix was calculated. We considered the following bioclimatic variables as potential predictors of forest fire ignition in both countries: mean temperature of the warmest quarter of the year (MTempWrQ), mean temperature of the driest quarter (MTempDQ), precipitation in the warmest quarter (PrecWrQ), and precipitation in the driest quarter (PrecDQ). ...
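
The Kuhn-Johnson correlation filter referenced here removes predictors pairwise: find the most-correlated remaining pair, drop whichever member has the larger average absolute correlation with everything else, and repeat until no pair exceeds a cutoff. A hedged Python sketch with invented data (the cited work uses R's caret, where this is `findCorrelation`):

```python
# Hedged sketch of a pairwise-correlation predictor filter.
import numpy as np

rng = np.random.default_rng(0)
n = 100
a = rng.normal(size=n)
b = a + 0.01 * rng.normal(size=n)   # nearly a duplicate of a
c = rng.normal(size=n)              # independent predictor
X = np.column_stack([a, b, c])

def correlation_filter(X, cutoff=0.9):
    keep = list(range(X.shape[1]))
    while True:
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] < cutoff:
            return keep
        # Drop whichever of the pair is more correlated with the rest.
        drop = i if corr[i].mean() >= corr[j].mean() else j
        keep.pop(drop)

kept = correlation_filter(X)
```

Only one member of the near-duplicate pair survives, while the independent predictor is retained.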

... statistical software. LDA is generally used to identify or classify unknown groups characterised by quantitative and qualitative parameters (Fisher 1936, 1940; Sugiyama 2007); it allows minimising the within-class distance and maximising the between-class distance, achieving maximum class discrimination (Hastie et al. 2001; Holden et al. 2011; Rencher and Christensen 2012; Kuhn and Johnson 2013). For the LDA, Wilks' Lambda method was used with the following default values: for the variable entering the model, F ≥ 3.84 was set, and for the variable removed from the model, it was F ≤ 2.71 (Venora et al. 2009). ...

... In terms of complexity, the tuning parameters used in bootstrap forest models require minimal customization and can often be kept at their default values (Oshiro, Perez, & Baranauskas, 2012). However, unlike regression trees, the interpretability of bootstrap forest is low due to the possible hundreds of trained trees used in the model, and, like most tree-based models built with a 'greedy' algorithm, it is prone to overfitting (Kuhn & Johnson, 2013). ...

... Therefore, in RF a large number of regression trees (e.g., n = 500) are built using bootstrap samples of the original data in order to reduce the correlation between the individual trees. This technique is known as "bagging" (bootstrap aggregation) and improves the predictive performance over a single tree by lowering the variance in the prediction (Kuhn and Johnson, 2013). The de-correlation of the trees is further increased by injecting more randomness in the tree growing process by evaluating only a randomly selected subset of the available predictors at each split. ...
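
The bagging procedure described in this snippet can be sketched end to end: draw bootstrap samples, fit a weak learner on each, and average the predictions. The Python below uses an invented sine-shaped data set and a one-split regression "stump" as the weak learner, a deliberately simplified stand-in for the full trees a random forest would grow.

```python
# Hedged sketch of bagging (bootstrap aggregation) with regression stumps.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)

def fit_stump(x, y):
    """Best single split on x minimizing within-partition squared error."""
    best = None
    for t in np.quantile(x, np.linspace(0.1, 0.9, 17)):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return lambda xq: np.where(xq <= t, lo, hi)

stumps = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)   # bootstrap sample, with replacement
    stumps.append(fit_stump(x[idx], y[idx]))

xq = np.array([-0.8, 0.0, 0.8])
bagged = np.mean([s(xq) for s in stumps], axis=0)
```

Averaging many high-variance learners fit on resampled data lowers the variance of the combined prediction, the mechanism the snippet attributes to bagging; random forests add the per-split random predictor subset on top of this.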

... Feature engineering or feature extraction is one of the major problems in machine learning model construction. It is defined as the process of selecting the most dependable, non-redundant, and important features to use in model creation (Kuhn and Johnson, 2013). For our work, this process was carried out in four steps, namely Encoding, Normalization, Imputation, and Selection, as shown in Fig. 2. ...
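
Two of the four steps named in this snippet, imputation and normalization, can be sketched in order on a toy table (Python; the data, the median-imputation choice, and min-max scaling are illustrative assumptions, since the snippet does not specify the methods used):

```python
# Hedged sketch: median imputation followed by min-max normalization.
# Rows are samples, columns are predictors; None marks a missing value.
data = [[1.0, None], [2.0, 10.0], [3.0, 30.0], [None, 20.0]]

cols = list(zip(*data))
imputed_cols = []
for col in cols:
    present = sorted(v for v in col if v is not None)
    k = len(present)
    median = present[k // 2] if k % 2 else (present[k // 2 - 1] + present[k // 2]) / 2
    imputed_cols.append([median if v is None else v for v in col])

normalized = []
for col in imputed_cols:
    lo, hi = min(col), max(col)
    normalized.append([(v - lo) / (hi - lo) for v in col])

clean = [list(row) for row in zip(*normalized)]  # back to row-per-sample
```

The order matters: imputing before normalizing keeps the filled-in values on the same scale as the observed ones.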