
AdaOpt: Multivariable optimization for classiﬁcation

1 Introduction

The model presented in this paper is AdaOpt, a probabilistic model for statistical/machine learning classification. AdaOpt's classification procedure starts like the AdaBoost classifier's (SAMME; see Hastie et al. (2009)). As we will see in the remainder of this paper, however, AdaOpt is better described as a multivariable optimizer combined with a nearest neighbors algorithm.

Nearest neighbors here means that at test time, unseen examples are classified according to their proximity to known, already classified observations. Depending on the context and domain of study, a closeness-based classification could be philosophically debatable. Human intervention may be necessary and valuable to understand the classifier's decisions: the features used to define the closeness criterion do not (and never will) capture the entire reality embedded in the phenomenon of interest.

Section 2 describes the model in more detail, and in Section 3 we present a numerical example of handwritten digit recognition.

2 AdaOpt classifier

As mentioned in the introduction, the AdaOpt algorithm starts like AdaBoost’s

(Hastie et al. (2009)), but is actually quite different.

First, in AdaOpt, a weak classifier is fitted to the training data. It could be, for example, a stump (a decision tree with one level), but the accuracy of this weak classifier on the training set must be at least 50%. Imposing this at the outset avoids convergence issues later on.
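This first step can be sketched in pure Python. The snippet below is a minimal illustration with hypothetical helper names (not the actual mlsauce implementation): it fits a one-level stump on a single feature and checks the 50% training accuracy requirement.

```python
def fit_stump(x, y):
    """Find the threshold on a single feature that best separates two classes."""
    best = None
    for t in sorted(set(x)):
        preds = [1 if xi >= t else 0 for xi in x]
        acc = sum(p == yi for p, yi in zip(preds, y)) / len(y)
        # also consider the flipped rule (predict 0 above the threshold)
        for rule_acc, rule in ((acc, (t, 1)), (1 - acc, (t, 0))):
            if best is None or rule_acc > best[0]:
                best = (rule_acc, rule)
    return best  # (training accuracy, (threshold, orientation))

x = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
y = [0, 0, 0, 1, 1, 1]
acc, (threshold, orient) = fit_stump(x, y)
assert acc >= 0.5  # required before the optimization phase starts
```

In practice any weak learner satisfying the accuracy condition would do; the stump is only the simplest choice.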

Then, a multivariable variant of gradient descent optimization is applied to the probabilities – of belonging to one class or another – computed by the weak classifier. At each subsequent iteration of the optimizer on the training set, contrary to AdaBoost, no additional weak classifier is fitted. Instead, AdaOpt determines which observations are misclassified, decreases the probability of the wrong decision, and thus mechanically increases the probability of the other decisions.

From this description, we can see that with a large number of iterations, AdaOpt will converge to a perfect classifier on the training set, with an error rate of 0%. However, a number of additional parameters, including a learning rate and an early stopping criterion, can control the convergence of AdaOpt toward this optimum if desired. After the training procedure, we have at our disposal a set of probabilities assigned to each decision, for each training set observation.
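The iterative adjustment described above can be sketched as follows. This is a simplified pure-Python illustration with hypothetical names (the actual mlsauce update rule differs and involves the gradient-descent parameters): shrink the probability of the wrong class for each misclassified observation, renormalize, and stop early when the training error falls below a tolerance.

```python
def adjust_probs(probs, y, n_iterations=50, learning_rate=0.3, tolerance=1e-4):
    """Iteratively decrease wrong-decision probabilities on the training set."""
    for _ in range(n_iterations):
        n_errors = 0
        for i, row in enumerate(probs):
            pred = row.index(max(row))
            if pred != y[i]:
                n_errors += 1
                row[pred] *= (1.0 - learning_rate)   # shrink the wrong decision
                s = sum(row)
                probs[i] = [p / s for p in row]      # renormalize: others increase
        if n_errors / len(y) < tolerance:            # early stopping criterion
            break
    return probs

# toy example: observation 1 starts misclassified (true class is 1)
probs = adjust_probs([[0.9, 0.1], [0.6, 0.4]], y=[0, 1])
```

After a few iterations the misclassified observation's probability mass shifts to the correct class, while already-correct rows are left untouched.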

Finally, at test time, as new, unseen observations arrive in the model, AdaOpt's decisions are based on a distance between these new test observations and the training set observations. As in any nearest neighbors procedure, this requires calculating, for each test observation, its distance to every training observation. In order to alleviate the potential computational burden of deriving nearest neighbors on very large datasets, a few tricks are (currently, as of this writing) implemented:


• Instead of using the whole training set, calculate the distances between new test observations and the centers obtained from a Minibatch k-means clustering of the training set.

• Instead of using the whole training set, calculate the distances between new test observations and a subset of points from the training set, obtained by stratified subsampling.
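The second trick, combined with the nearest-neighbors decision, can be sketched in pure Python (hypothetical names, squared Euclidean distance; the actual mlsauce implementation may differ):

```python
import random
from collections import defaultdict

def stratified_subsample(X, y, row_sample=0.5, seed=123):
    """Keep a fraction `row_sample` of the training rows, per class."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    rng = random.Random(seed)
    kept = []
    for label, idx in by_class.items():
        n_keep = max(1, int(row_sample * len(idx)))
        kept.extend(rng.sample(idx, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]

def predict_knn(x_new, X, y, k=3):
    """Majority class among the k nearest kept training observations."""
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x_new, X[i])))
    votes = [y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]
Xs, ys = stratified_subsample(X, y, row_sample=0.67)
label = predict_knn([1.05], Xs, ys, k=3)
```

Subsampling per class (rather than uniformly) preserves the class proportions of the training set, so rare classes are not accidentally dropped.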

In Section 3, we present a numerical example of handwritten digit recognition. AdaOpt is implemented in Moudiki (2020).

3 Numerical example

This numerical example uses a dataset of handwritten digits from Alpaydin and Kaynak (1998), available in the UC Irvine Machine Learning Repository. The task at hand is to recognize these digits: based on what has been seen before by the AdaOpt learner, we would like to say whether a new handwritten digit is a 0, a 9, etc.

Figure 1: Digits.

The dataset is repeatedly (50 times) split into a training set (80% of the data) and a testing set (the remaining 20%), and we use the stratified subsampling trick described in Section 2 (AdaOpt's parameter row_sample). Test set accuracy is reported in Fig. 2:

Figure 2: Distribution of test set accuracy as a function of the subsampling fraction. x-axis: subsampling fraction (AdaOpt's parameter row_sample); y-axis: test set accuracy (unseen data).


The following parameters are used in figures (Fig. 2) and (Fig. 3):

• n_iterations=50: number of iterations of the optimizer at training time

• learning_rate=0.3: controls the speed of learning at training time

• reg_lambda=0.1: regularization parameter for successive misclassification errors

• reg_alpha=0.5: compromise between L1 and L2 regularization parameters

• eta=0.01: slope in gradient descent at training time

• gamma=0.01: step size in gradient descent at training time

• tolerance=1e-4: controls early stopping at training time

• k=3: number of nearest neighbors

And for the timings¹ of training+prediction in seconds, using the same experimental design, we have (Fig. 3):

Figure 3: Distribution of timings as a function of the subsampling fraction. x-axis: subsampling fraction (AdaOpt's parameter row_sample); y-axis: training+prediction timing in seconds.

¹ Current timings; could be faster in the future.


References

Alpaydin, E. and Kaynak, C. (1998). Cascading classifiers. Kybernetika, 34(4):369–374.

Hastie, T., Rosset, S., Zhu, J., and Zou, H. (2009). Multi-class AdaBoost. Statistics and Its Interface, 2(3):349–360.

Moudiki, T. (2019–2020). mlsauce, Miscellaneous Statistical/Machine Learning stuff. https://github.com/thierrymoudiki/mlsauce. BSD 3-Clause Clear License. Version 0.2.4.
