Multinomial logistic regression using quasi-randomized networks

This paper contributes to the development of quasi-randomized networks: neural networks with quasi-randomized hidden layers. It deals in particular with multinomial logistic regression, a supervised learning method for classifying statistical/machine learning model observations into multiple categories. The model presented here notably takes advantage of clustering and dropout to improve its learning capabilities.
Thierry Moudiki
22nd February 2020
1 Introduction
2 Describing the model
3 Numerical example
1 Introduction
Following some ideas from Moudiki et al. (2018), the model introduced here
is a hybrid penalized regression/neural network model, derived from the class
of randomized neural networks. Randomized neural networks were introduced
by Schmidt et al. (1992), and Random Vector Functional Link neural networks
(RVFL), by Pao et al. (1994).
In RVFL networks, in addition to a single-layer neural network that explains
the non-linear effects of covariates on the response, there is an optional direct
link (linear link) between the explanatory variables and the output variable,
explaining the linear effects. They have been successfully applied to solving
different types of classification and regression problems; see for example Dehuri
and Cho (2010).
Here, the focus is placed on multinomial logistic regression (see Friedman
et al. (2001), Chapter 4); a supervised learning method for classifying
model observations into multiple categories. In order to obtain the hidden layer's
nodes of our RVFL, we use a deterministic Sobol sequence (see Niederreiter
(1992)). Sobol sequences have been successfully used in the past for this type
of models by Moudiki et al. (2018), on multivariate time series. Some data
preprocessing methods such as clustering and dropout (Srivastava et al. (2014))
are also considered in the construction of this model.
In section 2, we describe our penalized multinomial logistic regression model,
and section 3 presents a numerical example of this model applied to a dataset.
2 Describing the model
Our model is based on ideas from Moudiki et al. (2018). It is a hybrid penalized
regression/neural network model, with separate constraints on the linear link
and the hidden layer. As in Zhu and Hastie (2004), Bishop (2006), Friedman
et al. (2010), model probabilities for each class are calculated, for k_0 ∈ {1, . . . , K}
and x ∈ R^p, as:

P(G = k_0 \mid X = x) = \frac{e^{x^T \beta_{k_0}}}{\sum_{k=1}^{K} e^{x^T \beta_k}} \qquad (1)
x is a vector containing the characteristics of an observation: the initial model
covariates, plus non-linear transformations of these covariates, as in Moudiki
et al. (2018). Non-linear transformations of the covariates are obtained through
an activation function (typically a rectified linear unit (ReLU), a hyperbolic
tangent, etc.).
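As an illustrative sketch (not nnetsauce's implementation), the class probabilities in (Eq. 1) can be computed as a numerically stable softmax over the scores x^T β_k; the names below are our own:

```python
import numpy as np

def class_probabilities(x, B):
    """Softmax probabilities P(G = k | X = x) for a feature vector x
    and a p x K coefficient matrix B (one column of coefficients per class)."""
    scores = x @ B              # x^T beta_k, for k = 1, ..., K
    scores -= scores.max()      # shift scores for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# toy example: 3 features, 2 classes
x = np.array([0.5, -1.0, 2.0])
B = np.array([[0.1, -0.2],
              [0.3, 0.0],
              [-0.1, 0.4]])
probas = class_probabilities(x, B)
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged while avoiding overflow for large scores.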
More precisely, all x's are stored as rows in a matrix X:

X := [Z \;\; \Phi(Z)] \qquad (2)
X is the concatenation of the matrices Z and Φ(Z) by columns. The first
columns of X contain the matrix Z: the model's standardized input data, potentially
enriched with clustering information. Clustering can determine a priori, and before
model learning, homogeneous groups of model observations. Typically, this
clustering information on the input data, if requested, consists of one-hot encoded
covariates, one for each k-means or Gaussian mixture model cluster. For
each line i ∈ {1, . . . , n} of the matrix Φ(Z), we have the following terms, with z_i
being the ith line of the matrix Z, W containing terms of a Sobol sequence, and g an
activation function (as mentioned before):

\Phi(Z)_i := g(W^T z_i) \qquad (3)
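A minimal sketch of this construction of X, assuming SciPy's quasi-Monte Carlo Sobol generator for the hidden-layer weights W and ReLU as the activation g; the function name and dimensions are illustrative:

```python
import numpy as np
from scipy.stats import qmc

def build_features(Z, n_hidden=4):
    """Augment standardized inputs Z with a quasi-randomized hidden layer:
    X = [Z  g(Z W)], where the entries of W are terms of a Sobol sequence."""
    n, p = Z.shape
    # p x n_hidden deterministic quasi-random weights (unscrambled Sobol)
    W = qmc.Sobol(d=p, scramble=False).random(n_hidden).T
    g = lambda t: np.maximum(t, 0.0)  # ReLU activation
    return np.hstack([Z, g(Z @ W)])

rng = np.random.default_rng(42)
Z = rng.standard_normal((10, 3))  # 10 standardized observations, 3 covariates
X = build_features(Z)             # 3 original columns + 4 hidden-layer columns
```

Because the Sobol sequence is deterministic, the hidden layer is reproducible across runs without storing random seeds.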
In the construction of Φ(Z), dropout can also be used. The idea of dropout
(Srivastava et al. (2014)) is to randomly remove some nodes in the hidden layer,
in order to prevent the model from being too flexible and from overfitting the input
data. Now, going back to (Eq. 1), K is the total number of classes, and every
β_k ∈ R^p, k ∈ {1, . . . , K}, is a vector of unknown model coefficients:
\beta_k^T = \left[\, (\beta_k^{(l)})^T \;\; (\beta_k^{(d)})^T \,\right]

β_k^{(l)} are the coefficients on the linear link (on Z), and β_k^{(d)} the coefficients on the
hidden layer (on Φ(Z)). These coefficients are determined by optimizing the
model's penalized log-likelihood l. Using (Eq. 1), an expression of our penalized
log-likelihood l for n observations is:

l(X, \beta) = \frac{1}{n} \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} \left( Y \circ X\beta \right)_{i,k} - \log \sum_{k=1}^{K} e^{(X\beta)_{i,k}} \right] - \lambda_1 \|\beta^{(l)}\|_F^2 - \lambda_2 \|\beta^{(d)}\|_F^2 \qquad (4)
Y is a one-hot encoded version of the model response y. β ∈ R^{p×K} is a matrix
of coefficients, containing at column k_0 the coefficients for class k_0 (that is, β_{k_0}).
∘ is an elementwise matrix multiplication, and ‖·‖_F is the Frobenius norm of
a matrix. λ_1 and λ_2 are regularization parameters constraining the norm of the
model coefficients β and preventing overfitting, as in a ridge regression (Hoerl
and Kennard (1970)).
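A NumPy sketch of (Eq. 4), assuming the first p_linear rows of β correspond to the linear link and the remaining rows to the hidden layer; the function name and this split are illustrative:

```python
import numpy as np
from scipy.special import logsumexp

def penalized_loglik(X, Y, B, p_linear, lambda1=0.1, lambda2=0.1):
    """Average multinomial log-likelihood of one-hot responses Y, minus
    separate ridge penalties on the linear-link block (first p_linear rows
    of B) and the hidden-layer block (remaining rows)."""
    scores = X @ B  # (X beta)_{i,k}
    ll = ((Y * scores).sum(axis=1) - logsumexp(scores, axis=1)).mean()
    penalty = (lambda1 * np.sum(B[:p_linear] ** 2)
               + lambda2 * np.sum(B[p_linear:] ** 2))
    return ll - penalty

# tiny example: 4 observations, 3 features (2 linear + 1 hidden), 2 classes
X = np.array([[1., 0., 0.5],
              [0., 1., 0.2],
              [1., 1., 0.0],
              [0., 0., 1.0]])
Y = np.eye(2)[[0, 1, 0, 1]]  # one-hot responses
val = penalized_loglik(X, Y, np.zeros((3, 2)), p_linear=2)
```

At β = 0 every class is equiprobable, so the value reduces to -log K with no penalty, a convenient sanity check.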
The method presented here is available in the Python package nnetsauce (Moudiki
(2019–2020); as of writing, in the development version on GitHub), and the
optimization methods currently available for maximizing the log-likelihood (Eq. 4)
are Newton conjugate gradient and L-BFGS-B.
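As an illustration of this optimization step, the sketch below maximizes a penalized objective by minimizing its negative with SciPy's L-BFGS-B; the toy ridge objective merely stands in for (Eq. 4), and is not nnetsauce's actual implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
lam = 0.5

# negative of a ridge-penalized objective, playing the role of -l(X, beta)
def neg_objective(beta):
    return np.sum((A @ beta - b) ** 2) + lam * np.sum(beta ** 2)

res = minimize(neg_objective, x0=np.zeros(3), method="L-BFGS-B")
beta_hat = res.x
```

For this quadratic stand-in the maximizer has the closed form (A^T A + λI)^{-1} A^T b, which makes the optimizer's output easy to verify; the actual log-likelihood (Eq. 4) has no closed-form maximizer, hence the iterative methods.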
3 Numerical example
This example is based on a dataset from scikit-learn: breast_cancer, the
breast cancer Wisconsin dataset. This dataset contains 569 observations, 30
covariates, and 2 classes. Other examples based on other datasets can be found
on GitHub.
We start by importing the data:

import nnetsauce as ns
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

breast_cancer = load_breast_cancer()
X = breast_cancer.data
y =

# split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Next, we fit the model, and obtain the accuracy and area under the curve (AUC)
on the test set:

# create the model with nnetsauce
fit_obj = ns.Ridge2Classifier(lambda1 = 6.90185578e+04,
                              lambda2 = 3.17392781e+02,
                              row_sample = 4.63427734e-01,
                              dropout = 3.62817383e-01,
                              type_clust = "gmm")

# fit the model on training set, y_train)

# get the accuracy on test set
print(fit_obj.score(X_test, y_test))

# get area under the curve on test set (auc)
print(fit_obj.score(X_test, y_test, scoring="roc_auc"))
In this example, we obtain an accuracy of 98.24% on the test set, and an
AUC of 0.98. See the GitHub repository for other examples, on other datasets.
References

Bishop CM (2006). Pattern recognition and machine learning. Springer.
Dehuri S, Cho SB (2010). “A comprehensive survey on functional link neural
networks and an adaptive PSO–BP learning for CFLNN.” Neural Computing
and Applications,19(2), 187–205.
Friedman J, Hastie T, Tibshirani R (2001). The elements of statistical learning,
volume 1. Springer Series in Statistics, New York.
Friedman J, Hastie T, Tibshirani R (2010). "Regularization paths for generalized
linear models via coordinate descent." Journal of Statistical Software, 33(1), 1–22.
Hoerl AE, Kennard RW (1970). “Ridge regression: Biased estimation for
nonorthogonal problems.” Technometrics,12(1), 55–67.
Moudiki T (2019–2020). "nnetsauce, A general-purpose tool for Statistical/Machine
Learning." BSD 3-Clause Clear License. Version 0.3.3.
Moudiki T, Planchet F, Cousin A (2018). “Multiple Time Series Forecasting
Using Quasi-Randomized Functional Link Neural Networks.” Risks,6(1), 22.
Niederreiter H (1992). Random number generation and quasi-Monte Carlo methods.
SIAM.
Pao YH, Park GH, Sobajic DJ (1994). "Learning and generalization characteristics
of the random vector functional-link net." Neurocomputing, 6(2), 163–180.
Schmidt WF, Kraaijveld MA, Duin RP (1992). “Feedforward neural networks
with random weights.” In Pattern Recognition, 1992. Vol. II. Conference
B: Pattern Recognition Methodology and Systems, Proceedings., 11th IAPR
International Conference on, pp. 1–4. IEEE.
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014).
“Dropout: a simple way to prevent neural networks from overfitting.” The
journal of machine learning research,15(1), 1929–1958.
Zhu J, Hastie T (2004). “Classification of gene microarrays by penalized logistic
regression.” Biostatistics,5(3), 427–443.