Asymptotically Bias-Corrected Regularized Linear
Discriminant Analysis for Cost-Sensitive Binary
Classification
Amin Zollanvari, Member, IEEE, Muratkhan Abdirash, Aresh Dadlani, Member, IEEE, Berdakh
Abibullaev, Member, IEEE
Abstract—In this letter, the theory of random matrices of increasing dimension is used to construct a form of Regularized Linear Discriminant Analysis (RLDA) that asymptotically yields the lowest overall risk with respect to the bias of the discriminant in cost-sensitive classification of two multivariate Gaussian distributions. Numerical experiments using both synthetic and real data show that even in finite-sample settings, the proposed classifier can uniformly outperform RLDA in terms of achieving a lower risk as a function of regularization parameter and misclassification costs.
Index Terms—Regularized linear discriminant, bias correction,
random matrix theory, cost-sensitive classification
I. INTRODUCTION
Linear discriminant analysis (LDA), originally proposed
by R. A. Fisher [1] and then formalized by Wald [2] and
Anderson [3] in the context of decision theory, has found many
applications in its long history in pattern recognition [4], [5],
[6], [7]. LDA is the Bayes plug-in rule for classification of
two Gaussian distributions with a common covariance matrix
[3]. It is well-known that LDA performs poorly when the
dimensionality of observations (i.e., the number of variables)
is comparable in magnitude to the sample size even if the data
is Gaussian [8], [9].
This state of affairs is generally attributed to the poor
performance of the sample covariance matrix in these settings
[8], [9], [10], [11]. This inspired Di Pillo [10], [12] to employ
the ridge technique, which had been previously proposed by
Hoerl and Kennard in the context of regression [13], [14], [15],
to estimate the covariance matrix used in the expression of the
LDA. Replacing the (pooled) sample covariance matrix by its
regularized counterpart led to a classifier known as Regularized
LDA (RLDA).
The classification performance of RLDA depends heavily, though, on the regularization parameter $\gamma$. One way to determine the optimum $\gamma$ (denoted $\gamma_{\mathrm{opt}}$) is to estimate the error rate of RLDA over a predetermined range of $\gamma$ and pick the value that leads to the smallest estimated error. Friedman [11] suggested the use of cross-validation in estimating the error of RLDA and, thereby, in estimating $\gamma_{\mathrm{opt}}$. In [16], we obtained both the asymptotic equivalent and a generalized consistent estimator of the RLDA error rate; that is to say, we obtained a closed-form and remarkably accurate estimator of its error rate, with which $\gamma_{\mathrm{opt}}$ can be estimated efficiently via a univariate search [17].

A. Z., M. A., and A. D. are with the Department of Electrical and Computer Engineering, Nazarbayev University, Astana, Kazakhstan. B. A. is with the Department of Robotics and Mechatronics, Nazarbayev University, Astana, Kazakhstan. The work was supported by the Nazarbayev University Faculty Development Competitive Research Grant under Award SOE2018008.
To obtain the generalized consistent estimator of the RLDA error rate in [16], we used the theory of random matrices of increasing dimension [18], [19] (generally referred to as random matrix theory, RMT) to construct an estimator that converges to the asymptotic equivalent of the error in a regime where the dimension $p$ and the sample size $n$ increase unboundedly in a proportional manner; that is to say, $n \to \infty$, $p \to \infty$, and $p/n \to J$, where $J$ is an arbitrary positive constant. It has been shown that, using this theory, we can obtain estimators that are remarkably accurate over a wide range of dimensions and sample sizes (see [20] for an overview of this theory).
Recently, and independently, Wang and Jiang [21] used random matrix theory to obtain an asymptotic equivalent of the RLDA error rate similar to our results in [16]. Furthermore, they used their results to propose a bias-corrected form of RLDA; that is to say, they add a bias term to the RLDA discriminant that asymptotically leads to the minimum error rate. However, their classifier does not take into account the effect of misclassification costs as is required in the context of cost-sensitive learning—cost-sensitive classification per se is an effective solution for improving classification performance, especially when an imbalanced set of training data is available [22], [23].
In this study, we use our results in [16] to propose the Asymptotically Bias-Corrected RLDA (hereafter referred to as ABC-RLDA), which asymptotically leads to the minimum overall risk, a quantity that is a function of the misclassification costs. As will be seen, many estimators of the required parameters have already been developed in [16]; the main difference here is how they are used. In [16], these estimators were used to construct a generalized consistent estimator of the RLDA error, whereas here they are used to improve the performance of RLDA itself in the context of cost-sensitive classification. In addition, we also propose a generalized consistent estimator of the ABC-RLDA overall risk.
II. COST-SENSITIVE CLASSIFICATION OF TWO GAUSSIAN DISTRIBUTIONS
As in many analytic error analysis studies (e.g., see [24], [25], [26], [27]), we assume separate sampling: $n = n_0 + n_1$ sample points are collected to constitute the sample $S_n$ in $\mathbb{R}^p$, where, given $n$, $n_0$ and $n_1$ are determined (not random) and where $S_0 = \{x_1, x_2, \ldots, x_{n_0}\}$ and $S_1 = \{x_{n_0+1}, x_{n_0+2}, \ldots, x_n\}$ are randomly selected from populations $\Pi_0$ and $\Pi_1$, respectively. $\Pi_i$ follows a multivariate Gaussian distribution $N(\mu_i, \Sigma)$, for $i = 0, 1$, where $\Sigma$ is nonsingular and common across classes. In this binary classification problem, a classifier is a function $\psi: \mathbb{R}^p \to \{0, 1\}$, where $\psi$ is given by $\psi(x) = 0$ if $x \in R_0$ and $\psi(x) = 1$ if $x \in R_1$, and $R_0$ and $R_1$ are measurable sets partitioning the sample space. Let $C(i, j)$ denote the cost of assigning label $i$ when the true class is $j$ and, as is commonly assumed, we consider $C(i, i) = 0$ for $i = 0, 1$; that is to say, no cost is associated with correct classification—for instance, see [3], [22], [23], [28]. The Bayes decision rule minimizes the overall (expected) risk given by [3], [28], [29]

$$
R = C(1,0)\,\alpha_0 \int_{R_1} f(x \mid x \in \Pi_0)\, dx + C(0,1)\,\alpha_1 \int_{R_0} f(x \mid x \in \Pi_1)\, dx = C(1,0)\,\alpha_0\,\varepsilon_0 + C(0,1)\,\alpha_1\,\varepsilon_1, \quad (1)
$$
where $\alpha_i$ is the prior probability of class $i$, $\varepsilon_i$ is the probability of misclassifying an observation from class $i$, and $f(x \mid x \in \Pi_i)$ is the class-conditional density (here assumed to be Gaussian) governing $\Pi_i$. As discussed elsewhere [3], [30], [31], in the case of separate sampling, which is common in practice, $\alpha_i$ cannot be estimated from the sample and must be estimated externally to the available sample (e.g., using prior knowledge). This leads us to write the overall risk (1) as

$$
R = C_{10}\,\varepsilon_0 + C_{01}\,\varepsilon_1, \quad (2)
$$

where $C_{ij} \triangleq C(i, j)\,\alpha_j$, $i \neq j$, weights the cost of misclassification by the prior probabilities of the classes. Setting $C(i, j) = 1$, expression (2) reduces to the classical expression of the overall true error presented, for example, in [16]. Nevertheless, because the cost itself is generally subjective, we may treat $C_{ij}$ as a single measure of cost that captures the overall subjectiveness involved in determining $C(i, j)$ and a specific value of $\alpha_j$.
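As a quick numerical illustration of (2) with hypothetical values: if $C(1,0) = C(0,1) = 1$, $\alpha_0 = 0.75$, $\alpha_1 = 0.25$, $\varepsilon_0 = 0.1$, and $\varepsilon_1 = 0.3$, then $C_{10} = 0.75$, $C_{01} = 0.25$, and the overall risk is $R = 0.75 \times 0.1 + 0.25 \times 0.3 = 0.15$.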
III. REGULARIZED LINEAR DISCRIMINANT ANALYSIS FOR COST-SENSITIVE CLASSIFICATION
The plug-in classifier of the Bayes decision rule that minimizes (2) for two multivariate Gaussian populations with a common covariance matrix is represented by Anderson's statistic defined as [3]

$$
W_{\mathrm{LDA}}(x) = \left(x - \frac{\bar{x}_0 + \bar{x}_1}{2}\right)^T \hat{\Sigma}^{-1} (\bar{x}_0 - \bar{x}_1) - \log\frac{C_{01}}{C_{10}}, \quad (3)
$$

where $\bar{x}_0 = \frac{1}{n_0}\sum_{x_l \in S_0} x_l$ and $\bar{x}_1 = \frac{1}{n_1}\sum_{x_l \in S_1} x_l$ are the sample means for each class and $\hat{\Sigma}$ is the pooled sample covariance matrix,

$$
\hat{\Sigma} = \frac{(n_0 - 1)\hat{\Sigma}_0 + (n_1 - 1)\hat{\Sigma}_1}{n_0 + n_1 - 2}, \quad (4)
$$

where

$$
\hat{\Sigma}_i = \frac{1}{n_i - 1} \sum_{x_l \in S_i} (x_l - \bar{x}_i)(x_l - \bar{x}_i)^T. \quad (5)
$$
The degradation in the performance of LDA when the sample size and dimensionality are comparable is mainly attributed to the ill-conditioned nature of the pooled sample covariance matrix. As discussed before, to mitigate this problem, a more stable (i.e., regularized) estimate of the common covariance matrix is used instead. This leads to the RLDA discriminant represented as [10], [11], [16]

$$
W_{\mathrm{RLDA}}(x) = \gamma \left(x - \frac{\bar{x}_0 + \bar{x}_1}{2}\right)^T H (\bar{x}_0 - \bar{x}_1) - \log\frac{C_{01}}{C_{10}}, \quad (6)
$$

where $\gamma > 0$ and $H = (I_p + \gamma\hat{\Sigma})^{-1}$. The RLDA classifier is then given by

$$
\psi_{\mathrm{RLDA}}(x) = \begin{cases} 1, & \text{if } W_{\mathrm{RLDA}}(x) \le 0 \\ 0, & \text{if } W_{\mathrm{RLDA}}(x) > 0. \end{cases} \quad (7)
$$
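To make (4)–(7) concrete, the following is a minimal Python sketch of the RLDA discriminant and decision rule; the function names and the default costs are ours and are not part of the original letter.

```python
import numpy as np

def rlda_discriminant(x, X0, X1, gamma, C01=0.5, C10=0.5):
    """RLDA discriminant W_RLDA(x) in (6). X0, X1 are (n_i, p) training arrays."""
    n0, p = X0.shape
    n1 = X1.shape[0]
    xbar0, xbar1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled sample covariance matrix (4)-(5); np.cov uses the (n_i - 1) normalization
    Sigma_hat = ((n0 - 1) * np.cov(X0, rowvar=False)
                 + (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    H = np.linalg.inv(np.eye(p) + gamma * Sigma_hat)
    # Anderson-type statistic with the cost-dependent threshold
    return gamma * (x - (xbar0 + xbar1) / 2) @ H @ (xbar0 - xbar1) - np.log(C01 / C10)

def rlda_classify(x, X0, X1, gamma, C01=0.5, C10=0.5):
    """RLDA label assignment per (7): class 1 iff W_RLDA(x) <= 0."""
    return int(rlda_discriminant(x, X0, X1, gamma, C01, C10) <= 0)
```

Because the discriminant is linear in $x$, passing a matrix of test points in place of $x$ returns the vector of their discriminant values.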
IV. GENERALIZED CONSISTENT ESTIMATOR OF THE OVERALL RISK OF THE RLDA
The overall risk of $\psi_{\mathrm{RLDA}}$, given sample $S_n$, is given by

$$
R_{\mathrm{RLDA}} = C_{10}\,\varepsilon_0^{\mathrm{RLDA}} + C_{01}\,\varepsilon_1^{\mathrm{RLDA}}, \quad (8)
$$

where

$$
\varepsilon_i^{\mathrm{RLDA}} = \Pr\big((-1)^i W_{\mathrm{RLDA}}(x) \le 0 \,\big|\, x \in \Pi_i, S_n\big). \quad (9)
$$

Replacing (6) in (9) leads to (cf. equation (11) in [16]):

$$
\varepsilon_i^{\mathrm{RLDA}} = \Phi\!\left(\frac{(-1)^{i+1} G(\mu_i)}{\sqrt{D(\Sigma)}}\right), \quad (10)
$$

where $\Phi(\cdot)$ denotes the cumulative distribution function of a standard normal random variable and

$$
\begin{aligned}
G(\mu_i) &= \left(\mu_i - \frac{\bar{x}_0 + \bar{x}_1}{2}\right)^T H (\bar{x}_0 - \bar{x}_1) - \frac{1}{\gamma}\log\frac{C_{01}}{C_{10}}, \\
D(\Sigma) &= (\bar{x}_0 - \bar{x}_1)^T H \Sigma H (\bar{x}_0 - \bar{x}_1).
\end{aligned} \quad (11)
$$
Note that $\varepsilon_i^{\mathrm{RLDA}}$ (and consequently $R_{\mathrm{RLDA}}$) is a function of the parameters of the class-conditional densities ($\mu_i$ and $\Sigma$) and should be estimated in practice. We consider the following sequence of Gaussian discrimination problems relative to RLDA:

$$
\Xi_p = \{\Pi_0, \Pi_1, \mu_0, \mu_1, \Sigma, n_0, n_1, W_{\mathrm{RLDA}}\}, \quad p = 1, 2, \ldots \quad (12)
$$

where $\Pi_i$ denotes a population, all parameters depend on $p$ (which is omitted to ease the notation), and $\Xi_p$ is restricted by the following conditions:

A: $\Pi_i$ follows a multivariate Gaussian distribution $N(\mu_i, \Sigma)$.

B: $n_0 \to \infty$, $n_1 \to \infty$, $p \to \infty$, and the following limits exist: $p/n_0 \to J_0 > 0$, $p/n_1 \to J_1 > 0$, and $p/(n_0 + n_1) \to J < \infty$.

C: All eigenvalues of $\Sigma$ are located in a segment $[c_1, c_2]$, where $c_1 > 0$ and $c_2$ does not depend on $p$.

D: $|\mu_i|$ is bounded over all $p$; that is, there exists $K$ such that $|\mu_i| < K$ for $i = 0, 1$ and $p = 1, 2, \ldots$
We previously showed the following theorem, which proposes a generalized consistent estimator of $\varepsilon_i^{\mathrm{RLDA}}$, denoted $\hat{\varepsilon}_i^{\mathrm{RLDA}}$ (cf. Section III.A in [16]).

Theorem 1: Under conditions A-D:

$$
\varepsilon_i^{\mathrm{RLDA}} - \hat{\varepsilon}_i^{\mathrm{RLDA}} \xrightarrow{a.s.} 0, \quad (13)
$$

where

$$
\hat{\varepsilon}_i^{\mathrm{RLDA}} = \Phi\!\left(\frac{(-1)^{i+1}\hat{G}_i}{\sqrt{\hat{D}}}\right), \quad (14)
$$

$$
\hat{G}_i = G(\bar{x}_i) + (-1)^{i+1}\frac{(n_0 + n_1 - 2)\,\hat{\delta}}{n_i}, \quad (15)
$$

$$
\hat{D} = (1 + \gamma\hat{\delta})^2 D(\hat{\Sigma}), \quad (16)
$$

$$
\hat{\delta} = \frac{\dfrac{p}{n_0 + n_1 - 2} - \dfrac{\mathrm{tr}[H]}{n_0 + n_1 - 2}}{\gamma\left(1 - \dfrac{p}{n_0 + n_1 - 2} + \dfrac{\mathrm{tr}[H]}{n_0 + n_1 - 2}\right)}, \quad (17)
$$

and $G(\bar{x}_i)$ and $D(\hat{\Sigma})$ are obtained by replacing $\mu_i$ with $\bar{x}_i$ in $G(\mu_i)$ and replacing $\Sigma$ with $\hat{\Sigma}$ in $D(\Sigma)$ defined in (11), respectively. Furthermore, we have

$$
\hat{G}_i - G(\mu_i) \xrightarrow{a.s.} 0, \qquad \hat{D} - D(\Sigma) \xrightarrow{a.s.} 0, \quad (18)
$$

where $\xrightarrow{a.s.}$ denotes almost sure convergence.
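The estimators (14)–(17) require only the training sample. A minimal Python sketch, with helper names of our choosing, might look as follows:

```python
import numpy as np
from scipy.stats import norm

def rlda_risk_estimators(X0, X1, gamma, C01=0.5, C10=0.5):
    """Sketch of the Theorem 1 estimators: returns (G_hat, D_hat, eps_hat, delta_hat)."""
    n0, p = X0.shape
    n1 = X1.shape[0]
    xbar0, xbar1 = X0.mean(axis=0), X1.mean(axis=0)
    Sigma_hat = ((n0 - 1) * np.cov(X0, rowvar=False)
                 + (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    H = np.linalg.inv(np.eye(p) + gamma * Sigma_hat)
    m = n0 + n1 - 2
    # delta_hat in (17)
    delta = (p / m - np.trace(H) / m) / (gamma * (1 - p / m + np.trace(H) / m))
    diff = xbar0 - xbar1
    def G(mu):  # G(.) of (11) evaluated at a sample mean
        return (mu - (xbar0 + xbar1) / 2) @ H @ diff - np.log(C01 / C10) / gamma
    D_hat = (1 + gamma * delta) ** 2 * (diff @ H @ Sigma_hat @ H @ diff)   # (16)
    G_hat = [G(xbar0) - m * delta / n0,   # (15), i = 0
             G(xbar1) + m * delta / n1]   # (15), i = 1
    # Class-conditional error estimates (14)
    eps_hat = [norm.cdf((-1) ** (i + 1) * G_hat[i] / np.sqrt(D_hat)) for i in (0, 1)]
    return G_hat, D_hat, eps_hat, delta
```

The estimated overall risk of RLDA at a given $\gamma$ then follows from (8) as C10 * eps_hat[0] + C01 * eps_hat[1].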
V. ASYMPTOTICALLY BIAS-CORRECTED RLDA: ABC-RLDA
Suppose we add a bias term $\omega$ to $W_{\mathrm{RLDA}}(x)$ defined in (6):

$$
\widetilde{W}_{\mathrm{RLDA}}(x, \omega) = W_{\mathrm{RLDA}}(x) + \omega. \quad (19)
$$

Let $\widetilde{R}_{\mathrm{RLDA}}(\omega)$ be the overall risk of the classifier constructed by the discriminant $\widetilde{W}_{\mathrm{RLDA}}(x, \omega)$. The effect of adding $\omega$ to $W_{\mathrm{RLDA}}(x)$ appears in the overall risk as

$$
\widetilde{R}_{\mathrm{RLDA}}(\omega) = C_{10}\,\widetilde{\varepsilon}_0^{\mathrm{RLDA}}(\omega) + C_{01}\,\widetilde{\varepsilon}_1^{\mathrm{RLDA}}(\omega), \quad (20)
$$

where $\widetilde{\varepsilon}_i^{\mathrm{RLDA}}(\omega)$ is obtained by replacing $G(\mu_i)$ with $G(\mu_i) + \frac{\omega}{\gamma}$ in (10) for $i = 0, 1$. We can then find $\omega_{\mathrm{opt}}$, defined as the $\omega$ that minimizes the overall risk $\widetilde{R}_{\mathrm{RLDA}}(\omega)$. However, because $\omega_{\mathrm{opt}}$ depends on functionals of the actual distributional parameters, it should be estimated in practice. The following theorem provides a generalized consistent estimator of $\omega_{\mathrm{opt}}$ under conditions A-D. We prove the theorem based on our previous results developed in [16] (i.e., Theorem 1 above).
Theorem 2: Let

$$
\hat{\omega}_{\mathrm{opt}} = \frac{\gamma\,\hat{D}\,\ln\frac{C_{01}}{C_{10}}}{\hat{G}_1 - \hat{G}_0} - \gamma\,\frac{\hat{G}_0 + \hat{G}_1}{2}, \quad (21)
$$

where $\hat{G}_i$ and $\hat{D}$ are defined in (15) and (16), respectively. Under conditions A-D, $\widetilde{R}_{\mathrm{RLDA}}(\hat{\omega}_{\mathrm{opt}})$, which is the overall risk of the classifier obtained by the discriminant $\widetilde{W}_{\mathrm{RLDA}}(x, \hat{\omega}_{\mathrm{opt}})$, is minimum with respect to $\omega$. That is to say,

$$
\widetilde{R}_{\mathrm{RLDA}}(\hat{\omega}_{\mathrm{opt}}) - \min_{\omega}\widetilde{R}_{\mathrm{RLDA}}(\omega) \xrightarrow{a.s.} 0. \quad (22)
$$
Proof: The value of $\omega_{\mathrm{opt}}$, which minimizes $\widetilde{R}_{\mathrm{RLDA}}(\omega)$ in (20), is obtained by taking its derivative with respect to $\omega$ and setting it to zero. This leads to the optimum value

$$
\omega_{\mathrm{opt}} = \frac{\gamma\,D(\Sigma)\,\ln\frac{C_{01}}{C_{10}}}{G(\mu_1) - G(\mu_0)} - \gamma\,\frac{G(\mu_0) + G(\mu_1)}{2}. \quad (23)
$$
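For the reader's convenience, the intermediate steps behind (23) can be sketched as follows (writing $G_i$ for $G(\mu_i)$, $D$ for $D(\Sigma)$, and $\phi$ for the standard normal density). With $\widetilde{\varepsilon}_i^{\mathrm{RLDA}}(\omega) = \Phi\big((-1)^{i+1}(G_i + \omega/\gamma)/\sqrt{D}\big)$, differentiating (20) and setting the derivative to zero gives

$$
\frac{d\widetilde{R}_{\mathrm{RLDA}}(\omega)}{d\omega}
= \frac{1}{\gamma\sqrt{D}}\left[C_{01}\,\phi\!\left(\frac{G_1 + \omega/\gamma}{\sqrt{D}}\right) - C_{10}\,\phi\!\left(\frac{G_0 + \omega/\gamma}{\sqrt{D}}\right)\right] = 0,
$$

so that, after taking logarithms of the resulting equality,

$$
\ln C_{01} - \frac{(G_1 + \omega/\gamma)^2}{2D} = \ln C_{10} - \frac{(G_0 + \omega/\gamma)^2}{2D}
\;\Longrightarrow\;
(G_1 - G_0)\left(G_0 + G_1 + \frac{2\omega}{\gamma}\right) = 2D\ln\frac{C_{01}}{C_{10}},
$$

and solving the last equality for $\omega$ yields (23).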
The value of $\omega_{\mathrm{opt}}$ depends on the actual class-conditional parameters through $G(\mu_i)$ and $D(\Sigma)$ defined in (11). However, we can replace these quantities by their generalized consistent estimators presented in Theorem 1 (i.e., $\hat{G}_i$ and $\hat{D}$) to obtain a generalized consistent estimator of $\omega_{\mathrm{opt}}$ such that

$$
\hat{\omega}_{\mathrm{opt}} - \omega_{\mathrm{opt}} \xrightarrow{a.s.} 0. \quad (24)
$$

The result immediately follows by replacing (24) in (14) and proceeding as in the proof of (13).
Assuming $C_{01} = C_{10}$, $\hat{\omega}_{\mathrm{opt}}$ presented in (21) reduces to the bias term obtained using RMT in [21] to minimize the overall error rate under the assumption of $\alpha_0 = \alpha_1$. We also note that if we set $C(i, j) = 1$, Proposition 2 in [32] reduces to expression (23) above for RLDA; however, in [32] the estimator of the optimal bias term is obtained simply by replacing the actual distributional parameters $\mu_i$ and $\Sigma$ by their sample estimators, leading to a "plug-in" estimator of the optimal bias term. Expression (21), in contrast, provides an estimator of the optimal bias term that is consistent in the generalized sense.

The discriminant function for the ABC-RLDA classifier is

$$
W_{\mathrm{ABC}}^{\mathrm{RLDA}}(x) \triangleq W_{\mathrm{RLDA}}(x) + \hat{\omega}_{\mathrm{opt}}, \quad (25)
$$

where $W_{\mathrm{RLDA}}(x)$ and $\hat{\omega}_{\mathrm{opt}}$ are defined in (6) and (21), respectively. The corresponding classifier $\psi_{\mathrm{ABC}}^{\mathrm{RLDA}}(x)$ is then obtained by replacing $W_{\mathrm{RLDA}}(x)$ with $W_{\mathrm{ABC}}^{\mathrm{RLDA}}(x)$ in (7). From (22), $\psi_{\mathrm{ABC}}^{\mathrm{RLDA}}(x)$ has asymptotically the lowest overall risk with respect to the bias term. Therefore, for all $\gamma$, ABC-RLDA outperforms RLDA in an asymptotic sense (also see Supplementary Materials Appendix A). Let

$$
R_{\mathrm{ABC}}^{\mathrm{RLDA}} \triangleq \widetilde{R}_{\mathrm{RLDA}}(\hat{\omega}_{\mathrm{opt}}), \quad (26)
$$

$$
\varepsilon_{i,\mathrm{ABC}}^{\mathrm{RLDA}} \triangleq \widetilde{\varepsilon}_i^{\mathrm{RLDA}}(\hat{\omega}_{\mathrm{opt}}). \quad (27)
$$
Theorem 1 and Theorem 2 lead to the following proposition, which provides a generalized consistent estimator of the overall risk of $\psi_{\mathrm{ABC}}^{\mathrm{RLDA}}(x)$.

Proposition: Under conditions A-D:

$$
\varepsilon_{i,\mathrm{ABC}}^{\mathrm{RLDA}} - \hat{\varepsilon}_{i,\mathrm{ABC}}^{\mathrm{RLDA}} \xrightarrow{a.s.} 0, \quad (28)
$$

where

$$
\hat{\varepsilon}_{i,\mathrm{ABC}}^{\mathrm{RLDA}} = \Phi\!\left(\frac{(-1)^{i+1}\big(\hat{G}_i + \frac{\hat{\omega}_{\mathrm{opt}}}{\gamma}\big)}{\sqrt{\hat{D}}}\right), \quad (29)
$$

$\hat{G}_i$ and $\hat{D}$ are defined in (15) and (16), respectively, and the generalized consistent estimator of $R_{\mathrm{ABC}}^{\mathrm{RLDA}}$ is obtained by replacing $\widetilde{\varepsilon}_i^{\mathrm{RLDA}}(\omega)$ with $\hat{\varepsilon}_{i,\mathrm{ABC}}^{\mathrm{RLDA}}$ in (20).
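Combining (21), (25), and (29), a minimal Python sketch of the estimated optimal bias term and the plug-in estimate of the ABC-RLDA overall risk might look as follows; it relies on the hypothetical rlda_risk_estimators() helper sketched after Theorem 1.

```python
import numpy as np
from scipy.stats import norm

def abc_rlda_bias_and_risk(X0, X1, gamma, C01=0.5, C10=0.5):
    """Returns (omega_hat, estimated overall risk of ABC-RLDA)."""
    G_hat, D_hat, _, _ = rlda_risk_estimators(X0, X1, gamma, C01, C10)
    # Estimated optimal bias term (21)
    omega_hat = (gamma * D_hat * np.log(C01 / C10) / (G_hat[1] - G_hat[0])
                 - gamma * (G_hat[0] + G_hat[1]) / 2)
    # Generalized consistent estimators of the ABC-RLDA class-conditional errors (29)
    eps_abc = [norm.cdf((-1) ** (i + 1) * (G_hat[i] + omega_hat / gamma) / np.sqrt(D_hat))
               for i in (0, 1)]
    # Plugging these into (20) estimates the overall risk of ABC-RLDA
    return omega_hat, C10 * eps_abc[0] + C01 * eps_abc[1]
```

Classification with ABC-RLDA then amounts to adding omega_hat to W_RLDA(x) as in (25) and assigning class 1 whenever the shifted discriminant is non-positive.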
VI. NUMERICAL EXAMPLES
This section presents numerical experiments using both synthetic and real data to examine and compare the performance of the ABC-RLDA with the RLDA and its bias-corrected forms proposed in [21] and [32] (identified by RLDA [21] and RLDA [32] in figures). More examples are provided in Supplementary Materials Appendix B. We conduct our comparisons by taking $(C_{01}, C_{10})$ as (0.9, 0.1), (0.8, 0.2), (0.2, 0.8), and (0.1, 0.9). These values of costs vary their ratio $C \triangleq \frac{C_{01}}{C_{10}} \in \{9, 4, 1/4, 1/9\}$ over a relatively wide range, although in practice it is common to set $C_{01} > C_{10}$, assuming class 1 is the minority class [22]. Furthermore, our comparisons are across a wide range of $\gamma$, ranging from 0.01 to 1000.
A. Synthetic Data

In our Monte Carlo simulations, we use the following protocol to compare the performance of the four classifiers using synthetic data:

Synthetic Model Specification: we assume two 200-variate Gaussian distributions where $\Sigma$ has 1 on the diagonal and 0.1 off the diagonal, and $\mu_1 = -\mu_0$ with $\mu_0 = (a, a, \ldots, a)$, where $a$ is determined according to the Mahalanobis distance between the classes, $\Delta^2 = (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1)$, such that $\Delta^2 = 5$.
[Fig. 1: Average (overall) risk of the four types of RLDA (RLDA, ABC-RLDA, RLDA [21], RLDA [32]) as a function of γ for different values of the cost ratio C (C = 9, 4, 1/4, 1/9), obtained on synthetic data.]
[Fig. 2: Average (overall) risk of the four types of RLDA (RLDA, ABC-RLDA, RLDA [21], RLDA [32]) as a function of γ for different values of the cost ratio C (C = 9, 4, 1/4, 1/9), obtained on real data from [33].]
Step I: Using a fixed pair $\Pi_0 = N(\mu_0, \Sigma)$ and $\Pi_1 = N(\mu_1, \Sigma)$, generate a training set of size $n = 60$ where $n_0 = 3 n_1$.

Step II: Using the training data, for each value of $\gamma$ in the predetermined range of 0.01 to 1000, construct the RLDA classifier using (6), the ABC-RLDA classifier using (21) and (25), and the RLDA [21] and [32] classifiers.

Step III: Find the overall risk of each classifier by generating 500 additional test observations from each class, estimating the misclassification error rates on the test sample (i.e., estimates of $\varepsilon_0$ and $\varepsilon_1$ in (2)), and computing the overall risk $R$ from (2) using the pre-specified values of $C_{01}$ and $C_{10}$.

Step IV: Repeat Steps I through III 500 times for each value of $C$ and $\gamma$ and record the average $R$.
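A compact Python sketch of this protocol is given below for a single cost and γ setting; it uses fewer repetitions than the 500 of the article and relies on the hypothetical rlda_discriminant() and abc_rlda_bias_and_risk() helpers sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n1 = 200, 15
n0 = 3 * n1                                            # n = n0 + n1 = 60
Sigma = np.full((p, p), 0.1) + 0.9 * np.eye(p)         # 1 on the diagonal, 0.1 off
ones = np.ones(p)
a = np.sqrt(5.0 / (4.0 * ones @ np.linalg.solve(Sigma, ones)))  # gives Delta^2 = 5
mu0, mu1 = a * ones, -a * ones
C01, C10, gamma = 0.9, 0.1, 1.0                        # one cost/regularization setting

risks = []
for _ in range(50):                                    # the article uses 500 repetitions
    X0 = rng.multivariate_normal(mu0, Sigma, size=n0)          # Step I
    X1 = rng.multivariate_normal(mu1, Sigma, size=n1)
    omega_hat, _ = abc_rlda_bias_and_risk(X0, X1, gamma, C01, C10)   # Step II
    T0 = rng.multivariate_normal(mu0, Sigma, size=500)          # Step III: test data
    T1 = rng.multivariate_normal(mu1, Sigma, size=500)
    w0 = rlda_discriminant(T0, X0, X1, gamma, C01, C10) + omega_hat
    w1 = rlda_discriminant(T1, X0, X1, gamma, C01, C10) + omega_hat
    eps0 = np.mean(w0 <= 0)                            # class-0 points assigned label 1
    eps1 = np.mean(w1 > 0)                             # class-1 points assigned label 0
    risks.append(C10 * eps0 + C01 * eps1)              # overall risk (2)
print(np.mean(risks))                                  # Step IV: average R for ABC-RLDA
```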
Fig. 1 shows the average overall risk of the different classifiers as a function of $\gamma$ for different values of $C$. We observe that ABC-RLDA uniformly outperforms the other classifiers across the specified range of $\gamma$ and $C$. This is not surprising, as the bias term used in ABC-RLDA asymptotically minimizes the overall risk for each value of $\gamma$ and, at the same time, the asymptotic results obtained through random matrix theory are remarkably accurate in finite-sample settings [16], [19], [20].
B. Real Data, Genomics:

As the ABC-RLDA is developed under a Gaussian model, it is necessary to examine its performance on real data. However, as we have seen before, expressions developed using random matrix theory are quite robust to non-Gaussianity of the data [17]. Here, we use the gene expression microarray data collected originally in [33] and preprocessed for binary classification in [17]. The data contain 181 observations, where 61 observations are labeled as class 1 (toxicants treatment) and 120 observations are from class 0 (other treatments). The data originally contain 8491 features but, similar to [17], we employ a t-test feature selection to keep the top 200 features with the lowest P-values. The protocol for performing our experiments using real data is similar to that for the synthetic data except that: (1) in Step I the full data is randomly split into a training and a test set with a ratio of 80% to 20% such that the class-sample ratio of 61/181 is kept in both training and test sets; and (2) in Step II models are constructed on the training data obtained in Step I and their error rates, which are used to determine the overall risk, are estimated on the test data.
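As an illustration, one possible Python implementation of the feature selection and stratified split described above is sketched below; X and y are assumed to hold the preprocessed expression matrix and labels, and the exact ordering of the steps is our reading of the text rather than the authors' code.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.model_selection import train_test_split

# X: (181, 8491) preprocessed expression matrix, y: 0/1 labels (assumed available)
_, pvals = ttest_ind(X[y == 0], X[y == 1], axis=0)   # two-sample t-test per feature
X_sel = X[:, np.argsort(pvals)[:200]]                # keep the 200 smallest p-values

# One Monte Carlo iteration: a stratified 80/20 split preserves the 61/181 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sel, y, test_size=0.2, stratify=y, random_state=0)
# X_tr[y_tr == 0] and X_tr[y_tr == 1] are then fed to the RLDA/ABC-RLDA routines
# sketched earlier, and the overall risk is estimated from the errors on X_te.
```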
Fig. 2 shows the average risk of the different classifiers as a function of $\gamma$ for different values of $C$ in the real data experiments. We observe that the ABC-RLDA generally outperforms the other classifiers for different values of $\gamma$ and $C$.
VII. CONCLUSION
Our finite-sample numerical experiments show that the proposed ABC-RLDA classifier can uniformly outperform RLDA in terms of yielding a lower average overall risk across a wide range of regularization parameters and misclassification costs. This result is the consequence of two factors: (1) the ABC-RLDA is systematically designed to minimize the overall risk of RLDA by adding an asymptotically optimal bias term; and (2) the asymptotic results obtained through random matrix theory are remarkably accurate for finite-sample estimation of the intended parameters (here, the optimal bias term). Future work can focus on: 1) studying the performance of the ABC-RLDA in multi-class cost-sensitive classification (e.g., see [34]); 2) studying scenarios where the cost itself could be variable; and 3) using RMT to calibrate other classifiers for cost-sensitive classification.
REFERENCES
[1] R. Fisher, “The use of multiple measurements in taxonomic problems,”
Ann. Eugen., vol. 7, no. 2, pp. 179–188, 1936.
[2] A. Wald, “On a statistical problem arising in the classification of an
individual into one of two groups, Ann. Math. Statist., vol. 15, pp. 145–
162, 1944.
[3] T. Anderson, “Classification by multivariate analysis, Psychometrika,
vol. 16, no. 1, pp. 31–50, 1951.
[4] D. Swets and J. Weng, “Using discriminant eigenfeatures for image
retrieval, IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 8, pp.
831–836, 1996.
[5] S. van Vuuren and H. Hermansky, “Data-driven design of rasta-like
filters,” in Eurospeech, 1997.
[6] S. Kim, E. R. Dougherty, I. Shmulevich, K. R. Hess, S. R. Hamilton,
J. M. Trent, G. N. Fuller, and W. Zhang, “Identification of combination
gene sets for glioma classification,” Mol. Cancer Ther., vol. 1, no. 13,
pp. 1229–1236, 2002.
[7] F. Lotte, "Signal processing approaches to minimize or suppress calibration time in oscillatory activity-based brain-computer interfaces," Proceedings of the IEEE, vol. 103, pp. 871–890, 2015.
[8] J. V. Ness, “On the dominance of nonparametric bayes-rule discriminant
algorithms in high dimensions,” Pattern Recognition, vol. 5, pp. 843–854,
1980.
[9] R. Peck and J. V. Ness, “The use of shrinkage estimators in linear
discriminant analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 4,
pp. 409–424, 1982.
[10] P. J. D. Pillo, “The application of bias to discriminant analysis, Com-
munications in Statistics - Theory and Methods, vol. 5, pp. 843–854,
1976.
[11] J. Friedman, "Regularized discriminant analysis," J. Amer. Stat. Assoc., vol. 84, pp. 165–175, 1989.
[12] P. J. D. Pillo, “Biased discriminant analysis: Evaluation of the optimum
probability of misclassification,” Communications in Statistics - Theory
and Methods, vol. 8, pp. 1447–1457, 1979.
[13] A. E. Hoerl, “Application of ridge analysis to regression problems,
Chemical Engineering Progress, vol. 58, pp. 54–59, 1962.
[14] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation
for nonorthogonal problems,” Technometrics, vol. 12, pp. 55–59, 1970.
[15] ——, “Ridge regression: Applications to nonorthogonal problems,” Tech-
nometrics, vol. 12, pp. 69–82, 1970.
[16] A. Zollanvari and E. R. Dougherty, “Generalized consistent error estima-
tor of linear discriminant analysis,” IEEE Trans. Sig. Proc., vol. 63, pp.
2804–2814, 2015.
[17] D. Bakir, A. P. James, and A. Zollanvari, "An efficient method to estimate the optimum regularization parameter in RLDA," Bioinformatics, vol. 32, pp. 3461–3468, 2016.
[18] E. P. Wigner, “Characteristic vectors of bordered matrices with infinite
dimensions,” Ann. Math., vol. 62, no. 3, pp. 548–564, 1955.
[19] V. L. Girko, Statistical Analysis of Observations of Increasing Dimen-
sion. Dordrecht: Kluwer Academic Publishers, 1995.
[20] R. Couillet and M. Debbah, “Signal processing in large systems: A new
paradigm,” IEEE Signal Process. Mag., vol. 30, pp. 24–39, 2013.
[21] C. Wang and B. Jiang, “On the dimension effect of regularized linear
discriminant analysis,” Electronic Journal of Statistics, vol. 12, pp. 2709–
2742, 2018.
[22] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Trans.
Knowledge and Data Engineering, vol. 21, pp. 1263–1284, 2009.
[23] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, “Cost-sensitive
boosting for classification of imbalanced data,” Pattern Recognition,
vol. 40, pp. 3358–3378, 2007.
[24] L. Dalton and E. R. Dougherty, “Bayesian minimum mean-square error
estimation for classification error–Part I: Definition and the Bayesian
MMSE error estimator for discrete classification,” IEEE Trans. Sig. Proc.,
vol. 59, no. 1, pp. 115–129, 2011.
[25] ——, "Bayesian minimum mean-square error estimation for classification error–Part II: Linear classification of Gaussian models," IEEE Trans. Sig. Proc., vol. 59, no. 1, pp. 130–144, 2011.
[26] S. Raudys and A. Saudargiene, “First-order tree-type dependence be-
tween variables and classification performance,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 23, no. 2, pp. 233–239, 2001.
[27] G. J. McLachlan, “Estimation of the errors of misclassification on the
criterion of asymptotic mean square error, Technometrics, vol. 16, no. 2,
pp. 255–260, 1974.
[28] P. Domingos, "Metacost: A general method for making classifiers cost-sensitive," in Proc. Int'l Conf. Knowledge Discovery and Data Mining, 1999, pp. 155–164.
[29] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd Ed).
Wiley, 2001.
[30] M. S. Esfahani and E. R. Dougherty, “Effect of separate sampling on
classification accuracy, Bioinformatics, vol. 30, pp. 242–250, 2013.
[31] M. Braga-Neto, A. Zollanvari, and E. R. Dougherty, “Cross-validation
under separate sampling: strong bias and how to correct it, Bioinformat-
ics, vol. 30, pp. 3349–3355, 2014.
[32] Q. Mai, H. You, and M. Yuan, “A direct approach to sparse discriminant
analysis in ultra-high dimensions,” Biometrika, vol. 99, pp. 29–42, 2012.
[33] G. Natsoulis, L. El Ghaoui, G. R. Lanckriet, A. M. Tolley, F. Leroy,
S. Dunlea, B. P. Eynon, C. I. Pearson, S. Tugendreich, and K. Jarnagin,
“Classification of a large microarray data set: algorithm comparison and
analysis of drug signatures,” Genome research, vol. 15, no. 5, pp. 724–
736, 2005.
[34] Z. Zhou and X. Liu, “On multi-class cost-sensitive learning,” Computa-
tional Intelligence, vol. 26, no. 3, pp. 232–257, 2010.
Asymptotically Bias-Corrected Regularized Linear
Discriminant Analysis for Cost-Sensitive Binary
Classification: Supplementary Materials
Amin Zollanvari, Member, IEEE, Muratkhan Abdirash, Aresh Dadlani, Member, IEEE, Berdakh
Abibullaev, Member, IEEE
APPENDIX A: UNIFORM IMPROVEMENT OF ABC-RLDA W.R.T. RLDA AS A FUNCTION OF γ IN A DOUBLE ASYMPTOTIC SENSE
For each $\gamma$, the overall risk of the RLDA classifier is $\widetilde{R}_{\mathrm{RLDA}}(0)$, where $\widetilde{R}_{\mathrm{RLDA}}(\omega)$ is defined in equation (20) in the article. At the same time, we have (equation (22) in the article)

$$
\widetilde{R}_{\mathrm{RLDA}}(\hat{\omega}_{\mathrm{opt}}) - \min_{\omega}\widetilde{R}_{\mathrm{RLDA}}(\omega) \xrightarrow{a.s.} 0.
$$

Therefore,

$$
\widetilde{R}_{\mathrm{RLDA}}(0) - \widetilde{R}_{\mathrm{RLDA}}(\hat{\omega}_{\mathrm{opt}}) \xrightarrow{a.s.} c, \quad (S.1)
$$

where $c$ is a constant greater than or equal to 0. Because (S.1) holds for each $\gamma$, it holds for all $\gamma$.
APPENDIX B: ADDITIONAL NUMERICAL EXAMPLES
This appendix presents more numerical experiments using
both synthetic and real data to examine and compare the
performance of the ABC-RLDA with the RLDA and its bias-
corrected forms proposed in Refs. [21] and [32] in the article.
Synthetic Data:
The following results are obtained similarly to the synthetic data experiments presented in the article, except that we use two 50-variate Gaussian distributions. We observe that ABC-RLDA uniformly outperforms the other classifiers across the specified range of γ and C.

[Fig. S.1: Average (overall) risk of the four types of RLDA (RLDA, ABC-RLDA, RLDA [21], RLDA [32]) as a function of γ for different values of the cost ratio C (C = 9, 4, 1/4, 1/9), obtained on synthetic data.]
Real Data, EEG Recordings:
In this experiment, we examined the performance of clas-
sifiers on a dataset that we collected in a recent investigation
[1]. There we studied the discriminatory effect of spatiospectral
features to capture the most relevant set of neural activities
from electroencephalographic (EEG) recordings that represent
users’ mental intent to “target” and “non-target” stimuli. The
protocol for data collection is described in detail in [1]. The
full dataset, which is publicly available at [2], contains multiple
subjects. Here we just present the results obtained on one of
the subjects that contains 884 observations (223 targets vs.
661 non-targets) and 96 features. The features are the (three)
parameters of a sinusoidal model of order 2 estimated over
16 channels (3×2×16 = 96; see [1] for more details).
The protocol for performing this experiment is similar to the protocol used for the genomic real data presented in the article, except for setting the size of the training data to n = 120; that is to say, at each iteration of the Monte Carlo simulations, 120 observations are randomly selected from the full data and the rest are used as test data. In Fig. S.2, we observe that ABC-RLDA outperforms the other classifiers across the specified range of γ and C.
[Fig. S.2: Average (overall) risk of the four types of RLDA (RLDA, ABC-RLDA, RLDA [21], RLDA [32]) as a function of γ for different values of the cost ratio C (C = 9, 4, 1/4, 1/9), obtained on real data collected in [1].]
REFERENCES
[1] B. Abibullaev and A. Zollanvari, “Learning discriminative spatiospectral
features of ERPs for accurate brain-computer interfaces,” In Press, IEEE
Journal of Biomedical and Health Informatics, 2019.
[2] ——, "Event-related potentials (P300, EEG) - BCI dataset," 2019.
[Online]. Available: http://dx.doi.org/10.21227/8aae-d579
List of Basic Notations and Assumptions. Introduction to the English Edition. 1: Introduction to General Statistical Analysis. 2: Limit Theorems for the Empirical Generalized Variance. 3: The Canonical Equations C1,...,C3 for the Empirical Covariance Matrix. 4: Limit Theorems for the Eigenvalues of Empirical Covariance Matrices. 5: G2-Estimator for the Stieltjes Transform of the Normalized Spectral Function of Covariance Matrices. 6: Statistical Estimators for Solutions of Systems of Linear Algebraic Equations. References. Index.