Comparison of logistic regression and linear discriminant analysis: a simulation study
Metodološki zvezki, Vol. 1, No. 1, 2004, 143-161
Maja Pohar1, Mateja Blas2, and Sandra Turk3
Abstract
Two of the most widely used statistical methods for analyzing
categorical outcome variables are linear discriminant analysis and logistic
regression. While both are appropriate for the development of linear
classification models, linear discriminant analysis makes more assumptions
about the underlying data. Hence, logistic regression is assumed to be the more flexible and more robust method when these assumptions are violated. In this paper we consider the problem of choosing between the
two methods, and set some guidelines for proper choice. The comparison
between the methods is based on several measures of predictive accuracy.
The performance of the methods is studied by simulations. We start with an
example where all the assumptions of the linear discriminant analysis are
satisfied and observe the impact of changes regarding the sample size,
covariance matrix, Mahalanobis distance and direction of distance between
group means. Next, we compare the robustness of the methods towards categorisation and non-normality of explanatory variables in a closely controlled way. We show that the results of LDA and LR are close
whenever the normality assumptions are not too badly violated, and set
some guidelines for recognizing these situations. We discuss the
inappropriateness of LDA in all other cases.
1 Introduction
Linear discriminant analysis (LDA) and logistic regression (LR) are widely used
multivariate statistical methods for analysis of data with categorical outcome
1 Department of Medical Informatics, University of Ljubljana; maja.pohar@mf.uni-lj.si
2 Postgraduate student of Statistics, University of Ljubljana; mateja.blas@guest.arnes.si
3 Sandra Turk, Krka d.d., Novo mesto; sandra.turk@krka.biz
variables. Both of them are appropriate for the development of linear
classification models, i.e. models associated with linear boundaries between the
groups.
Nevertheless, the two methods differ in their basic idea. While LR makes no
assumptions on the distribution of the explanatory data, LDA has been developed
for normally distributed explanatory variables. It is therefore reasonable to expect
LDA to give better results in the case when the normality assumptions are
fulfilled, but in all other situations LR should be more appropriate. The theoretical properties of LR and LDA are thoroughly dealt with in the literature; however, in practice the choice of method is often related more to the conventions of the research field than to whether the assumptions are actually fulfilled.
The goal of this paper is not to discourage the current practice but rather to set
some guidelines as to when the choice of either one of the methods is still
appropriate. While LR is much more general and has a number of desirable theoretical properties, LDA should be the better choice if we know that the population is normally distributed. However, in practice, the assumptions are nearly always violated, and
we have therefore tried to check the performance of both methods with
simulations. This kind of research demands a careful control, so we have decided
to study just a few chosen situations, trying to find a logic in the behaviour and
then to think about the expansion onto more general cases. We have confined
ourselves to compare only the predictive power of the methods.
The article is organized as follows. Section 2 briefly reviews LR and LDA and
explains their graphical representation. Section 3 details the criteria chosen to
compare both methods. Section 4 describes the process of the simulations. The
results obtained are presented and discussed in Section 5, starting with the case where all the assumptions of LDA are fulfilled and continuing with cases where normality is violated in the sense of categorisation and skewness. It is shown how
violation of the assumptions of LDA affects both methods and how robust the
methods are. The paper concludes with some guidelines for the choice between the
models and a discussion.
2 Logistic regression and linear discriminant analysis
The goal of LR is to find the best fitting and most parsimonious model to describe
the relationship between the outcome (dependent or response variable) and a set of
independent (predictor or explanatory) variables. The method is relatively robust,
flexible and easily used, and it lends itself to a meaningful interpretation. In LR,
unlike in the case of LDA, no assumptions are made regarding the distribution of
the explanatory variables.
Contrary to popular belief, both methods can be applied to outcomes with more than two categories (Hosmer and Lemeshow, 1989, p. 216). To simplify, we only focus on
the case of a dichotomous outcome variable (Y). The LR model can be expressed as

$$P(Y_i = 1 \mid X_i) = \frac{e^{\beta^T X_i}}{1 + e^{\beta^T X_i}} \qquad (2.1)$$
where the Yi are independent Bernoulli random variables. The coefficients of this
model are estimated using the maximum likelihood method. LR is discussed
further by Hosmer and Lemeshow (1989).
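To make the estimation step concrete, the coefficients of (2.1) can be obtained by Newton-Raphson iterations on the log-likelihood. The following Python sketch is ours (the paper's simulations were run in R), and the toy data are hypothetical:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood estimates of the LR coefficients in (2.1)
    via Newton-Raphson (IRLS). X: (n, p) matrix, y: (n,) 0/1 vector."""
    Xd = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))     # P(Y=1|X) from (2.1)
        W = p * (1 - p)                          # IRLS weights
        grad = Xd.T @ (y - p)                    # score vector
        hess = Xd.T @ (Xd * W[:, None])          # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# toy data: one explanatory variable, group 1 shifted to the right
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])
y = np.repeat([0.0, 1.0], 200)
beta = fit_logistic(x[:, None], y)   # beta[0] = intercept, beta[1] = slope
```

With two unit-variance normal groups whose means differ by 2, the fitted slope should settle near 2 and the intercept near −2.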
Linear discriminant analysis can be used to determine which variables discriminate between two or more classes, and to derive a classification model for
predicting the group membership of new observations (Worth and Cronin, 2003).
For each of the groups, LDA assumes the explanatory variables to be normally
distributed with equal covariance matrices. The simplest LDA has two groups. To
discriminate between them, a linear discriminant function that passes through the
centroids of the two groups can be used. LDA is discussed further by Kachigan
(1991). The standard LDA model assumes that the conditional distribution of X given Y = y is multivariate normal with mean vector µy and common covariance matrix Σ.
With some algebra we can show that the probability of assigning x to group 1 is

$$P(1 \mid x) = \frac{1}{1 + e^{-(\alpha + \beta^T x)}} \qquad (2.2)$$
where the coefficients α and β are

$$\beta = \Sigma^{-1}(\mu_1 - \mu_0), \qquad \alpha = \log\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0) \qquad (2.3)$$
π1 and π0 are the prior probabilities of belonging to group 1 and group 0. In practice the parameters π1, π0, µ1, µ0 and Σ will be unknown, so we replace them by their sample estimates, i.e.:

$$\hat\pi_1 = \frac{n_1}{n}, \quad \hat\pi_0 = \frac{n_0}{n}, \quad \hat\mu_1 = \bar x_1 = \frac{1}{n_1}\sum_{i:\,y_i=1} x_i, \quad \hat\mu_0 = \bar x_0 = \frac{1}{n_0}\sum_{i:\,y_i=0} x_i,$$

$$\hat\Sigma = \frac{1}{n}\left[\sum_{i:\,y_i=1}(x_i-\bar x_1)(x_i-\bar x_1)^T + \sum_{i:\,y_i=0}(x_i-\bar x_0)(x_i-\bar x_0)^T\right] \qquad (2.4)$$
(2.2) is equal in form to LR. Hence, the two methods do not differ in functional
form, they only differ in the estimation of coefficients.
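The difference in coefficient estimation can be made explicit with a plug-in implementation of (2.3) and (2.4). This is our own Python sketch (not the authors' R code), with hypothetical toy data:

```python
import numpy as np

def lda_coefficients(X, y):
    """Estimate (alpha, beta) of (2.3) using the plug-in quantities (2.4).
    X: (n, p) explanatory variables, y: (n,) group labels in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    pi0, pi1 = len(X0) / len(X), len(X1) / len(X)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled covariance matrix, divided by n as in (2.4)
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / len(X)
    Sinv = np.linalg.inv(S)
    beta = Sinv @ (m1 - m0)                                        # (2.3)
    alpha = np.log(pi1 / pi0) - 0.5 * (m1 + m0) @ Sinv @ (m1 - m0)
    return alpha, beta

# two spherical normal groups with means (0,0) and (2,0)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (500, 2)),
               rng.normal([2, 0], 1.0, (500, 2))])
y = np.repeat([0, 1], 500)
alpha, beta = lda_coefficients(X, y)
p = 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))   # group-1 probability, (2.2)
```

Because (2.2) has the same functional form as (2.1), these (alpha, beta) can be dropped into the same predictor as the LR estimates, which is exactly the point made above.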
2.1 Graphical representation: An explanation
When the values of α and β are known, the expression for the set of points with equal probability of allocation can be derived as

$$\frac{e^{\alpha+\beta^T x}}{1+e^{\alpha+\beta^T x}} = 0.5 \;\Rightarrow\; \alpha + \beta^T x = 0 \qquad (2.5)$$

In a two-dimensional perspective this set of points is a line, while in three dimensions it is a plane.

Figure 1 shows the scatterplot for two explanatory variables. Each of the two groups is plotted with a different character. The linear borders presented are calculated on the basis of the estimates of each method. The ellipses indicate the distributions assumed by the LDA.

Figure 1: The linear borders between the groups for LR (solid) and LDA (dotted line).
3 Comparison criteria
The simplest and the most frequently used criterion for comparison between the
two methods is classification error (percent of incorrectly classified objects; CE).
However, classification error is a very insensitive and statistically inefficient
measure (Harrell, 1997). The fact is that the classification error is usually nearly
the same in both methods, but, when differences exist, they are often
overestimated (for example, if the threshold for “yes” is 0.50, a prediction of 0.99
rates the same as one of 0.51). Classification error is least informative in the case of categorical explanatory variables. The boundary lines in the figures below differ approximately equally in their coefficients, but
the classification errors provide different information. In Figure 2a, one of the
possible outcomes lies in the area where the lines are different, and therefore the
predictions will differ in all objects with this outcome. On the contrary, the area
between the lines in Figure 2b covers none of the possible outcomes. The
classification error therefore does not reveal any difference.
Figure 2a and 2b: Examples of categorised explanatory variables.
Since more information is needed regarding the predictive accuracy of the methods than just a binary classification rule, Harrell and Lee (1985) proposed four different measures for comparing the predictive accuracy of the two methods.
These measures are indexes A, B, C and Q. They are better and more efficient
criteria for comparisons and they tell us how well the models discriminate between
the groups and/or how good the prediction is. Theoretical insight and experience with simulations reveal that some indexes are more appropriate than others under different assumptions. In this work, we focus on three measures of predictive accuracy: the B, C and Q indexes. Because of its intuitive clarity we sometimes
add the classification error (CE) as well.
The C index is purely a measure of discrimination (discrimination refers to the
ability of a model to discriminate or separate values of Y). It is written as follows
$$C = \frac{1}{n_0 n_1}\sum_{i:\,Y_i=0}\;\sum_{j:\,Y_j=1}\left[I(P_j > P_i) + \frac{1}{2}\, I(P_j = P_i)\right] \qquad (3.1)$$

where $P_k$ denotes an estimate of $P(Y_k = 1 \mid X_k)$ from (2.1) and $I$ is an indicator function.
We can see that the value of the C index depends only on the ordering of the predicted probabilities, not on their actual values; as such, it is a measure of discrimination between the groups rather than a measure of the accuracy of prediction. A C index of 1 indicates perfect discrimination; a C index of 0.5 indicates random prediction.
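The pairwise counting in (3.1) can be sketched directly. This is our Python illustration; the probability vectors are hypothetical:

```python
import numpy as np

def c_index(p, y):
    """C index from (3.1): over all (group 0, group 1) pairs, count
    concordant probability orderings, with ties counting one half."""
    p0, p1 = p[y == 0], p[y == 1]
    gt = (p1[:, None] > p0[None, :]).sum()   # concordant pairs
    eq = (p1[:, None] == p0[None, :]).sum()  # tied pairs
    return (gt + 0.5 * eq) / (len(p0) * len(p1))

y = np.array([0, 0, 1, 1])
print(c_index(np.array([0.1, 0.2, 0.8, 0.9]), y))   # 1.0: perfect discrimination
print(c_index(np.array([0.5, 0.5, 0.5, 0.5]), y))   # 0.5: random prediction
```

Note that rescaling all probabilities monotonically leaves the value unchanged, which is exactly why C measures discrimination only.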
The B and Q indexes can be used to assess the accuracy of the outcome
prediction. The B index measures an average of squared difference between an
estimated and actual value:
$$B = 1 - \frac{1}{n}\sum_{i=1}^{n}(P_i - Y_i)^2 \qquad (3.2)$$
where Pi is the predicted probability of classifying observation i into group 1, Yi is the actual group membership (1 or 0), and n is the total sample size of both populations. The values of the B index lie in the interval [0,1], where 1 indicates perfect prediction. In the case of random prediction in two equally sized groups, the value of the B index is 0.75.
The Q index is similar to the B index and is also a measure of predictive
accuracy:
$$Q = 1 + \frac{1}{n}\sum_{i=1}^{n}\log_2\!\left(P_i^{\,Y_i}(1-P_i)^{1-Y_i}\right) \qquad (3.3)$$
A Q index of 1 indicates perfect prediction, a Q index of 0 indicates random prediction, and values less than 0 indicate worse-than-random predictions. When a predicted probability of exactly 0 or 1 occurs for the wrong outcome, the Q index is undefined. The B, C
and Q indexes are discussed further by Harrell and Lee (1985).
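Both accuracy indexes translate into one-liners. Again a Python sketch of ours with hypothetical inputs; the random-prediction values match those stated above:

```python
import numpy as np

def b_index(p, y):
    """B index (3.2): one minus the mean squared prediction error."""
    return 1 - np.mean((p - y) ** 2)

def q_index(p, y):
    """Q index (3.3): one plus the mean base-2 log-likelihood; undefined
    when a probability of exactly 0 or 1 is assigned to the wrong outcome."""
    return 1 + np.mean(y * np.log2(p) + (1 - y) * np.log2(1 - p))

y = np.array([0, 0, 1, 1])
p_random = np.full(4, 0.5)
print(b_index(p_random, y))   # 0.75: random prediction, equal group sizes
print(q_index(p_random, y))   # 0.0: random prediction
```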
While the C index is purely a measure of discrimination, the B and Q indexes
(besides discrimination) also consider accuracy of prediction. Hence, we can
expect these two indexes to be the most sensitive measures in our simulations.
Instead of comparing the indexes directly, we will often focus only on the
proportion of simulations in which LR predicts better than LDA. As we always
perform 50 simulations, this proportion will be statistically significant whenever it
lies outside the interval [0.36, 0.64].
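The interval [0.36, 0.64] follows from the normal approximation to a binomial proportion with 50 trials under the null hypothesis p = 0.5, as this quick check of ours shows:

```python
from math import sqrt

n = 50                      # simulations per setting
se = sqrt(0.5 * 0.5 / n)    # standard error of a proportion under H0: p = 0.5
lo, hi = 0.5 - 1.96 * se, 0.5 + 1.96 * se
print(round(lo, 2), round(hi, 2))   # 0.36 0.64
```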
4 Description of the Simulations
4.1 Basic function

The basic function enables us to draw random samples of sizes n and m from two multivariate normal populations with different mean vectors but equal covariance matrix Σ. The mean vector of one group is always set at (0,0). The distance to the other one is measured by the Mahalanobis distance, while the direction is set as the angle (denoted by υ) to the direction of the eigenvector of the covariance matrix.

Each sample is then randomly divided into two parts, a training and a test sample. The coefficients of LDA and LR are computed on the training sample, and predictions are then made on the test sample. The sampling experiment is replicated 50 times, and each time the indexes for both methods are computed. Finally, the average values of the indexes and the proportion of simulations in which LR performs better are recorded.
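The sampling step described above can be sketched as follows; this is our Python illustration (the paper used R), and the helper name `draw_groups` is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_groups(n, m, Sigma, M, angle):
    """Sketch of the basic sampling step: group 0 is centred at (0, 0);
    group 1 lies at Mahalanobis distance M, in the direction rotated by
    `angle` from the leading eigenvector of Sigma."""
    w, V = np.linalg.eigh(Sigma)
    e1 = V[:, np.argmax(w)]                       # leading eigenvector
    c, s = np.cos(angle), np.sin(angle)
    direction = np.array([[c, -s], [s, c]]) @ e1  # rotate by the angle
    # scale so that the Mahalanobis distance of mu1 from (0, 0) equals M
    d2 = direction @ np.linalg.inv(Sigma) @ direction
    mu1 = direction * (M / np.sqrt(d2))
    X0 = rng.multivariate_normal(np.zeros(2), Sigma, n)
    X1 = rng.multivariate_normal(mu1, Sigma, m)
    return X0, X1

Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
X0, X1 = draw_groups(50, 50, Sigma, M=2.0, angle=np.pi / 4)
```

Splitting the pooled sample into training and test halves and fitting both methods then proceeds as described in the text.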
4.2 Categorization
After sampling, the normally distributed variables can be categorised, either only
one or both of them. The minimum and maximum values are computed, and the whole interval is then divided into a certain number of equal-width categories.
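The equal-width categorisation just described can be sketched as follows (our Python illustration; the function name and the use of category midpoints as representative values are our assumptions, since the paper does not state which value represents a category):

```python
import numpy as np

def categorise(x, k):
    """Divide the range [min(x), max(x)] into k equal-width categories
    and replace each value by the midpoint of its category (assumption)."""
    edges = np.linspace(x.min(), x.max(), k + 1)
    # inner edges only; clip so the maximum falls into the last category
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)
    mids = (edges[:-1] + edges[1:]) / 2
    return mids[idx]

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(categorise(x, 2))   # [1. 1. 3. 3. 3.]
```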
4.3 Skewness
As in the case of categorisation, we can decide to transform only one of the two explanatory variables or both of them. A Box-Cox type of transformation (Box and Cox, 1964) is used to make the normal distribution skewed.
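A transformation of this type can be sketched by applying an inverse Box-Cox transform to a normal variable; the parameterisation below is our assumption, as the paper does not give the exact transformation it used:

```python
import numpy as np

def inverse_boxcox(z, lam):
    """Apply the inverse Box-Cox transform to a (roughly normal) variable z
    to make it skewed; lam = 0 gives the exponential (lognormal) case.
    A sketch only; the paper's exact transformation is not reproduced."""
    if lam == 0:
        return np.exp(z)
    # clip to keep the base positive before taking the 1/lam power
    return np.power(np.clip(lam * z + 1, 1e-9, None), 1 / lam)

rng = np.random.default_rng(2)
z = rng.normal(0.0, 1.0, 100000)
x = inverse_boxcox(z, 0.0)                       # right-skewed (lognormal)
skew = np.mean((x - x.mean()) ** 3) / np.std(x) ** 3
```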
4.4 Remarks
To ensure clarity of the graphical representation, we have confined ourselves to a two-dimensional perspective, i.e. two explanatory variables. We have nevertheless
made some simulations in more dimensions, but the trends of the results seemed to
follow the same pattern.
In most of the simulations we have also set an upper limit for the Mahalanobis
distance, in order to prevent LR from failing to converge and LDA from giving
unreliable results.
To simplify, we have fixed the two group sizes to be equal. As unequally sized groups (or unequal a priori probabilities in LR) only shift the border line closer to the smaller group (the one with the less probable outcome), this only affects the constant, while the coefficient estimates remain the same.
All the simulations and computations were performed by using the statistical
software package R.
5 Results
5.1 Comparison of methods when LDA assumptions are satisfied

We start from the situation where both explanatory variables are normally distributed and observe the impact of changes in the following parameters: sample size, covariance matrix, Mahalanobis distance, and direction of the distance between the group means.

The sample size has the most obvious impact on the difference between the methods. LDA assumes normality, so the errors it makes in prediction are due only to errors in estimating the mean and variance from the sample. LR, on the contrary, assumes nothing about the distribution and adapts itself to it. Therefore, in the case of small samples, the difference between the distribution of the training sample and that of the test sample can be substantial. As the sample size increases, the sampling distributions become more stable, which leads to better results for LR. Consequently, the results of the two methods draw closer together, because the populations are normally distributed.
Table 1: Simulation results for the effect of sample size (n).

Average values of the indexes:

    n        B                  C                  Q                  CE
             LR       LDA       LR       LDA       LR       LDA       LR       LDA
    40       0.7747   0.7861    0.7190   0.7199    0.0489   0.1089    0.1785   0.1700
    60       0.7846   0.7925    0.7405   0.7405    0.1029   0.1334    0.1693   0.1647
    100      0.7939   0.7993    0.7593   0.7590    0.1313   0.1541    0.1591   0.1527
    200      0.7967   0.7982    0.7536   0.7537    0.1456   0.1514    0.1593   0.1585
    1000     0.8008   0.8011    0.7609   0.7609    0.1595   0.1608    0.1550   0.1543

The proportion of simulations in which LR performs better:

    n        B                   C                   Q                   CE
             LR bet.   same      LR bet.   same      LR bet.   same      LR bet.   same
    40       0.18      0.00      0.36      0.18      0.14      0.00      0.24      0.32
    60       0.20      0.00      0.36      0.28      0.20      0.00      0.36      0.28
    100      0.20      0.00      0.48      0.16      0.22      0.00      0.26      0.18
    200      0.24      0.00      0.48      0.08      0.24      0.00      0.36      0.24
    1000     0.26      0.00      0.62      0.00      0.30      0.00      0.32      0.18

Parameters: Σ = (1, 0.5; 0.5, 1), υ = π/4

Figure 3: The impact of sample size of n=50 (left), n=100 (middle) and n=200 (right).
The results in Table 1 confirm the considerations above. As the sample size increases, the LDA coefficient estimates become more accurate and therefore all four indexes improve (in the printed table, bold face highlights the better-performing method). The LR indexes increase even faster, thus approaching those of LDA. The decreasing difference between the two methods is best seen in the Q index, which is the most sensitive one. As the differences between the index means are negligible, it is also interesting to look at the proportion of simulations in which LR performs better. It can be seen that the proportions to which we pay special attention, those of the B index and of the Q index, are constantly increasing.
In the case of the other changes (tables below) the results of the two methods remain very close; in fact, LDA is only slightly better than LR. The exception appears in the case of a large Mahalanobis distance, presented in Table 4: for low values of the Mahalanobis distance LDA yields better results, but as the distance increases above 2, LR performs better.
Table 2: Simulation results for the effect of correlation between explanatory variables (σ).

Average values of the indexes:

    σ        B                  C                  Q                  CE
             LR       LDA       LR       LDA       LR       LDA       LR       LDA
    0        0.7938   0.7979    0.7536   0.7533    0.1340   0.1499    0.1623   0.1587
    0.20     0.7909   0.7967    0.7490   0.7495    0.1215   0.1456    0.1629   0.1587
    0.50     0.7925   0.7965    0.7497   0.7498    0.1291   0.1456    0.1601   0.1580
    0.90     0.7961   0.7990    0.7568   0.7567    0.1403   0.1535    0.1575   0.1561

The proportion of simulations in which LR performs better:

    σ        B                   C                   Q                   CE
             LR bet.   same      LR bet.   same      LR bet.   same      LR bet.   same
    0        0.20      0.00      0.54      0.12      0.26      0.00      0.30      0.22
    0.20     0.12      0.00      0.32      0.12      0.18      0.00      0.20      0.36
    0.50     0.20      0.00      0.44      0.12      0.20      0.00      0.34      0.22
    0.90     0.20      0.00      0.46      0.18      0.26      0.00      0.32      0.30

Parameters: υ = π/4, m = n = 50
Table 3: Simulation results for the effect of direction of distance between group means (υ).

Average values of the indexes:

    υ        B                  C                  Q                  CE
             LR       LDA       LR       LDA       LR       LDA       LR       LDA
    0        0.7928   0.7969    0.7502   0.7501    0.1322   0.1475    0.1629   0.1609
    π/4      0.7957   0.7989    0.7548   0.7547    0.1392   0.1524    0.1579   0.1565
    π/3      0.7991   0.8029    0.7642   0.7645    0.1491   0.1644    0.1511   0.1480
    π/2      0.7966   0.8012    0.7620   0.7619    0.1428   0.1613    0.1579   0.1569

The proportion of simulations in which LR performs better:

    υ        B                   C                   Q                   CE
             LR bet.   same      LR bet.   same      LR bet.   same      LR bet.   same
    0        0.18      0.00      0.44      0.14      0.16      0.00      0.28      0.36
    π/4      0.30      0.00      0.44      0.26      0.36      0.00      0.34      0.18
    π/3      0.22      0.00      0.40      0.14      0.28      0.00      0.24      0.34
    π/2      0.22      0.00      0.36      0.30      0.24      0.00      0.32      0.30

Parameters: Σ = (1, 0.5; 0.5, 1), m = n = 50
Table 4: Simulation results for the effect of Mahalanobis distance (M).

Average values of the indexes:

    M        B                  C                  Q                  CE
             LR       LDA       LR       LDA       LR       LDA       LR       LDA
    0.50     0.7687   0.7697    0.6769   0.6767    0.0525   0.0554    0.1889   0.1871
    1.00     0.7947   0.7985    0.7552   0.7551    0.1331   0.1512    0.1606   0.1569
    1.25     0.8014   0.8067    0.7741   0.7747    0.1568   0.1799    0.1486   0.1458
    2.00     0.8305   0.8315    0.8372   0.8374    0.2612   0.2650    0.1241   0.1224
    3.00     0.8570   0.8557    0.8857   0.8860    0.3575   0.3492    0.1026   0.0975
    4.50     0.8922   0.8816    0.9310   0.9305    0.4994   0.4398    0.0756   0.0747

The proportion of simulations in which LR performs better:

    M        B                   C                   Q                   CE
             LR bet.   same      LR bet.   same      LR bet.   same      LR bet.   same
    0.50     0.46      0.00      0.46      0.22      0.52      0.00      0.38      0.26
    1.00     0.24      0.00      0.42      0.24      0.28      0.00      0.30      0.30
    1.25     0.20      0.00      0.20      0.22      0.16      0.00      0.22      0.38
    2.00     0.56      0.00      0.36      0.28      0.60      0.00      0.28      0.30
    3.00     0.60      0.00      0.38      0.22      0.70      0.00      0.26      0.24
    4.50     0.90      0.00      0.42      0.08      0.90      0.00      0.26      0.40

Parameters: Σ = (1, 0.5; 0.5, 1), υ = π/4, m = n = 50
To sum up, in the case of normality LDA yields better results than LR; however, for very large sample sizes the results of the two methods become very close.
5.2 The effect of categorisation
The effect of categorisation is studied under the assumption that the explanatory
variables are in fact normally distributed, but measured only discretely. This
means they only have a limited number of values or categories. When the number
of categories is big enough not to disturb the accuracy of the estimates, the
categorisation will not cause any changes in our results. But when the values are
forced into just a few categories, we can expect more discrepancies.
All the simulations in this section are performed in the following way: First,
the values of the indexes for LR and LDA are calculated for the samples from the
normally distributed population. We start from the situation in which LDA performs better, as shown in the previous section (in the tables, these results are denoted with ∞). These samples are then categorised into a certain number of
categories and the indexes are again calculated and compared.
As expected, the effect of the categorisation depends somewhat on the data
structure (the correlation among the variables), but nevertheless, in all the
simulations similar trends can be observed.
Linear discriminant analysis proves to be rather robust. Its prediction power is
not much lower when the values are in 5 or more categories, and it usually
performs better than LR. The story changes when the number of categories is low,
and LR is the only appropriate choice in the binary case.
The effect of categorisation also depends on the significance of the effect of a
certain explanatory variable on the outcome. This is understandable – a non-significant variable will not change the model if transformed. On the other
hand, if two covariates, equally powerful when predicting the result, are
categorised, each of them will have a similar impact on the result.
Figure 4a, 4b and 4c: The basic situations used in the study. The ellipses describe the distributions within the groups.
We have studied the impact of categorisation in two extreme and one
intermediate case. Figures below present the situations that were the basis of our
simulations. Figure 4a presents two uncorrelated explanatory variables with a
similar impact on the outcome. In Figure 4b only one of the variables is
significant, while in Figure 4c the covariates are correlated and both have a
significant but different impact on the outcome variable.
Table 5a summarizes the results of the situation shown in Figure 4c. The upper
part of this table contains the Q indexes for the case in which both covariates are
categorised. It can be seen that the categorisation into only two categories
severely lowers the predictive power of the two variables (the Q index falls close
to zero) and that this effect is greater with LDA. For better clarity, the lower part
of this table concentrates only on the proportion of the simulations in which the
LR performs better (with regard to index Q) and compares these results with the
categorisation of only one variable at a time. It is obvious that LR always
outperforms LDA in the binary case. As discussed above, this effect is greater
when we categorise the more significant variable (x2) and even more so when we
categorise both explanatory variables.
The results summed up in Table 5b are similar. The effect of both x1 and x2 is
similar and therefore the trends are even more comparable. However, logistic regression is not clearly better even in the two-category case, probably because of the large “head start” of LDA. When both covariates are categorised, the advantages of LR are again more obvious.
Table 5a: Simulation results for different number of categories (Figure 4c).

Q index:

    Num. of categ.    LR       LDA      LR better
    2                 0.0712   0.0579   0.88
    3                 0.0891   0.0839   0.78
    4                 0.1084   0.1076   0.58
    5                 0.1267   0.1281   0.46
    10                0.1467   0.1505   0.18
    ∞                 0.1553   0.1595   0.20

The proportion of simulations in which LR performs better (Q index):

    Num. of categ.    x1     x2     Both
    2                 0.58   0.70   0.88
    3                 0.50   0.44   0.78
    4                 0.36   0.36   0.58
    5                 0.30   0.26   0.46
    ∞                 0.20   0.20   0.20

Parameters: Σ = (1, 0.5; 0.5, 1), υ = 0, m = n = 200

Table 5b: Simulation results for different number of categories (Figure 4a).

The proportion of simulations in which LR performs better (Q index):

    Num. of categ.    x1     x2     Both
    2                 0.48   0.40   0.74
    3                 0.28   0.24   0.46
    4                 0.26   0.24   0.32
    5                 0.24   0.26   0.24
    ∞                 0.26   0.26   0.26

Parameters: Σ = (1, 0; 0, 1), υ = π/4, m = n = 200
Table 5c clearly shows the absence of any effect on the result when we
categorise an insignificant variable (x1). The results in the second and the third
column are practically the same, because categorising only the x2 variable is the same as categorising both.
Table 5c: Simulation results for different number of categories (Figure 4b).

The proportion of simulations in which LR performs better (Q index):

    Num. of categ.    x1     x2     Both
    2                 0.20   0.78   0.76
    3                 0.18   0.48   0.48
    4                 0.22   0.34   0.34
    5                 0.20   0.30   0.30
    ∞                 0.26   0.20   0.20

Parameters: Σ = (1, 0; 0, 1), υ = 0, m = n = 200
If the study of the categorisation effect is done by taking smaller samples, the
advantages of LDA are greater (see the previous section). Therefore they do not
tail off even in the case of a small number of categories. Table 5d presents the
results of an identical situation as in the lower part of Table 5a, but the sample
size is shrunk to 100 units.
Table 5d: Simulation results for different number of categories (Figure 4c).

The proportion of simulations in which LR performs better (Q index):

    Num. of categ.    x1     x2     Both
    2                 0.42   0.24   0.54
    3                 0.34   0.32   0.42
    4                 0.24   0.22   0.26
    5                 0.24   0.30   0.26
    ∞                 0.22   0.22   0.20

Parameters: Σ = (1, 0.5; 0.5, 1), υ = 0, m = n = 100
The results in this table tend to vary a bit. Too small a sample size, and at the
same time a small number of outcomes, causes the results to be unreliable. This is
even more obvious when the Mahalanobis distance is increased, because LR often
has problems with convergence.
5.3 The effect of nonnormality
In the case of categorical explanatory variables above, the assumption of normality
has been preserved and only the consequences of discrete measurement have been
studied. Now, we are interested in the robustness of LDA when the normality assumptions are not met, and in how much better LR can be in these cases. As non-normality is a very broad term, we have confined ourselves to transforming normal distributions with a Box-Cox transformation and thus making them skewed.
Again we begin with the three situations shown in Figure 4 and transform them
into what is shown in Figure 5.
Figure 5a, 5b and 5c: Examples of right-skewed distributions (to make the groups more discernible, a part of the convex hull has been drawn for each of them).
Table 6a: Simulation results for different degree of skewness (Figures 4c, 5c).

Q index:

    CS*     LR       LDA      LR better
    -0.5    0.3149   0.2969   0.88
    -0.4    0.2685   0.2610   0.78
    -0.2    0.2262   0.2259   0.60
    -0.1    0.1885   0.1920   0.28
     0.1    0.1269   0.1293   0.44
     0.2    0.1025   0.1007   0.60
     0.4    0.0648   0.0494   0.88
     0.5    0.0505   0.0267   0.96

*Coefficient of skewness
Parameters: Σ = (1, 0.5; 0.5, 1), υ = 0, m = n = 200

The performance of LDA and LR does not depend on the sign of the skewness. We have therefore used the same transformation function to check, at the same time, the impact of the extent of separation of the groups. Right skewness thus also means less separated groups. This is obvious in Table 6a, as the Q index constantly decreases.

To be able to compare LR and LDA solely in terms of skewness, we again focus on the proportion of simulations in which LR does better. Tables 6b, 6c and 6d show the results for all three cases described in Figures 4 and 5. The first two columns always show the results when only one of the two explanatory variables is skewed, while in the third column both are transformed.

The trends we can see are rather similar. When the skewness is small, and the distribution therefore close to normal, LDA performs better. But when the skewness increases, LR becomes more and more consistently better.
Table 6b: Simulation results for different degree of skewness (Figures 4c, 5c).

The proportion of simulations in which LR performs better (Q index):

    CS*     x1     x2     Both
    -0.5    0.68   0.74   0.88
    -0.4    0.50   0.44   0.78
    -0.2    0.38   0.26   0.60
    -0.1    0.24   0.24   0.28
     0.1    0.28   0.28   0.44
     0.2    0.38   0.52   0.60
     0.4    0.42   0.54   0.88
     0.5    0.58   0.64   0.96

*Coefficient of skewness
Parameters: Σ = (1, 0.5; 0.5, 1), υ = 0, m = n = 200
If both explanatory variables are skewed, the highest absolute value of skewness under which LDA is still more appropriate is about ±0.2. We can observe that these boundaries are the same regardless of the separation of the groups.
If only one of the covariates is asymmetric and the other one is left normal, LDA is, as expected, more robust – the interval widens a bit and the trends again
remain similar with positive and negative skewness. The same effect on robustness
can be seen by lowering the sample size as discussed in the previous sections.
Table 6d again shows that transforming insignificant variables has no impact
on the results. However, it is impossible to control the simulations to the extent
where we could say anything exact about the boundaries depending on the
significance of the variables.
Table 6c: Simulation results for different degree of skewness (Figures 4a, 5a).

The proportion of simulations in which LR performs better (Q index):

    CS*     x1     x2     Both
    -0.5    0.72   0.68   0.96
    -0.4    0.54   0.58   0.80
    -0.2    0.38   0.32   0.62
    -0.1    0.16   0.18   0.30
     0.1    0.16   0.20   0.32
     0.2    0.28   0.28   0.56
     0.4    0.50   0.40   0.86
     0.5    0.58   0.50   0.94

*Coefficient of skewness
Parameters: Σ = (1, 0; 0, 1), υ = π/4, m = n = 200
Table 6d: Simulation results for different degree of skewness (Figures 4b, 5b).

The proportion of simulations in which LR performs better (Q index):

    CS*     x1     x2     Both
    -0.5    0.26   0.92   0.92
    -0.4    0.26   0.84   0.84
    -0.2    0.26   0.64   0.64
    -0.1    0.26   0.32   0.32
     0.1    0.26   0.38   0.38
     0.2    0.26   0.62   0.60
     0.4    0.26   0.92   0.92
     0.5    0.26   0.98   0.98

*Coefficient of skewness
Parameters: Σ = (1, 0; 0, 1), υ = 0, m = n = 200