Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study

Metodoloski Zvezki 01/2004; 1(1):143-161.


Two of the most widely used statistical methods for analyzing categorical outcome variables are linear discriminant analysis (LDA) and logistic regression (LR). While both are appropriate for the development of linear classification models, LDA makes more assumptions about the underlying data. Hence, LR is generally assumed to be the more flexible and more robust method when these assumptions are violated. In this paper we consider the problem of choosing between the two methods and set some guidelines for a proper choice. The comparison between the methods is based on several measures of predictive accuracy, and the performance of the methods is studied by simulations. We start with an example where all the assumptions of LDA are satisfied and observe the impact of changes in the sample size, the covariance matrix, the Mahalanobis distance and the direction of the distance between group means. Next, we compare the robustness of the methods to categorisation and non-normality of explanatory variables in a closely controlled way. We show that the results of LDA and LR are close whenever the normality assumptions are not too badly violated, and set some guidelines for recognizing these situations. We discuss the inappropriateness of LDA in all other cases.
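The baseline scenario described in the abstract, where all LDA assumptions hold, can be illustrated with a minimal sketch. This is not the authors' simulation code: the two-feature setup, the identity covariance matrix, the Mahalanobis distance of 2, the sample sizes, the plain gradient-ascent fit for LR, and all function names (simulate, fit_lda, fit_lr, run_simulation) are assumptions made here for illustration only.

```python
import math
import random


def simulate(n_per_group, distance, rng):
    """Two bivariate normal groups, identity covariance (assumed);
    the group means differ by `distance` along the first variable."""
    data = []
    for label, shift in ((0, 0.0), (1, distance)):
        for _ in range(n_per_group):
            data.append((rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0), label))
    return data


def fit_lda(data):
    """Linear discriminant (b0, b1, b2) from group means and pooled covariance."""
    groups = {g: [r for r in data if r[2] == g] for g in (0, 1)}
    means = {g: (sum(r[0] for r in rows) / len(rows),
                 sum(r[1] for r in rows) / len(rows))
             for g, rows in groups.items()}
    s11 = s12 = s22 = 0.0
    for g, rows in groups.items():
        for x1, x2, _ in rows:
            d1, d2 = x1 - means[g][0], x2 - means[g][1]
            s11 += d1 * d1
            s12 += d1 * d2
            s22 += d2 * d2
    dof = len(data) - 2  # pooled within-group degrees of freedom
    s11, s12, s22 = s11 / dof, s12 / dof, s22 / dof
    det = s11 * s22 - s12 * s12
    dm1 = means[1][0] - means[0][0]
    dm2 = means[1][1] - means[0][1]
    # w = S^{-1} (m1 - m0), using the closed-form 2x2 inverse
    w1 = (s22 * dm1 - s12 * dm2) / det
    w2 = (-s12 * dm1 + s11 * dm2) / det
    b0 = (-0.5 * (w1 * (means[0][0] + means[1][0])
                  + w2 * (means[0][1] + means[1][1]))
          + math.log(len(groups[1]) / len(groups[0])))
    return b0, w1, w2


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_lr(data, steps=500, rate=0.5):
    """Logistic regression by gradient ascent on the mean log-likelihood
    (a simple stand-in for the usual iteratively reweighted fit)."""
    b0 = b1 = b2 = 0.0
    n = len(data)
    for _ in range(steps):
        g0 = g1 = g2 = 0.0
        for x1, x2, y in data:
            err = y - sigmoid(b0 + b1 * x1 + b2 * x2)
            g0 += err
            g1 += err * x1
            g2 += err * x2
        b0 += rate * g0 / n
        b1 += rate * g1 / n
        b2 += rate * g2 / n
    return b0, b1, b2


def accuracy(coef, data):
    """Share of cases whose predicted group (linear score > 0) matches the label."""
    b0, b1, b2 = coef
    hits = sum(1 for x1, x2, y in data
               if (b0 + b1 * x1 + b2 * x2 > 0) == (y == 1))
    return hits / len(data)


def run_simulation(seed=0, n_train=200, n_test=2000, distance=2.0):
    """Fit both methods on one training sample; score both on a fresh test sample."""
    rng = random.Random(seed)
    train = simulate(n_train, distance, rng)
    test = simulate(n_test, distance, rng)
    return accuracy(fit_lda(train), test), accuracy(fit_lr(train), test)


if __name__ == "__main__":
    acc_lda, acc_lr = run_simulation()
    print(f"LDA accuracy: {acc_lda:.3f}  LR accuracy: {acc_lr:.3f}")
```

Under these assumed settings the two classifiers should give very similar test accuracy, which is the kind of agreement the abstract reports when normality is not badly violated. Swapping the generator in simulate for a skewed or categorised variable would be one way to probe the robustness question, though the specific designs used in the paper are not given in this excerpt.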



Available from: Maja Pohar Perme, Jul 16, 2014
    • "A comparative study between the linear discriminant analysis and logistic regression may also be found in [10]. "
    ABSTRACT: The study of anthropometric characteristics of different communities plays an important role in design, ergonomics and architecture. Changes in lifestyle, nutrition and ethnic composition of different communities contribute to problems such as the obesity epidemic. The authors performed two experiments. In the first experiment, the authors tried to classify three communities of Assam, India based on anthropometric characteristics using R programming. The authors identified the statistically significant anthropometric characteristics among the Chutia, Mising and Deori communities of Assam. In the second experiment, the authors performed the Cochran-Mantel-Haenszel test to assess the association between the communities and BMI-based nutritional status, stratified by the age of the people studied.
    International Journal of Advanced Computer Science and Applications 01/2015; 6(10). DOI:10.14569/IJACSA.2015.061009 · 1.32 Impact Factor
    • "We performed a comparison between different classification algorithms to find the one that best classifies Q&A pairs from SO. In the classification process, we used seven classification algorithms: Logistic Regression (LR) [19] [11], Naive Bayes (NB) [13], Multilayer Perceptron (MLP) [9], Support Vector Machine (SVM) [13], J4.8 Decision Tree (J4.8) [13] [24], Random Forest (RF) [12] and K-Nearest Neighbors (KNN) [13]. We decided to classify the Q&A pair instead of classifying only the question body because we observed that in some cases the answer body provides relevant information to help to make decision of the Q&A pair category (e.g., differentiate between pairs of How-to-do-it and Debug-Corrective categories). "
    ABSTRACT: Stack Overflow (SO) is a Question and Answer service oriented to support collaboration among developers in order to help them solve their issues related to software development. In SO, developers post questions related to a programming topic and other members of the site can provide answers to help them. The information available on this type of service is also known as "crowd knowledge" and is currently an important trend in supporting activities related to software development and maintenance. We present an approach that makes use of "crowd knowledge" available in SO to recommend information that can assist developers in their activities. This strategy recommends a ranked list of question/answer pairs from SO based on a query (list of terms). The ranking criteria are based on two main aspects: the textual similarity of the pairs with respect to the query (the developer's problem) and the quality of the pairs. Moreover, we developed a classifier to consider only "how-to" posts. We conducted an experiment considering programming problems on three different topics (Swing, Boost and LINQ) widely used by the software development community to evaluate the proposed recommendation strategy. The results have shown that for 77.14% of the assessed activities, at least one recommended pair proved to be useful concerning the target programming problem. Moreover, for all activities, at least one recommended pair had a source code snippet considered reproducible or almost reproducible.
    22nd International Conference on Program Comprehension, Hyderabad, India; 06/2014
    • "For example, at least 25 cases per dependent variable are necessary for accurate hypothesis testing using LR, especially when the dependent variable has many groups (Grimm and Yarnold 1995). There is an ongoing debate (Pohar et al. 2004) about the different sensitivities of LR and DA to violations of their underlying assumptions and, therefore, which method is to be preferred for modeling topographic survey data. For instance, Montgomery et al. (1986) selected the same set of variables as important predictors using both methods and with nearly equal classification values, but the assumption of equal covariance was not met. "
    ABSTRACT: Mapping the occurrence and thickness of layers within a soil profile is a prerequisite for soil characterization. The objective of this paper is to compare the applicability of two statistical methods—discriminant analysis (DA) and logistic regression (LR)—used to calculate the thickness of Quaternary sediments in a formal way and to identify parameters controlling the occurrence of these sediments. The investigations were carried out in southern Bavaria in an area of about 150 ha presenting a large variability in relief and parent material (Tertiary material, Pleistocene loess, colluvial/alluvial sediments). Comparisons between the two statistical methods were carried out with a training dataset and an evaluation dataset. The results show that DA was preferable under the assumptions of normality and equal variance/covariance matrices. The analyses produced models with 80 % and 79 % correctly reclassified assignments and a canonical correlation coefficient of approximately 0.60. From the simulations, it was found (i) that the determining predictors were altitude, slope, and upslope catchment area (partly expressed as topographical wetness index), SAGA wetness index and specific catchment area; and (ii) that a disadvantage of LR was that trial and error was frequently necessary to find the optimal composition of variables. In this study, a hierarchical combination of binary and ordinal LR was used and revealed (iii) that when the probabilities in LR between adjacent categories were similar, the possibility of incorrect calculations increased and (iv) that visual inspections as well as RMSE showed that DA with weighted depths (5 cm-stepwise DA) provided the best prediction accuracy. This information can help improve soil surveys and the predictability of the spatial heterogeneity in landscapes.
    Mathematical Geosciences 04/2014; 46(3):361–376. DOI:10.1007/s11004-013-9486-x · 1.65 Impact Factor