# Which test do I use to estimate the correlation between an independent categorical variable and a dependent continuous variable?

Is it a fair assumption that if you do an ANOVA or Kruskal-Wallis test with an independent categorical variable and a dependent continuous variable that shows no significance, to assume that there is no "correlation" between the two variables? For two continuous variables you can perform a Pearson or Spearman correlation test, but I am not sure which test to use in the above-mentioned situation.

## Popular Answers

Emmanuel Curis· Université René Descartes - Paris 5

Let's say X is your independent categorical variable and Y your dependent, continuous variable.

First of all, strictly speaking, a test does not estimate anything; it just gives a kind of yes-or-no answer, here « there is / there is not an association/correlation between X and Y ».

Second, if X is categorical, speaking of correlation is somewhat abusive, since correlation is defined in terms of means and a categorical variable does not have a mean. Speaking of association is better.

To answer your question specifically: for ANOVA and Kruskal-Wallis, the null hypothesis is that the two variables are independent (ANOVA: Y is Gaussian with the same mean and variance for each X value; KW: Y has the same distribution function for each X value --- not forgetting the tests' assumptions!).

Hence, a significant result proves that Y and X are dependent.

However, a non-significant result may not be enough to prove independence, since failing to reject the null hypothesis does not prove it is true in any way --- in fact, it does not prove anything at all.
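As a minimal sketch of the testing step described above, both tests are available in `scipy.stats`; the three groups of measurements below are invented for illustration:

```python
# Testing association between a categorical X (three groups) and a
# continuous Y with one-way ANOVA and Kruskal-Wallis.
from scipy import stats

# Hypothetical Y measurements for three levels of X
group_a = [5.1, 4.9, 5.3, 5.0]
group_b = [5.2, 5.0, 5.1, 4.8]
group_c = [6.8, 7.1, 6.9, 7.3]

f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)

# A small p-value lets us reject independence;
# a large p-value proves nothing, as discussed above.
```

Note that rejecting the null tells you the variables are associated, but says nothing by itself about how strong the association is.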

To _estimate_ the correlation/association, I think you should first define your question more precisely.

Both ANOVA and the K-W test are basically _tests_; an estimate of the strength of the association can be derived from them, but which estimate is most useful, I think, strongly depends on the exact problem you are working on.

The tetrachoric/polychoric correlation is made for two categorical variables (read the link given in Luis's post), hence it cannot be used here without raising the problem of defining classes for your continuous variable.
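One common way to estimate the strength of the association in a one-way layout, as opposed to merely testing it, is eta-squared, the between-group share of the total sum of squares. A minimal sketch, with made-up data:

```python
# Eta-squared = SS_between / SS_total: proportion of the variance of Y
# "explained" by the categorical X. Illustrative data only.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]

all_y = [y for g in groups for y in g]
grand_mean = sum(all_y) / len(all_y)

ss_total = sum((y - grand_mean) ** 2 for y in all_y)
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups)

eta_squared = ss_between / ss_total
```

Here the group means (2, 3, 7) are well separated relative to the within-group spread, so eta-squared is high.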

## All Answers (138)

Emmanuel Curis· Université René Descartes - Paris 5

> Jochen: I had in fact the same question: seeing the test result as a game and the multiplicity correction as protection against « the more we play, the more we win » (or lose), why not correct for all the tests a researcher will make in his life, to be sure he "never" loses by chance?

I have no definite answer, but maybe a hint for thinking: I think multiplicity correction is well defined in the case of a kind of composite hypothesis, where you take a decision on the whole based on several tests, rejecting the null if any of them is significant. For instance, you ask « is my treatment efficient? » and test this on systolic, diastolic, differential, and mean blood pressure.

If you answer « Yes » if at least one of these four tests is significant, you run the risk of "the more you play, the higher the risk you lose" ==> multiplicity correction.

If you answer « Yes » only if all of these tests are significant, you do not need to correct (in fact, I think you should correct beta for multiplicity in the power computation, but I have never seen that done, so I am not sure).

Now, going back to your example about the whole literature, or mine about a career, there is no real link between all the hypotheses tested, unless you ask something like « did the researcher sometimes get wrong results », so multiplicity correction is not required. It could be done, however, but with so much power loss that we would be sure never to conclude anything...
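The "reject if any test is significant" case above can be sketched with a plain Bonferroni adjustment; the four p-values are invented for illustration:

```python
# Bonferroni correction over the four blood-pressure tests mentioned
# above: multiply each raw p-value by the number of tests (capped at 1).
raw_p = [0.03, 0.20, 0.04, 0.50]   # systolic, diastolic, differential, mean

adjusted = [min(p * len(raw_p), 1.0) for p in raw_p]

# The "reject if any test is significant" rule is applied to the
# adjusted values, protecting against "the more you play" inflation.
any_significant = any(p < 0.05 for p in adjusted)
```

Note that two raw p-values are below 0.05, yet after correction none is: exactly the protection being discussed.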

As for the test: I agree. I would add that the test indeed gives a yes-or-no answer, but it is useless without interpretation in context (including the choice of alpha and beta; I am not convinced that the conventional 0.05 represents more than a way to avoid thinking about this part of the methodology and the difficulties we are discussing...).

Rameswar Nag· Utkal University

Thank you.

Nada El Osta· Saint Joseph University, Lebanon

If the categorical variable has more than two categories, you have to compare several means, and you use ANOVA (a parametric test, after verifying the normality assumptions) or Kruskal-Wallis (a non-parametric test) if the assumption of normality is not met.

Nada El Osta· Saint Joseph University, Lebanon

But if the categorical variable has more than two categories, you must use only ANOVA or Kruskal-Wallis.

Giovanni Bubici· National Research Council

Pearson's correlation is adequate for continuous variables, whereas Spearman's and Kendall's correlations are adequate for categorical (ordinal) variables. You could use Spearman's correlation by transforming your continuous variable into an ordinal one (or into ranks).

Instead, for non-ordinal categorical variables, association tests must be performed, not correlation tests.

Giovanni Bubici· National Research Council

I'm sorry, you're right. Can we say that Spearman's and Kendall's correlations are adequate for variables measured at least on ordinal scales?

Giovanni Bubici· National Research Council

Concerning the variable transformation that I mentioned: can a correlation be computed between a continuous and an ordinal variable? Shouldn't they have almost the same number of categories?

Emmanuel Curis· Université René Descartes - Paris 5

No problem with that, at least from a theoretical point of view. You do not need the same number of categories for the two variables; you just need "complete" pairs of measurements and to be able to sort both variables.

For instance, if Y is continuous and X is dichotomous with A < B, and you have the sample set { (A, 0.145), (B, 0.15), (A, 0.13), (B, 0.143), (A, 0.12) }, you can compute ranks for X and Y and hence obtain Spearman's rank correlation coefficient or Kendall's count of discordant pairs.

Here, neglecting the problem of ties, the rank pairs would be { (1, 4), (4, 5), (2, 2), (5, 3), (3, 1) }.

However, as Sergio mentioned, with at least one ordinal variable with few categories, the problem of ties appears, and it is not always easy to solve, especially for small sample sizes where asymptotic normality does not hold. Sergio's approach is the easiest one, and probably the most used, but not the only one.
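As a sketch, the toy sample above can be run through `scipy.stats.spearmanr`, which handles the ties in X by the average-rank convention (each A gets rank 2, each B rank 4.5) rather than by neglecting them:

```python
# Spearman's rank correlation between a dichotomous X (A < B, coded 0/1)
# and a continuous Y, using the sample set from the discussion above.
from scipy import stats

x = [0, 1, 0, 1, 0]                        # A = 0, B = 1
y = [0.145, 0.15, 0.13, 0.143, 0.12]

rho, p_value = stats.spearmanr(x, y)
```

With average ranks for the tied X values, rho comes out around 0.58: a positive association (the B values tend to be larger), though with only five pairs the p-value carries little weight.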

Jón Einar Jónsson· University of Iceland

http://www.amazon.com/Introduction-Categorical-Analysis-Probability-Statistics/dp/0471226181/ref=sr_1_3?ie=UTF8&qid=1348663996&sr=8-3&keywords=alan+Agresti

I took a course in graduate school based around this book, and I found the examples in it particularly useful. The text was fairly accessible.

Neal Van Eck· University of Michigan

MCA examines the relationships between several categorical independent variables and a single dependent variable using an additive model. The technique handles predictors with no better than nominal measurement, and interrelationships of any form among predictors or between a predictor and the dependent variable. The dependent variable should be an interval-scaled variable without extreme skewness, or a dichotomous variable whose two frequencies are not extremely unequal. MCA determines the effects of each predictor before and after adjustment for its inter-correlations with the other predictors in the analysis. It also provides information about the bivariate and multivariate relationships between the predictors and the dependent variable.
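The additive model behind MCA can be sketched as least-squares regression on dummy-coded categorical predictors; the two binary predictors and the exactly additive response below are invented for illustration:

```python
# MCA-style additive model: least squares on dummy-coded predictors.
import numpy as np

# Two binary categorical predictors A and B, and a response that is
# exactly additive: y = 1 + 2*A + 3*B (made-up effects).
a = np.array([0, 0, 1, 1])
b = np.array([0, 1, 0, 1])
y = 1 + 2 * a + 3 * b

design = np.column_stack([np.ones(4), a, b])   # intercept + dummies
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
# coef recovers the intercept and the two additive category effects
```

With real data the fit would not be exact, and the "before and after adjustment" comparison in MCA corresponds to fitting each predictor alone versus jointly.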

Irfan Yurdabakan· Dokuz Eylul University

According to the data you describe, you have to compute the point-biserial correlation.

Take it easy.
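The point-biserial correlation is Pearson's r with the dichotomous variable coded 0/1, and scipy provides it directly; as a sketch, reusing the toy sample from earlier in the thread:

```python
# Point-biserial correlation for a dichotomous X and a continuous Y.
from scipy import stats

x = [0, 1, 0, 1, 0]                        # A = 0, B = 1
y = [0.145, 0.15, 0.13, 0.143, 0.12]

r_pb, p_value = stats.pointbiserialr(x, y)
# Equivalent to stats.pearsonr(x, y) with this 0/1 coding
```

This only applies when X has exactly two categories; with more, you are back to ANOVA/Kruskal-Wallis or an effect-size measure such as eta-squared.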

Mary Jannausch· University of Michigan

"Is it a fair assumption that if you do an Anova or Kruskal Wallis test with an independent categorical variable and a dependent continuous variable that shows no significance, to assume that there is no "correlation" between the two variables?"

No, it is not at all reasonable to assume that the true correlation (or, more broadly, covariance) is 0. Cov(X, Y) = 0 is necessary but not sufficient for independence between X and Y. It is possible for Cov(X, Y) and Corr(X, Y) to be zero while X and Y are not independent.

This is true regardless of which statistical tests you use for inference. You've said that one variable is independent and the other is dependent. Dependent in what sense? What are X and Y in this case?
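The classic illustration of zero covariance without independence is a symmetric X with Y a deterministic function of it; a minimal sketch:

```python
# Cov(X, Y) = 0 does not imply independence: here Y is completely
# determined by X, yet their sample covariance is exactly zero.
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2                                 # fully dependent on x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
```

The symmetry of x around zero cancels the positive and negative products, so the covariance vanishes even though knowing x fixes y entirely.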
