Question

# Which test do I use to estimate the correlation between an independent categorical variable and a dependent continuous variable?

Is it fair to assume that if an ANOVA or Kruskal-Wallis test with an independent categorical variable and a dependent continuous variable shows no significance, there is no "correlation" between the two variables? For two continuous variables you can perform a Pearson or Spearman correlation test, but I am not sure which test to use in the situation described above.
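(For concreteness, here is a hedged sketch of the one-way ANOVA F statistic computed by hand on made-up data, plus eta-squared, a common "correlation-like" effect size for a categorical predictor and continuous outcome; in practice one would likely use `scipy.stats.f_oneway` or `scipy.stats.kruskal` instead.)

```python
# Hypothetical worked example: a categorical IV with three levels and a
# continuous DV. Data are invented for illustration only.
groups = {
    "low":    [4.1, 3.8, 4.5, 4.0, 3.9],
    "medium": [5.2, 5.8, 5.5, 5.1, 5.6],
    "high":   [6.9, 7.2, 6.5, 7.0, 6.8],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = sum(all_values) / len(all_values)

# Between-group, total, and within-group sums of squares
ss_between = sum(len(g) * ((sum(g) / len(g)) - grand_mean) ** 2
                 for g in groups.values())
ss_total = sum((x - grand_mean) ** 2 for x in all_values)
ss_within = ss_total - ss_between

k = len(groups)        # number of groups
n = len(all_values)    # total sample size
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

# Eta-squared plays the role of a squared "correlation" between the
# categorical predictor and the continuous outcome (analogous to R^2).
eta_squared = ss_between / ss_total
```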


## All Answers (138)

• Jochen Wilhelm · Justus-Liebig-Universität Gießen
I think the weather forecast is better suited as an example to illustrate the problem:

Based on previous observations (data) we can build a probability distribution which reflects our current knowledge about the weather tomorrow. Such probability assignments cannot be interpreted in a frequentist way, because there is no such thing as a population of similar "days like the day tomorrow". This principle is very general, and its application to defined populations of similar entities is just a special case; in this special case a frequentist interpretation of the probabilities is possible, too.
• Jeffrey Welge · University of Cincinnati
Hi Jochen, maybe we will have to agree to disagree about causality -- if I understand you correctly, and I may have misunderstood -- you assert that it is certainly true that e.g., had I taken a different route to work this morning, the probability of rain in Tokyo at this moment would be different (a priori the causal effect cannot be zero), but since the effect is probably not very large it is uninteresting. I assert that [with high prior probability, as I subscribe to Bayesianism] the causal effect is precisely zero, and if it could be shown to be non-zero to any degree, it would be interesting since I am unaware of any testable existing theory that could explain such a causal connection.

This is not just philosophical hair-splitting [though it is fun :) ] -- it is now common and in my opinion proper to state that any observed effect should not only be shown to be "statistically" significant but also "scientifically" significant ("considerable" size as you say). We are in full agreement on this point. But what does the latter mean? It is totally context-dependent: If it could be shown conclusively that a new drug has the causal effect of reducing your systolic blood pressure by 1mmHg, we might consider this too small to be relevant. The clinical impact is trivial, and there would be plausible scientific theories under which such an effect size is reasonable. If equally conclusive evidence is obtained that I can produce the same effect on you by sheer willpower, it is very relevant: Though the clinical impact is still trivial, what testable theory in existence can explain even a tiny effect of that type?

Per Bayes, as most of us would consider existence of a small psychic effect much less plausible a priori than a small drug effect, finding that both observations are equally unlikely under the corresponding null hypotheses would not lead us to believe that both effects are equally likely to exist.
• Jochen Wilhelm · Justus-Liebig-Universität Gießen
For my example it was not important whether or not your route to work changes the weather tomorrow. The point is: if a frequentist gives a test (or a confidence interval, in the frequentist interpretation) for the weather tomorrow based on a "sample" of previous days, what is the meaning of this? There is not even a theoretical way to check the conclusions. To do so, one would need to see how the weather behaves in the future under circumstances similar to the sample used for the forecast. This would take ages, and during this time even your way to work would have some measurable impact, rendering the initial definition of the "population" meaningless.

It's fun to discuss this - although my thoughts are still not too clear on this topic!
• Jeffrey Welge · University of Cincinnati
Thank you both for such an interesting discussion! I admit that my characterization of the "strict" frequentist doesn't reflect the working behavior of many real frequentists, at least those I know.

R. A. Fisher tried hard to define populations for which frequency interpretations applied, and from my limited understanding his thinking was not very different from that of the Bayesian de Finetti: the population of interest is one within which, in Fisher's words, "no recognizable and relevant subset exists". De Finetti would call the members of such a population "exchangeable" -- though we do not assume them all to be the same with respect to our event of interest, we are not able to specify the factors that make them different (relevant subsets are believed to exist, but they are not recognizable). Or we may be able to distinguish subsets but not believe the subsets are relevant (in the weather example, we can distinguish Mondays from Fridays, but we have no reason to believe the distribution of weather patterns differs between these two days, so the distinction would be judged irrelevant).

I think our examples differ from each other and from a classic frequentist situation such as throwing dice only in the ease with which we can suggest credibly relevant and recognizable subsets.

Dice -- will I throw a six on the next toss? -- This single toss is a unique event. But I have no basis to suspect that it is different from many other throws I have observed, and the proportion of sixes in those throws was 1/6, so I assume this throw is exchangeable with those, and state the probability of the event as 1/6.

Weather -- will it rain in Tokyo tomorrow? -- Also a unique event. But I may be able to define "days like the day tomorrow" as all the historical occasions where the covariates of my model took on the same values as they do now. Since the model represents the sum of my knowledge about what is relevant, I cannot further distinguish these days from each other or from tomorrow, so I state that the probability of rain is given by the proportion of rainy days in that set. I consider my route to work this morning to be an irrelevant factor :)

US Presidential election -- will Obama win? -- Prior elections are considered exchangeable with this one if all factors judged relevant are the same. I can distinguish many differences between this election and the others, but I am too ignorant to know whether or how most of these are relevant (such as that the incumbent recently supported controversial health care reforms). I recognize as relevant only the subset of prior elections where an incumbent was seeking re-election, and estimate pr(Obama wins) as the proportion of times an incumbent was re-elected. You, being more sophisticated, have identified that within this subset, incumbents are less likely to be re-elected when the economy is doing poorly, consider the current election only exchangeable with that smaller subset, and state a different pr...
[more]
• Jochen Wilhelm · Justus-Liebig-Universität Gießen
You are right, these philosophical aspects we are discussing here are practically irrelevant for the daily work of most statisticians (or other scientists performing statistical analyses). The practice is much more pragmatic. In my experience it goes something like: "if the reviewer wants some significance stars, we will do some analysis in order to sanctify such stars". Practice...

Your previous comment sounds pretty much like a Bayesian standpoint to me. You talk about knowledge that we can have from observed data. To my (sadly very limited) knowledge of these philosophies, it is the essence of the frequentist standpoint that the result of a particular analysis can in principle not be known to be right or wrong. The p-value refers to the null hypothesis assumed to be true, not to the real world (where this null may be true - to follow your point of view - but it is not known whether this is the case).

For the dice, let's take an icosahedron (a die with 20 faces). Knowing nothing else except that all 20 faces ("1", "2", ..., "20") are distinguishable, P(tossing "3") = 0.05. If we think "3" is some kind of special face, the question of whether this die preferentially shows exactly this face is considered interesting. So do the experiment and toss the die. If it really shows "3", we have a (borderline) significant result at the conventional level (go and publish!). The frequentist states that this particular result does not tell us anything about the die's propensity to show "3". He says: "if the die truly does not prefer '3' over the remaining 19 faces, and we repeated this experiment very often, the relative frequency of '3' would converge to 0.05." He explicitly does not state that this die prefers face "3".
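The frequentist's long-run claim can be checked with a quick simulation (a sketch; the seed and number of tosses are arbitrary choices of mine):

```python
# Simulate a fair 20-sided die and watch the relative frequency of "3"
# settle near the probability 1/20 = 0.05 over many tosses.
import random

random.seed(42)
n_tosses = 100_000
count_threes = sum(1 for _ in range(n_tosses) if random.randint(1, 20) == 3)
rel_freq = count_threes / n_tosses

# A single "3" on one toss has p = 0.05 under the fair-die null --
# borderline "significant", yet it says little about this particular die.
```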

Another example with a coin (heads or tails). Knowing nothing else, P(heads) = 0.5. But if you know that the coin has the same symbol on both sides, while not knowing whether it is heads or tails, the frequentist concept is no longer applicable, since P(heads) is a property of the coin (or of its behaviour in a long series of tosses) and must be either 0 or 1 - which cannot be decided at this stage (P is the limiting relative frequency). The Bayesian still takes P(heads) = 0.5, because he still cannot prefer one outcome over the other (P is a representation of current knowledge).

Exchangeability was a good point. You state that exchangeability is defined by our knowledge, and I think the frequentist must see exchangeability as a true given property. The frequentist presents P as an estimate of the limiting relative frequency, and he is absolutely right (and efficient and all good things) when this is in fact the case (even if it is just a [good] approximation). Under the premise that we really have incomplete knowledge and something is missing or wrongly considered in our models, our CURRENT knowledge/data may not provide any hint that the model is suboptimal, but if P is taken as the limiting freque...
[more]
• Emmanuel Curis · Université René Descartes - Paris 5
Hi,

> Jochen: I still do not agree that "there is no frequentist interpretation" for the weather forecast, but maybe I do not have the same interpretation of "frequentist"... For me, "frequentist" means a framework in which a probabilized space of events is defined with given properties and random variables are set in it. Then experiments are performed again and again, and because of that the final space of events is defined as a Cartesian product of all of them, on which one must then define a new probability.
A "strict frequentist", in the sense you seem to use --- correct me if I am wrong --- would assume an independent and identical distribution for every single set ("day") and take the product. But other approaches are possible that still fit the previous framework without being Bayesian - so they fit my own definition of "frequentist" (maybe I am the only one to use it?).
The Bayesian approach adds the use of other events and Bayes' theorem to this framework.
Now, going back to the weather forecast, assume you have results from several days and want to forecast for tomorrow. The "strict" frequentist approach assumes independence between days, so perhaps, as you say, it is not very realistic.
But if you use correlations, and tools like time series, then you can bring in knowledge of previous days to better forecast tomorrow's weather.
And as a next step, you can develop a physical or empirical equation, a "model", from various variables, fit it to past data (least squares, for instance) and use it to predict tomorrow's weather with a given probability/uncertainty, still using previous information and knowledge, both in the choice of the equation's form and in the fitting process.
All these approaches are not Bayesian, since Bayes' theorem is not involved. For me, they are "frequentist" since they are based on the basic probability framework --- the term "frequentist" being something of a historical usage.

By the way, I think this idea of a population from which a sample is drawn (which seems to be the basis of the "classical frequentist" approach) is very useful for introducing statistics, but I am not quite sure it is really convenient for more advanced practice. Maybe I am too theoretical, but I think that simply defining random variables on suitable event spaces, and stating the assumptions precisely, is more general. But of course after that comes the problem of knowing the limitations of the predictions/conclusions and how far they can be extended. That last point seems outside the scope of the probabilistic tools used in statistics and involves human knowledge.

And for the coin experiments: if the two sides are identical, how do you define heads and tails? So the probability of getting one or the other seems to me undefined, and neither the frequentist nor the Bayesian can handle undefined objects...

> Jeffrey: Referring to the butterfly effect, changing your way to work leads to different air movements, so it may slightly change Tokyo's weather ;) But I agree with your point: to cu...
[more]
• Jeffrey Welge · University of Cincinnati
We might look for resolution to the so-called "Likelihood" school, which has sometimes been called the "third way" of statistical philosophy. It is a bridge between the frequency and Bayesian interpretations, and at least for simple problems it is often, IMHO, the most satisfying of all solutions. Consider the distribution of possible data D as a function of an unknown parameter or set of competing hypotheses H: P(D|H) plays a role for both schools. Frequentists condition on a particular value H=h0 and integrate P(D|h0) over all values as or more extreme than the observed data to obtain the p-value. Bayesians condition on D=d (what actually occurred) and compute P(H|D=d) by combination with the prior distribution P(H) via Bayes' theorem.

If there are at least two competing hypotheses (H=h0 and H=h1), the likelihood ratio measures the support for h1 relative to h0 in light of the data:

LR = P(D=d|H=h1) / P(D=d|H=h0)

This ratio does not depend on any assumption about the prior odds P(h1) / P(h0), so it is not Bayesian, though the Bayesian combines these quantities to compute the posterior odds:

P(h1|D=d) / P(h0|D=d)

Only when P(h0) / P(h1) = 1 will the LR equal the posterior odds.

As Neyman and Pearson also require (but Fisher does not), at least two specific point hypotheses must be given and the probability distribution of the data must be known under each of them: For Jochen’s icosahedron, if we can state a specific alternative h1 for the “specialness” of the “3” face, say that the die is loaded such that “3” will always appear, the relative evidence for these hypotheses provided by observing a single event “3” is:

LR = P(“3”|h1) / P(“3”|h0)
= 1/0.05
= 20

And generic guidelines for this strength of evidence have been suggested. The LR is a special case of the Bayes factor, which is central to the most familiar Bayesian method of model selection / hypothesis testing (there is no real distinction between the terms). Harold Jeffreys, who is strongly associated with the concept of "non-informative" priors, gave generic guidelines for interpreting Bayes factors -- a value of 20 would be considered moderately strong, but as always such qualitative labels must be adjusted to the particular context. Of course, observing anything other than "3" would definitely disprove h1 --- and if h0 were the only alternative to h1, h0 would similarly be proven true. If "special" is less extreme, e.g., P("3"|h1) = 0.10, then:

LR = P(“3”|h1) / P(“3”|h0)
= 0.10/0.05
= 2

which is rather weak: one roll doesn't provide much evidence to discriminate between these hypotheses, because although a "3" is improbable under h0, it is not very probable under h1 either! Therefore, improbability of the data under h0 is not necessarily sufficient to reject h0, because the alternative doesn't explain the data much better.
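The two likelihood ratios above are just arithmetic, but a small sketch makes the contrast explicit (the values are taken from the discussion above):

```python
# Likelihood ratios for observing a single "3" on the icosahedron,
# comparing specific simple alternatives h1 against the fair-die null h0.
p_data_given_h0 = 1 / 20            # fair die: P("3") = 0.05

# h1: the die always shows "3"
lr_extreme = 1.0 / p_data_given_h0  # = 20, moderately strong evidence

# h1: the die shows "3" with probability 0.10
lr_mild = 0.10 / p_data_given_h0    # = 2, weak evidence
```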

When the hypotheses are quite specific like this (so-called “simple vs. simple”, as they may be in problems of genetic linkag...
[more]
• Emmanuel Curis · Université René Descartes - Paris 5
Really an interesting discussion...

I may be wrong, but isn't this likelihood ratio idea the basis of the family of tests called likelihood ratio tests or LR tests, as opposed to for instance Wald tests?

And if so, I think I've read that as long as the usual linear model is assumed - that is, linear regression, analysis of variance and so on - they in fact give the same thing as the "usual" tests like Student's t test, the F test... But in that case, that would mean that the likelihood under the composite H1 is maximized by plugging in the experimentally observed value of the unknown parameter, which gives the maximal likelihood... so something like the first alternative you gave, with only a single computation. Of course, it works because H0 is a simple hypothesis.

However, results change when multivariate analyses or generalized models are involved, and probably in other contexts too.

With the icosahedron example, and only one throw, we are of course completely outside the linear model framework. But if I understood correctly, that would give something like
H0: the die is fair, so P(X = 3) = 1/20
H1: the die is unfair, P(X = 3) = π, unknown

Now, we throw the die and obtain a "3". The value of π giving the highest likelihood under H1 is π = 1, so the LR test gives a ratio of 1/(1/20) = 20, as in your simple-hypothesis example, hence leading to the same conclusion.

That also assumes we are more interested in proving that H0 is false than in really knowing the value of π... which is coherent with a test.
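A sketch of this MLE-based (generalized) likelihood ratio under a binomial model; the function name and the binomial framing are mine, for illustration:

```python
# Generalized likelihood ratio: under H1 the unknown P(X = 3) = pi is
# replaced by its maximum-likelihood estimate pi_hat = (count of 3s) / n.
def mle_likelihood_ratio(n_threes, n_tosses, p_null=1 / 20):
    """LR of the best-fitting H1 (pi = MLE) against the fair-die H0."""
    pi_hat = n_threes / n_tosses  # maximum-likelihood estimate of pi

    def binom_kernel(p):
        # Likelihood kernel; the binomial coefficient cancels in the ratio.
        return (p ** n_threes) * ((1 - p) ** (n_tosses - n_threes))

    return binom_kernel(pi_hat) / binom_kernel(p_null)

# One toss, one "3": pi_hat = 1, so the ratio is 1 / (1/20) = 20.
lr_one_toss = mle_likelihood_ratio(1, 1)
```

Note that, as discussed below, this ratio can never fall below 1: the MLE alternative always fits at least as well as the null.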
• Jochen Wilhelm · Justus-Liebig-Universität Gießen
The likelihood is a measure of the strength of support for a hypothesis. The ratio of two likelihoods is the relative strength of support for one hypothesis over another. OK, t-tests and related procedures are all kinds of likelihood-ratio tests. What seems to be overlooked is that this "strength" relates to a "force" that can change our knowledge or beliefs more or less in one direction or another. The knowledge or belief corresponds, in this analogy, to the "impulse" (momentum). Looking at the forces alone does not tell us where we are. Sure, we can compare such forces from different data sets and experiments, but we must have some idea of the impulse to be able to judge what these forces will change.

Example: If we don't have any idea about an association between smoking and lung cancer, a study showing such an association with p<0.05 seems worth taking as "informative"; it will steer further research or even public behaviour in a considerably different direction. But what if another study showed that the application of RNA molecules to cells is associated with a change in the genomic sequence of the cells, also with p<0.05? We have strong beliefs that RNA does not influence the DNA sequence. There is a paradigm that (sequence) information is transferred from DNA to RNA and never the other way around. My guess is that almost nobody would take this study seriously. It would not even be published, because I think the reviewers would ask for much more "evidence", different experiments, and so on, to be convinced that this paper is worth publishing. So in these two cases the same "statistical evidence against H0" gets entirely different responses. In terms of the analogy described above, forces of similar strength were acting on different impulses. In our everyday experience we naturally consider the "impulses" that are already there, and the scientific community, reviewers and editors do so too. I wonder why this is not formally included in the presentation of results. From this perspective, Jeffrey, I do not see why Bayesians should have a problem with defining priors. This is - to my limited knowledge - the only known formal way to really take the "impulse" into account.

Another analogy: Giving a p-value (or an LR) is like naming an amount of money while failing to specify the currency and value of the money.
• Jochen Wilhelm · Justus-Liebig-Universität Gießen
@Emmanuel: You said: "For me, 'frequentist' means a framework in which a probabilized space of events is defined with given properties and random variables are set in it." This is exactly the problem: this space of events cannot always be defined in a reasonable way, and experiments cannot always be repeated under equal (i.e., similarly variable) conditions.
• Emmanuel Curis · Université René Descartes - Paris 5
> Jochen: Well, in that case the Bayesian framework should not apply either, since it is also based on probabilities, hence one must have a probabilized space of events...
Note that in my framework I did not assume repetitions under equal conditions... only that the event space and random variables can be assumed to exist.
But to be honest, I do not see an example in which you cannot define a space of events and a random variable to model a single experiment's result, as long as there is something random (or considered to be so) in your experiment.
• Emmanuel Curis · Université René Descartes - Paris 5
> Jochen: transfer from RNA to DNA does exist; it is used by the so-called "retroviruses" such as HIV. But that does not change the spirit of the example, so I will continue as if it did not.
I think these examples show the difficulty of interpreting statistical results that are, basically, just answering a question that is not really interesting. I mean, for your RNA example, "p < 0.05" only means that, for instance, there is a significant difference between experiments with and without RNA, and this is the only indisputable thing (assuming a suitable statistical method was used, but that's another problem).
But after that, does it mean that the difference is because of the RNA's absence/presence, or something else? This is only interpretation, using other knowledge about the field, the researcher's background...
In a way, interpretation in this sense is much like the Bayesian idea of including exterior, prior knowledge in the experiment: because of this knowledge, you will believe, or not, that your difference comes from the RNA's absence/presence rather than from something else.
But now, how would you formalize a prior in a Bayesian approach to encode the background belief that "RNA does not change the DNA sequence"? If you really believe in it, why run an experiment about it at all? But if you use a very strong prior in its favor, how could you ever prove the contrary? And how objective is it to set up an analysis that is somehow biased towards what you want?

To be precise, I have nothing against Bayesian methods, though I must confess I seldom use them; I am convinced they can be very useful, but when uninformative priors are used, it is difficult to say whether they incorporate what we already know about the subject any better than a careful a posteriori interpretation of non-Bayesian results.

And also, this example shows that even though nobody believes in it, it may be true.

I think maybe the idea in statistics for the sciences is to show that, OK, the experiment is not in disagreement with my theory/idea/hypothesis, so now it's up to you, reader/referee, to prove that I am wrong ;)
• Jeffrey Welge · University of Cincinnati
No, the Bayesian framework does not depend on a defined sample space. Bayesians condition on the observed data, so the probability of data that did not occur (e.g., data "more extreme" than what was observed) is not relevant -- dependence of inference about unknown parameters on such factors is a violation of the Likelihood Principle. Since, as Jochen stated, the space isn't (or cannot always be) defined, this is an important consideration.

For the icosahedron, we have been assuming that it was decided in advance to roll the die just N=1 time, and a "3" occurred on that roll. But suppose we had instead decided to roll the die as many times as necessary to get exactly m=1 "3", and got it on the first roll. Though the data are the same, the relevant sample space is not -- in the first case, the experiment was guaranteed to end after one roll, and the probability distribution was binomial with N=1. In the second case, the experiment could have gone on forever, and the probability distribution is not the same (negative binomial rather than binomial) -- but the likelihood is the same in both cases, being proportional to pi^m (1-pi)^(N-m).

Thus frequentists will in general get different p-values, etc., for the same data depending on the sample space they assume to have applied to the experiment -- in sequential trials, for example, this can get quite complicated and controversial.
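The binomial vs. negative-binomial point can be made concrete. The numbers below (n = 12 rolls containing m = 3 "3"s) are hypothetical, chosen only to show that the two likelihood functions differ by a constant that does not involve the parameter:

```python
# Two stopping rules that produce the same data give proportional
# likelihood functions; only the combinatorial constant differs.
from math import comb

m, n = 3, 12  # hypothetical: three "3"s observed in twelve rolls

def binomial_lik(p):
    # Fixed number of rolls n; count the "3"s.
    return comb(n, m) * p ** m * (1 - p) ** (n - m)

def neg_binomial_lik(p):
    # Roll until the m-th "3" appears; it happens to appear on roll n.
    return comb(n - 1, m - 1) * p ** m * (1 - p) ** (n - m)

# The ratio of the two likelihoods is a constant not involving p, so
# inference obeying the likelihood principle is identical under both
# designs; tail-area p-values, which sum over unobserved outcomes, are not.
ratio_at_p1 = binomial_lik(0.1) / neg_binomial_lik(0.1)
ratio_at_p2 = binomial_lik(0.5) / neg_binomial_lik(0.5)
```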
• Jeffrey Welge · University of Cincinnati
The prior can be as informative as you like, as long as you are extremely careful about going so far as to give some hypothesis zero prior probability -- in this case, no data could ever change your opinion that that hypothesis is impossible. Otherwise, a strong prior that is "wrong" will eventually be overwhelmed by data bearing on that parameter -- it just might take a lot of data, and that is perfectly appropriate if we are very skeptical about the hypothesis.

It is not necessarily a problem for me to set a prior -- if it reflects my belief (which may or may not be based wholly or in part on genuine objective information that differs from what you possess), then that is the truth for me, period. The "problem" is when everyone does not agree on the prior. Then we can all reach different conclusions from the same data -- however, we WILL all agree on the strength of the evidence contained in the data, as long as we agree on the form of the likelihood function.

An interesting suggestion for setting a sample size is to generate two priors: An "optimistic" one, and a "pessimistic" one, so as to represent opposing opinions about the plausibility of the effect to be investigated. Evidence will be collected until these divergent views have been forced into substantial consensus (similar posterior distributions, with "similar" to be agreed upon in advance by the opposing parties).
• Jeffrey Welge · University of Cincinnati
"If you really believe in it, why make an experiment about it, by the way?"

The reason to do an experiment is because not everyone else believes it!

If you undertake a study of some association and are being entirely honest about your prior beliefs, you would likely indicate that the probability of the null hypothesis (defined, if you prefer, as a small interval rather than exactly zero) is small, or at least less than 0.50 -- in most cases, experimenters do the study because they believe the association exists. But not everyone believes it -- we must convince our colleagues, whose obligation is to be skeptical. And despite what we truly believe, the ideal is for the investigators themselves to set aside that belief and assume a position of "objectivity", or even outright skepticism.

Why do the investigators I work with sometimes go to pains to describe a "marginal" result, e.g., p=0.12, as a "trend"? Because they believed a priori that the effect exists, and while the data have not, by the conventional standard, sufficiently contradicted the hypothesis that the effect is absent, they have also not contradicted the hypothesis that it exists. In fact, such a result often legitimately does support the existence of the effect (e.g., by the LR between a "minimal" effect and zero), so my colleague is not wrong to feel somewhat encouraged by the result.

Here I would point out that Emmanuel is correct about the LR test based on the MLE as the alternative hypothesis -- naturally, the null is disadvantaged in this situation, since its LR can at best equal one and can never exceed it -- there is virtually always some evidence against the null in the two-sided testing situation. Therefore, we require that the LR be "sufficiently large" to reject the null. If an externally specified alternative were weighed against the null (rather than the MLE), the data could certainly favor the null, perhaps quite strongly. Part of my dissatisfaction with "power analysis" is that the alternative used in the calculation is often ignored altogether in the analysis. The N-P alternative h1 = x is replaced with h1 ≠ 0. Then data that are closer to 0 than to x may be sufficient to reject h0. Then we have the irritating situation where we must say "though the result is statistically significant, it may not be practically significant" -- if we are capable of stating the limit of practical significance, why not set up the test so that these conclusions must be in alignment?

Neyman & Pearson, in the "simple-v-simple" hypothesis testing framework, showed how to find tests that minimize Type II error after fixing Type I error to a known maximum. They might have elected instead to minimize the sum of these error rates, which would lead one to choose the hypothesis that was better supported by the LR (by any amount).
• Jochen Wilhelm · Justus-Liebig-Universität Gießen
It is a result of the general use of Neyman-Pearson test theory (or a kind of impossible conglomerate of Fisher and N-P, as it is often taught in books and (non-math) stat courses) that data have to generate a final dichotomous decision (significant or not; effect or not; yes or no). [This is, like anything else I write, my personal point of view.]
In fact, there is no way to get precise answers from fuzzy data. Sharp distinctions are introduced by arbitrary decision rules. But the appropriateness of such rules again depends on our beliefs (for me a p<0.05 may be convincing, but you may require a p<0.0001 to be convinced). The problem here is that we still think in yes/no categories. I think many philosophical problems of this kind will vanish when conviction is seen as a continuous quantity, probably pretty much like posterior odds. Am I a Bayesian? (Btw, I don't care.)

Another point: the multiple testing problem becomes really strange when not a particular experiment but the entire scientific research enterprise is considered. If one claims to add to scientific knowledge, the p-value would have to be corrected against the very many other p-values published (not against *all* of them, for sure, but the family of tests would inflate considerably). This is a rough thought. Maybe you can show me where I'm wrong with this...
• Jeffrey Welge · University of Cincinnati
"Another analogy: Giving a p-value (or a LR) is like naming an amount of money but missing to specify the currency and value of the money."

I think I will disagree with that: The LR (but not the p-value) does always represent the same exchange rate. For any prior odds on the hypotheses (the "impulse", if I understand you), the LR dictates how they must change in the face of the evidence. If I consider h0 and h1 equally likely a priori, while you are skeptical of h1 and consider h0 19 times more likely, but we both agree on the validity of an experiment and the likelihood function for the data it generates, then a LR of 19 in favor of h1 dictates that I must have posterior probability=0.95 that h1 is true, while for you the odds are now even.
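The arithmetic behind this "exchange rate" claim, as a sketch (the function name is mine):

```python
# Posterior odds = prior odds * LR: the same LR of 19 in favor of h1
# moves different priors by the same multiplicative factor.
def posterior_prob_h1(prior_odds_h1, likelihood_ratio_h1):
    """Posterior P(h1 | data) from prior odds and the LR for h1 over h0."""
    post_odds = prior_odds_h1 * likelihood_ratio_h1
    return post_odds / (1 + post_odds)

lr = 19.0
even_prior = posterior_prob_h1(1.0, lr)            # odds 1:1 -> 19:1, P = 0.95
skeptic_prior = posterior_prob_h1(1.0 / 19.0, lr)  # odds 1:19 -> even, P = 0.5
```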
• Emmanuel Curis · Université René Descartes - Paris 5
> Jeffrey: I agree, we want to make experiments to convince others that our ideas are correct. I misspoke, sorry; I meant, speaking of Jochen's example: "If you really are convinced, like everyone else, that RNA does not modify DNA, why do an experiment to prove the contrary?" Of course, the case "I think RNA can modify DNA, but everyone else does not" calls for an experiment, but in that case using a prior based on "common" knowledge would make it very difficult to prove our own idea - somehow like "se tirer une balle dans le pied" (shooting yourself in the foot).

> Jochen: I had in fact the same interrogation: seeing the test result as a game and the multiplicity correction as protection against « the more we play, the more we win » (or loose), why not correct for all the test a researcher will make in his life, to be sure he "never" looses by chance?

I have no definite answer, but maybe a hint for thinking: I think multiplicity correction is well defined in the case of a kind of composite hypothesis, where you take a decision on the whole based on several tests, rejecting the null if any of these tests is significant. For instance, you ask « is my treatment efficient? » and test this on systolic, diastolic, differential and mean blood pressure.
If you answer « Yes » when at least one of these four tests is significant, you run the risk of "the more you play, the higher the risk that you lose" ==> multiplicity correction.
If you answer « Yes » only when all of these tests are significant, you do not need to correct (in fact, I think you should correct the beta for multiplicity in the power computation, but I have never seen that done, so I am not sure).
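The inflation in the « at least one significant » case is easy to quantify if, for illustration, we assume the tests are independent (an assumption of mine, not Emmanuel's):

```python
# Probability of at least one false rejection among k independent tests at level alpha.
alpha = 0.05
k = 4  # e.g., systolic, diastolic, differential, mean blood pressure

fwer_uncorrected = 1 - (1 - alpha) ** k      # ~0.185: far above the nominal 0.05
alpha_bonf = alpha / k                        # Bonferroni: test each at 0.0125
fwer_bonferroni = 1 - (1 - alpha_bonf) ** k   # ~0.049: back below 0.05
```

With four tests the "play more, risk more" effect nearly quadruples the nominal error rate.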

Now, going to your example about all the literature, or mine about a career, there is no real link between all the hypotheses tested, unless you ask something like « did the researcher sometimes get wrong results? » or things like that, so multiplicity correction is not required. It could be done, however, but with so much power loss that we would be sure never to conclude anything...

As for the test: I agree. I would add that the test indeed gives a Yes or No answer, but one that is useless without interpretation in context (including the choice of alpha and beta; I am not convinced that the conventional 0.05 represents more than a way to avoid thinking about this part of the methodology and the difficulties we are discussing...)
• Jeffrey Welge · University of Cincinnati
I certainly do not disagree with you about multiple testing. It is often not at all obvious what constitutes the "family" over which error rates are to be controlled, or why and to whom it matters. If you examine 10000 genes for differential expression, and YOUR objective is to choose a small set that YOU will examine more closely, then it makes sense for you to control the False Discovery Rate so that only alpha% of the genes in the small set will be false positives. Here YOU face real dichotomous decisions (study each gene further or not) -- for the moment, a gene that is "just barely" significant will be treated the same as the "most significant" gene: They will both go into the follow-up set. Nothing wrong with N-P acceptance/rejection testing in this situation (assuming you choose alpha and beta to reflect your actual priorities, not mechanically based on conventional rules).
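One standard way to control the False Discovery Rate in this screening setting is the Benjamini-Hochberg step-up procedure; a minimal sketch (the p-values are invented, and real gene screens would of course have thousands of them):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Indices of hypotheses rejected at FDR level q (Benjamini-Hochberg step-up)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value is below its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank
    return set(order[:k])  # reject everything up to and including rank k

# Five hypothetical genes: the first two show strong differential expression.
rejected = benjamini_hochberg([0.001, 0.008, 0.4, 0.6, 0.9], q=0.05)
# rejected == {0, 1}
```

Note that, as Jeffrey says, the "just barely" significant and "most significant" genes are treated alike: both simply land in the follow-up set.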

But I can look at the same data in a very different light. Perhaps my theory implies that one of the genes should be differentially expressed, and says nothing at all about the others. Surely in evaluating the evidence about my single gene of interest, I am not obligated to consider tests on those 9999 genes that are for me irrelevant, nor to necessarily reach a dichotomous conclusion about the truth of my theory from your data.

My point is that decision theory is just fine *when there is a decision to be made* -- and sometimes there is. I think we are in substantial agreement, though, that most scientific investigations are not of this type -- a single study provides evidence that alters beliefs, but is rarely so definitive that a particular theory should be completely ruled in or out. Fisher felt the same way: He considered the p-value a continuous-valued measure of evidence (though in my opinion it cannot serve this purpose well), and stated that conclusions drawn about null hypotheses should be viewed as provisional, not final.

Humans feel a need to categorize things, even in cases where we have no explicit decision to make. In practice, we DEMAND of each other that we do this. So while the LR is a continuous measure of evidence, I can no more present it without commenting on whether it is "strong", "weak", "neutral", etc. than I can present a p-value without similar context -- the "asterisk" system of categorizing as 0.05, 0.01, 0.001, etc. arises from the same need to reduce a continuous measure to a suitably small number of categories. Two categories (Accept/Reject) just seems too few in situations where we do not actually face an accept/reject decision.

Any continuous statistic (e.g., posterior probabilities or odds) will face the same problem -- humans will demand guidelines for categorizing it.
• Jochen Wilhelm · Justus-Liebig-Universität Gießen
Jeffrey, there is nothing more to be said from my side. You very precisely formulated my own thinking. In particular, I found it nice that you explicitly state here that NP theory is well suited for "screenings" whereas it has drawbacks in promoting knowledge from specific experiments, which is what I answered Kristien at the beginning of this discussion. I certainly agree that humans like to think in categories. But scientists should be trained more to think "continuous"... And whatever categories we build, the problem of what strength of evidence is convincing will always depend on our beliefs. By focussing so much on the strength of evidence we lose sight of carefully stating our current state of belief. Usually, the latter is done, if at all, implicitly and not very objectively or transparently. At the beginning of a line of research, different researchers will have different, more or less comprehensible opinions/beliefs, but by correctly stating and quantifying them, new evidence from different sides will modify those beliefs so that they become more and more similar. This reflects the scientific consensus and the current state of knowledge that is common to mankind (something like an inter-subjective objectivity...). By now we employ an objective way to quantify evidence from data. The next step should be to objectify the process of learning, i.e., the way the evidence changes our knowledge.
• Rameswar Nag · Utkal University
Dear, you can use correlation and regression analysis.
Thank you.
• Peter Smetaniuk · San Francisco State University
@Kristien: Look up point-biserial and biserial correlations (you can find their definitions simply by Googling). Spearman's is a nonparametric statistic, used when your distribution has violated certain assumptions (e.g., your data set IS NOT normally distributed, there are too many outliers, or a skew or kurtosis value is too high or too low). If your data set is not normally distributed, you can perform specific 'transformations' to "normalize" your data; investigate before performing any transformations (there are a few procedures to choose from, depending on your data).
• Peter Smetaniuk · San Francisco State University
Also, Kristien: your question seems to address the choice of a parametric or nonparametric test for the equality of means or homogeneity across two or more samples. The Kruskal-Wallis test is the nonparametric option. ANOVA can be quite robust to violations of its assumptions, but not always, which is why one can choose the Kruskal-Wallis test.
• Philimon Gona · University of Massachusetts Medical School
Use the biserial correlation; there is a routine for it in R, or use the freely available SAS macro Biserial.sas.
• Nada El Osta · Saint Joseph University, Lebanon
If the categorical variable has two categories (dichotomous), you can use the Pearson or Spearman correlation.
If the categorical variable has more than two categories, you have to compare the different means, and you use ANOVA (a parametric test, after verifying the normality assumption) or Kruskal-Wallis (a non-parametric test) if the normality assumption is not met.
• Nada El Osta · Saint Joseph University, Lebanon
To be noted: if the categorical variable is dichotomous, the Pearson correlation, the ANOVA and the Student t test give the same p-value.
But if the categorical variable has three or more categories, you must use only ANOVA or Kruskal-Wallis.
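Nada's equivalence is easy to verify numerically: with a 0/1-coded group variable, the test of Pearson's r (i.e., the point-biserial correlation) and the equal-variance two-sample t-test give the same p-value. A sketch assuming SciPy is available (the data are invented):

```python
from scipy import stats

group = [0, 0, 0, 0, 1, 1, 1, 1]                   # dichotomous predictor
y = [4.1, 5.2, 3.9, 4.8, 6.0, 6.5, 5.8, 7.1]       # continuous outcome

r, p_corr = stats.pearsonr(group, y)               # point-biserial correlation test
y0 = [v for g, v in zip(group, y) if g == 0]
y1 = [v for g, v in zip(group, y) if g == 1]
t, p_ttest = stats.ttest_ind(y0, y1)               # Student t, equal variances assumed

# p_corr and p_ttest agree up to floating-point rounding.
```

Both tests reduce to the same t statistic with n - 2 degrees of freedom, which is why the p-values coincide.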
• Giovanni Bubici · National Research Council
Kristien,
Pearson's correlation is adequate for continuous variables, whereas Spearman's and Kendall's correlations are adequate for categorical (ordinal) variables. You could use a Spearman's correlation by transforming your continuous variable into an ordinal one (or into ranks).
Instead, for non-ordinal categorical variables, association tests must be performed, not correlation tests.
• Emmanuel Curis · Université René Descartes - Paris 5
> Giovanni: Spearman's and Kendall's correlation coefficients ARE suited for continuous data, not only for ordinal data. You do NOT have to convert them to ordinal variables, thankfully. Pearson's correlation coefficient is suited for continuous data, but tests on its value can be done only for bivariate normally distributed data.
• Giovanni Bubici · National Research Council
Emmanuel,
I'm sorry, you're right. Can we say that Spearman's and Kendall's correlations are adequate for variables measured at least on ordinal scales?
• Jochen Wilhelm · Justus-Liebig-Universität Gießen
Yes, we can :-)
• Sergio Pezzulli · Kingston University London
First of all you have to ask HOW we can measure DEPENDENCE between a categorical and a continuous variable. And this depends on your categorical variable: is it ordinal or not?
• Giovanni Bubici · National Research Council
Emmanuel,
concerning the variable transformation that I mentioned: can a correlation be computed between a continuous and an ordinal variable? Shouldn't they have roughly the same number of categories?
• Sergio Pezzulli · Kingston University London
Giovanni, ciao. You don't need the same number, so the continuous variable could be split into a relatively fine grid (but not too fine compared to the ordinal one). Then you can use rank correlation (Spearman's rho) with the tie corrections. Be careful: not all programs do this. The tie correction is that if, for example, the ranks are 1 2 2 4 5 6, i.e. there is a tie at the second place, then you have to use the average, that is 1 2.5 2.5 4 5 6...
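Sergio's averaging rule is what most rank implementations apply by default; for example, SciPy's rankdata (a small check, assuming SciPy is available):

```python
from scipy.stats import rankdata

values = [10, 20, 20, 30, 40, 50]      # a tie at the second place
ranks = list(rankdata(values))         # method='average' is the default
# ranks == [1.0, 2.5, 2.5, 4.0, 5.0, 6.0]
```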
• Emmanuel Curis · Université René Descartes - Paris 5
Hi Giovanni,
No problem with that, at least from a theoretical point of view. You do not need to have the same number of categories for the two variables; you just need to have "complete" pairs of measurements and to be able to sort both variables.
For instance, if Y is continuous and X is dichotomous with A < B, and if you have the sample set { (A, 0.145), (B, 0.15), (A, 0.13), (B, 0.143), (A, 0.12) }, you can compute ranks for X and Y and hence obtain Spearman's rank correlation coefficient or Kendall's count of discordant pairs.
Here, the rank pairs would be, neglecting the ties problem, { (1, 4), (4, 5), (2, 2), (5, 3), (3, 1) }
However, as Sergio mentioned, with at least one ordinal variable with few categories the problem of ties appears, and it is not always easy to solve, especially for small sample sizes where asymptotic normality does not hold. Sergio's approach is the easiest one, and probably the most used, but not the only one.
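For illustration, SciPy's spearmanr applied to Emmanuel's sample (coding A = 0 and B = 1, my convention) resolves the tied X values with average ranks rather than neglecting them:

```python
from scipy.stats import rankdata, spearmanr

x = [0, 1, 0, 1, 0]                    # A, B, A, B, A with A < B
y = [0.145, 0.15, 0.13, 0.143, 0.12]

rx = list(rankdata(x))                 # [2.0, 4.5, 2.0, 4.5, 2.0]: tied ranks averaged
rho, p = spearmanr(x, y)               # Pearson correlation of the rank pairs
# rho is about 0.577 here
```

The three A's share the average of ranks 1, 2, 3 and the two B's the average of ranks 4, 5, which is Sergio's correction applied to the X side.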
• Emmanuel Curis · Université René Descartes - Paris 5
Also, from a theoretical point of view, I think Spearman's correlation, like many other rank-based tools, assumes that both variables are absolutely continuous, which ensures the absence of ties... and it is only afterwards extended to discrete variables, hence the problem of handling ties unambiguously.
• Jón Jónsson · University of Iceland
For those interested in categorical data analysis, I suggest the book of Alan Agresti: an introduction to categorical data analysis:

http://www.amazon.com/Introduction-Categorical-Analysis-Probability-Statistics/dp/0471226181/ref=sr_1_3?ie=UTF8&qid=1348663996&sr=8-3&keywords=alan+Agresti

I took a course in graduate school which was based around this book, and I particularly found the examples in the book useful. The text was fairly accessible.
• Sunil Shrestha · Tri-Chandra Campus
You should calculate Spearman's Correlation coefficient.
• Jose Kitahara · University of São Paulo
Kristien, as an operational approach, I use the GLM (General Linear Model) procedure in SPSS - Statistical Package for the Social Sciences (IBM). You can model a regression with quantitative and qualitative independent variables. It provides lots of information for building the model and for validation tests.
• Elia Vecellio · University of New South Wales
Following up on Jeffrey Welge's comment, and on the categorization of continuous variables to make them fit into an ANOVA framework (e.g. the ubiquitous median-split high-something vs. low-something groups), I recommend reading a great article: MacCallum, Zhang, Preacher, Rucker (2002). On the practice of dichotomization of quantitative variables. Let me spoil the ending: categorization/dichotomization of continuous variables is very rarely a good thing to do.
• Fabio Montanaro · Latis Srl, Italy, Genova
GLM (or regression) sounds like the most reliable answer. If you have only the two mentioned variables, even logistic regression could give you some information.
• Raghunandan G.C. · University of Agricultural Sciences, Bangalore
Biserial correlation is the solution.
• Peter Smetaniuk · San Francisco State University
You may need to adjust the coefficient depending on the nature of your dichotomous variable. The biserial coefficient is used when the dichotomy reflects an underlying continuum (e.g., passing or failing an exam, which cuts a continuous score), and the point-biserial is used for truly discrete dichotomies (e.g., pregnant or not pregnant). It is the biserial coefficient that ought to be adjusted, using an equation found in most comprehensive stats manuals, according to the proportions that fall into the categories of your dichotomous variable.
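The adjustment Peter refers to is, in its classical textbook form, a conversion from the point-biserial to the biserial coefficient under the assumption that the dichotomy cuts an underlying normal variable: r_bis = r_pb * sqrt(p*q) / h, with h the standard normal ordinate at the cut point. A sketch assuming SciPy (the input values are invented; check a stats manual for your own case):

```python
from math import sqrt
from scipy.stats import norm

def biserial_from_point_biserial(r_pb, p):
    """Biserial r from point-biserial r, with p the proportion in one category."""
    q = 1.0 - p
    z = norm.ppf(p)        # cut point on the latent normal scale
    h = norm.pdf(z)        # ordinate of the standard normal density there
    return r_pb * sqrt(p * q) / h

r_bis = biserial_from_point_biserial(0.30, 0.5)   # ~0.376 for a 50/50 split
```

The biserial value is always somewhat larger in magnitude than the point-biserial, since the dichotomization discards information about the latent continuum.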
• Ivan Kshnyasev · Russian Academy of Sciences
Just use multiple regression: factor’s categories as dummy variables. Multiple correlation coefficient (or its square) is an appropriate measure of the association.
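Ivan's point can be checked directly: regress the outcome on dummy variables for the factor levels, and the squared multiple correlation equals eta-squared (SS_between / SS_total) from the one-way ANOVA. A sketch with NumPy (the data are invented):

```python
import numpy as np

groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])              # 3-level factor
y = np.array([4.0, 5.0, 4.5, 6.0, 6.5, 7.0, 8.0, 8.5, 9.5])

# Dummy coding: intercept plus indicators for levels 1 and 2 (level 0 as reference).
X = np.column_stack([np.ones_like(y),
                     (groups == 1).astype(float),
                     (groups == 2).astype(float)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# The same number from the ANOVA decomposition:
ss_between = sum((groups == g).sum() * (y[groups == g].mean() - y.mean()) ** 2
                 for g in range(3))
eta2 = ss_between / ((y - y.mean()) ** 2).sum()
# r2 == eta2 (up to rounding)
```

The equality holds because the fitted values of the dummy regression are exactly the group means.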
• Walter Hillabrant · Support Services International, Inc.
Maybe I missed it, but if the categorical variable is dichotomous, the point-biserial correlation coefficient is appropriate. In any event, I'm not big on tests of independence, as in my view of measurement, failure to detect associations is often the result of weak measures... sort of akin to accepting the null hypothesis.
• Neal Van Eck · University of Michigan
You might want to look at MCA (Multiple Classification Analysis) in MicrOsiris and other packages:

MCA examines the relationships between several categorical independent variables and a single dependent variable using an additive model. The technique handles predictors with no better than nominal measurement and interrelationships of any form among predictors or between a predictor and the dependent variable. The dependent variable should be an interval-scaled variable without extreme skewness or a dichotomous variable with two frequencies which are not extremely unequal. MCA determines the effects of each predictor before and after adjustment for its inter-correlations with other predictors in the analysis. It also provides information about the bivariate and multivariate relationships between the predictors and the dependent variable.
• Mavuto Mukaka · University of Malawi
I think it really depends on the levels of your categorical variable and whether they are ordinal or nominal. Firstly, if the categorical variable has many levels which are ordinal, Spearman's correlation coefficient is the way to go (and mind you, the question of dependent versus independent does not arise when assessing correlations). However, if the variable is binary or nominal, I think the question of correlation does not arise, as correlation is about the linear relationship between two continuous/ordinal variables. That would clearly be an example of what Altman's book means when it says "misuse of correlation is so common that some statisticians wish the method had never been devised". The question in that case is whether the two variables are related or not, which can be answered by an ANOVA/Kruskal-Wallis test or by regression techniques.
• Irfan Yurdabakan · Dokuz Eylul University
Dear Kristien,
According to the data that you describe, you should compute the point-biserial correlation.
Take it easy.
• Mary Jannausch · University of Michigan
In response to your question:
"Is it a fair assumption that if you do an Anova or Kruskal Wallis test with an independent categorical variable and a dependent continuous variable that shows no significance, to assume that there is no "correlation" between the two variables?"

No, it's not at all reasonable to assume that the true correlation (or, more broadly, covariance) is = 0. Cov(X, Y) = 0 is necessary but not sufficient to demonstrate independence between X and Y. It's possible for Cov(X,Y) and Corr(X,Y) to be zero but the underlying relationship between X and Y is not independent.

This is true regardless of what statistical tests you use, for inference. You've said that one variable is independent, and the other is dependent. Dependent, in what sense? What are X and Y in this case?
• Jayalakshmy Parameswaran · National Institute of Oceanography
I agree very much with Mary Jannausch's comments, because when the correlation is zero it does not mean that the two variables are totally independent. Pearson's correlation measures only linear association (and Spearman's only monotone association). So when r = 0 it does not imply that the two variables are independent in a curvilinear sense; they may be related by a higher-order equation such as a quadratic or cubic function. Also, the statement that one variable is independent and the other dependent needs to be made clearer.
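A tiny numerical illustration of this point (assuming SciPy): Y below is a deterministic function of X, yet Pearson's r is exactly zero because the relation is quadratic and X is symmetric about zero:

```python
from scipy.stats import pearsonr

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]            # perfect, but purely quadratic, dependence

r, p = pearsonr(x, y)             # r == 0: the linear measure sees nothing
```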
• Shree Nanguneri · University of Southern Mississippi
If you are willing to share details on the exact categorical variable you are using, I can help you convert it into a continuous one and eliminate the need to deal with this situation, making a difference to your final objective. Good luck!