# P-value vs. trend: if a result is not significant, is it plausible that the trend predicts what the p-value would be with an increased sample size?

In many cases there is a positive trend between two events, but the relationship is not statistically significant. The researcher may think that, with a p-value only slightly above 0.05, the relationship might become significant if the sample size were increased. On the other hand, the investigated relationship might just as well become clearly non-significant. What do you think about this?

## All Answers (13)

Jochen Wilhelm · Justus-Liebig-Universität Gießen

Interpreting the trend (or, more generally, the effect) is not decision-theoretic. It is an "inferential" approach. The effect is estimated from the available data, and one can try to judge whether the estimated effect is somehow relevant and whether the precision of the estimate is sufficient to be confident enough about the usefulness of the model. There is NO distinction between "significant" and "non-significant". The question is about the most likely size of the effect, given the available data. There is NO control of error rates here. The reasoning is based on logical/reasonable arguments and models.

If you want to separate significant from non-significant findings, then any thinking about trends/effects is a waste of time. If you do so, you should define a minimum relevant effect you wish to detect as a "significant result" with a desired power, before you even start the experiment. This, plus knowing the expected variance, enables you to calculate the required sample size. Then, when this planned experiment gives a "non-significant" result, you know that it was unlikely enough (depending on the power) to get this result given there was a relevant effect. This also enables you to keep a maximum false-negative rate. Again, this still does not tell you how likely your decision in this particular case is right or wrong. You can just control worst-case error rates.

Patrick S Malone · Malone Quantitative

Elaborating somewhat, I'd add that this is not the only approach to drawing statistical inference -- even if it is the widely accepted one in some fields. My own field, psychology, relies heavily on it, though there is an increasing expectation (including in the APA publication manual) of also examining confidence intervals in a complementary fashion. CIs have the potential to address the "trend" question in that their width, as well as their position relative to the null-hypothesis value, can be taken as a post-hoc indication of whether power may be the issue.

In my opinion, the pendulum might be swinging too far, in that confidence intervals are not easily or obviously derived for certain statistical tests (thinking particularly of tests of multiple-degree-of-freedom hypotheses). Further, there is a widespread lack of understanding that confidence limits are also estimated with error.

While I am trying to learn more about Bayesian approaches, I confess to not yet having a deep enough understanding to say whether their widespread adoption would help resolve these issues in a fashion likely to be accepted by psychology peer reviewers any time soon.

But Jochen's answer to the question as stated is dead on.

Jochen Wilhelm · Justus-Liebig-Universität Gießen

CIs are calculated on exactly the same basis as p-values, and originally they have the same frequentist interpretation as p-values: a CI is a random interval, sometimes including the "true" value, sometimes not. The construction rules ensure that no more than 5% of the 95%-CIs will miss the "true" value. Rejecting H0 based on p is the same (logically and practically) as rejecting H0 based on the fact that the CI does not include the H0 value. From this frequentist point of view, a confidence interval tells us nothing about the effect. It is a random interval; it can be here or there, narrower or wider. The particular CI is just a realization, in itself of no help for inference. Only the long-run properties are controllable, in just the same (identical!) way as for p-values.
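
To make the duality concrete, here is a minimal sketch (a pooled-variance two-sample t-test; the data and the shift of 0.4 are made-up illustration values): the decision "p < 0.05" and the decision "the 95% CI for the difference excludes 0" always agree.

```python
# Minimal sketch: "p < 0.05" and "95% CI excludes 0" are the same decision
# for a pooled-variance two-sample t-test. All numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=30)   # group A
y = rng.normal(0.4, 1.0, size=30)   # group B, hypothetical shift of 0.4

t, p = stats.ttest_ind(x, y)        # equal_var=True by default (pooled variance)

# 95% CI for the difference in means, built from the same pooled-SE machinery
diff = x.mean() - y.mean()
n1, n2 = len(x), len(y)
sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
tcrit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci = (diff - tcrit * se, diff + tcrit * se)

print(f"p = {p:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print("reject by p:", p < 0.05, "| reject by CI:", not (ci[0] <= 0 <= ci[1]))
# The two booleans always agree: 0 outside the 95% CI  <=>  p < 0.05.
```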

But CIs in fact provide a way towards a different interpretation. In contrast to a p-value, a CI gives an estimate of the location, the effect size (plus a measure of the precision of this estimate, given the data!). Further, the CI can be seen as (and is mathematically identical to) the central region of estimates with the highest likelihood, given the data. And now it is not a big step further to a Bayesian interpretation: the posterior is proportional to the prior multiplied by the likelihood. If the prior is essentially flat (constant), then the posterior is just proportional to the likelihood, and the 95% CI is identical to the 95% credible interval. Additionally, the maximum-likelihood estimate (often the mean) is identical to the maximum-posterior estimate, i.e. just that value of the parameter that is most likely, given the data and the flat prior. Whether or not a flat prior is reasonable is another discussion. For many (practical) problems, in my opinion, it is good enough.
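
In symbols (a standard statement of Bayes' theorem, not specific to any model in this thread):

```latex
\[
  p(\theta \mid \text{data}) \;\propto\; p(\text{data} \mid \theta)\, p(\theta)
\]
% With an essentially flat prior, $p(\theta) \propto 1$, the posterior is
% proportional to the likelihood alone, which is what links the 95% CI to
% the 95% credible interval in this special case:
\[
  p(\theta \mid \text{data}) \;\propto\; p(\text{data} \mid \theta)
\]
```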

The "road to Bayes" is clearly left as soon as the existence of an effect is merely based on the inclusion/exclusion of the H0 value. We are then again selling the information about the individual effect size and logical coherence in order to buy the control of error rates in long-run decision processes.

I find it interesting that the Bayesian interpretation is (largely) incompatible with controlling error rates. Thinking about science may show that the major aim is to "explain" ("predict") observations based on models. Models, however, are not the truth; they don't even aim to be "close to the truth". Models are necessarily wrong, which is actually a prerequisite for being (potentially) useful. Thus, the whole model-building approach in science is not, and can never be, about being right or wrong, and controlling error rates for this purpose is flawed or at least misleading. Instead, the main question is (should be!) about the reasonableness, usability, and efficacy of a proposed model. This can always and only be judged in light of the available data, and the door is always open to modify, change, or abandon a model when new data become known.

Jochen Wilhelm · Justus-Liebig-Universität Gießen

Further, there are two stumbling blocks in your answer:

1) "According to the Central Limit you will eventually end up with normal curve if you keep increasing your sample size.". This "eventually" is in infinite future, given a finite amount of time required for each measurement. And it is restricted to variables with a finite variance. The mean position of a scattered photon on an X-ray screen, for instance, will never follow a normal distribution.

2) "You will fin[d] your answers already in the literature.". Many subject-specific (non-statistical!) papers do present results like "although the [effect] was non-significant, there was a clear trend... (p=0.061)...". You find it quite frequently, and any poor scientist (non-statistician) can easily become confused. Even stats reviews in biomedical journals are sometimes wrong w.r.t. some statements. For instance read Critical Care, Vol 6 No 3. (http://ccforum.com/content/6/3/222) where the authors state (sereval times!) that "A P value is the probability that an observed effect is simply due to chance". *fail* Textbooks about statistics (at least for bio/medical readers) very often present an unfortunate and wrong conglomeration of Neyman/Pearson's and Fisher's approaches of testing. "Reading the literature", as you suggest, can easily and will likely lead to confusion and misconceptions.

I prefer that people ask when they are unsure, instead of being left alone reading books they might not understand well and from which they might draw the wrong conclusions. This forum is a place for such questions. And questions posted here quite often prompt others to think about something, leading to new ideas and insights, and topics develop beyond the original focus. Answers may be found to be wrong or suboptimal by others and can thus be refined. That is to say: we (can) all(!) learn and profit.

J. Patrick Kelley · University of New Hampshire

1) "We do not want p vale more than p .05..." I'm optimistic that this was not meant to imply that we are wishing (however secretly) for a particular outcome of a statistical test. It is certainly difficult to remain objective, but it's important to eliminate such language lest future readers adopt it.

2) "Try to learn the facts instead of searching for answers yourself. The research lit. review is the same approach. You will fin[d] your answers already in the literature." This is also problematic, as Jochen as stated. As an additional example, one should also consider all those published studies that cite Zar's or Sokal and Rohlf's statistical textbooks as support for normalizing all independent variables (via data transformations), when indeed they suggest otherwise. So, even close reading can be misguided. Through an evaluative process of reading, posting on forums like this, reading, posting, reading, ad infinitum, you learn to verbalize argument. As Jochen says, we most certainly can all learn and profit, but we can't sit around reading all the books before we speak up and test our knowledge.

3) "...however first you must know the literature." Admittedly, I have an odd perspective here. There's no doubt that a foundational knowledge of one's field is a crucial element of development. However, knowing a field's entire literature can be detrimental. In the field of the ecology, I have experienced a narrowing of ideas if I read too much about a subject beforehand. In most iterations of the cyclical Scientific Method, the first step is "Observe and ask a question." It isn't "Read the literature exhaustively." In some fields--like physics--I suspect that this observation step amounts to a good bit of reading rather than observing the behavior of birds or insects (as I do) responding to predators. But not in the observation-based fields I'm in. The literature simply hurts me at the initial stages of a project. We are the agents of science. As such, we must be the ones to experience the phenomena and interpret them. Reading about the statistically significant response of a sparrow singing rate in response to a song playback has yet to tell me anything about how the tropical antbird I study will respond. I think decent ideas and forward progress require the flexibility to devote unfettered thought to a set of observations as well as the desire to then search the literature for corroboration or gaps.

In sum, know the general literature, but don't read everything. "Everything" will simply confuse you and prevent you from developing the multitude of alternate hypotheses on which good scientific investigation should be based. Also, the literature is not that big considering all the questions and ideas and possibilities out there. So, chances are, it won't contain the answers that one is looking for.

Jerry Miller · Texas Children's Hospital

In epidemiology we sometimes make projections, for example taking a small-ish study and then doubling or tripling the dataset to create an (artificial) dataset in order to increase our "sample size" and project what the statistical significance would be in a larger population. But this should be done carefully, if at all. It may be better to take or estimate an actual sample population from, e.g., census data if that is possible.

Your study has two events (two time points?) -- if you can add a third event or time point, then this will provide more data to establish a statistical trend. I'm not familiar with all the methods to establish a p-value for trend.

Patrick S Malone · Malone Quantitative

First -- and this may apply to artificial data as well -- the decision-theoretic approach that Jochen discussed in depth at the beginning does not support "poking" at the results and then deciding whether to collect more data. That has the potential to increase Type I error because of inevitable bias in the decision to collect more data. You'll most likely go back to your sample only if you have a "trend," so you're relying on a non-significant effect to decide to go for significance -- possibly capitalizing on chance in the original dataset. (A small simulation after this answer illustrates the inflation.)

Second -- and this would *not* apply to artificial data -- you risk contamination by history effects in your sampling frame. That is, going back, especially in a non-randomized study, may result in a sample from a different "population," if there are shifts in the population parameters over time. This could be dealt with by incorporating a sampling-time X predictor interaction, but that's probably going to reduce your power if it doesn't exist, and make your results very difficult to interpret if it does.

Really, a priori power analysis and sample-size determination are the only unchallenged way around this problem, at least that I can think of offhand.
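
To illustrate the first point above, here is a rough simulation sketch (the assumptions are mine: H0 is true, the data are normal, there is one interim look, and a "trend" is arbitrarily defined as p < 0.20). Collecting more data only when the first look shows a trend pushes the overall false-positive rate above the nominal 5%.

```python
# Optional stopping / "peeking" under a true H0: the overall false-positive
# rate ends up above the nominal 5%. Thresholds and sizes are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n1, n2, reps = 30, 30, 20_000
false_pos = 0
for _ in range(reps):
    x = rng.normal(size=n1)               # H0 is true: the mean really is 0
    p1 = stats.ttest_1samp(x, 0).pvalue
    if p1 < 0.05:
        false_pos += 1                     # "significant" at the first look
    elif p1 < 0.20:                        # a "trend": collect more data, re-test
        x = np.concatenate([x, rng.normal(size=n2)])
        if stats.ttest_1samp(x, 0).pvalue < 0.05:
            false_pos += 1
print("empirical type-I error rate:", false_pos / reps)   # above 0.05
```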

Ulrich Frick · University of Zurich

Maybe this paper would clarify some of your questions:

Parmar MKB, Griffiths GO, Spiegelhalter DJ, Souhami RL, Altman DG, van der Scheuren E, for the CHART steering committee. Monitoring of large randomised clinical trials: a new approach with Bayesian methods. The Lancet, 4 August 2001; 358(9279): 375-381.

Jerry Miller · Texas Children's Hospital

I know we should ideally compute our sample size in advance and not secretly wish for certain results and so on, but this situation comes up legitimately in the exploration of data or in pilot studies. I'm using logistic regression on a dataset, and one important variable is almost a statistically significant predictor of the outcome. I wonder whether increasing the sample size in a subsequent study (again assuming everything is equal, which it might not be) would narrow the confidence interval enough to make this variable significant. My intuition says it would. I can't find methods to compute sample size for multivariate logistic regression with a mixture of categorical and continuous variables.
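
One pragmatic route when no closed-form formula fits the model is a simulation-based power analysis. The sketch below is only illustrative: the coefficients, the covariate mix, and the helper name `power_at_n` are placeholders; the actual values would have to come from the pilot data.

```python
# Hypothetical simulation-based power analysis for logistic regression with
# one continuous and one binary predictor. All effect sizes are placeholders.
import numpy as np
import statsmodels.api as sm

def power_at_n(n, beta_focus=0.5, reps=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x_cont = rng.normal(size=n)            # continuous predictor of interest
        x_cat = rng.integers(0, 2, size=n)     # binary covariate
        lin = -0.5 + beta_focus * x_cont + 0.3 * x_cat
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))
        X = sm.add_constant(np.column_stack([x_cont, x_cat]))
        try:
            fit = sm.Logit(y, X).fit(disp=0)
            hits += fit.pvalues[1] < alpha     # p-value of the focal predictor
        except Exception:                      # e.g. separation at small n
            pass
    return hits / reps

for n in (100, 200, 400):
    print(n, power_at_n(n))                    # pick the n that reaches ~0.8
```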

And when does one properly use the word "trend"? When describing statistical results at two or more time points? Thanks for any theoretical and practical advice on this.

Jochen Wilhelm · Justus-Liebig-Universität Gießen

The p-value, on the other hand, just tells you how likely it is to obtain a statistic (e.g. a t-, F-, Chi²-, R²-, r-, ... value) that is at least as extreme as the one calculated from the observed data, under the assumption that there is no relation between the predictor and the response (the null hypothesis). Rejecting the null hypothesis based on this information alone does not tell you anything about this particular experiment. The rejection is part of a testing procedure to make a definite decision (will you go this way or that way). Such definite decisions are rarely sensible in research - but that is another topic. Given that you take such decisions, each one may be right or wrong. Still, you CAN'T SAY anything about the chance of making a right or wrong decision in a particular test, for a particular set of data, because for this you would need to know something about the trend in general, i.e. without seeing the data (Bayesians call this the "prior" information/distribution). The only thing you can achieve based only on the observed data is to keep a (maximum) long-run type-I error rate. Nothing more, and nothing less.

The effect of increasing the sample size has a theoretical and a practical aspect. Theoretically, if H0 is in fact true, it *will* happen that 5% of the experiments performed have p-values below 0.05 (under H0 the distribution of the p-values is uniform!). If you have a p close to 0.05 (or below!!) and H0 is true and you increase the sample size, it is very likely that the p-value will become larger. If H0 is not true, the p-value will systematically become smaller and smaller with increasing sample size (plus some noise, for sure). In practice, unfortunately, it is rarely a sensible assumption that you sample from *identical* populations. There will always be some very, very minor, subtle, irrelevant, negligible differences somewhere, with some small, vague, and indirect relation to the response you are measuring. Therefore, H0 is actually never ever "true" in real examples, just as there is no real variable that is *exactly* normally distributed. With a large enough sample size you will thus always get a small p-value, and you can make it as small as you wish just by increasing the sample size. And you will surely always get some "highly significant" result, often for some irrelevant difference or trend in the data, which is there because the populations are never exactly identical anyway.
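
A small simulation of both claims (the "tiny" effect of 0.05 SD and the sample sizes are arbitrary choices): under H0 the p-value shows no systematic drift as n grows, whereas under even a tiny non-zero effect it eventually becomes arbitrarily small.

```python
# Median p-value over repeated experiments: flat around 0.5 when H0 is true,
# collapsing towards 0 for a tiny true effect once n is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
for true_effect in (0.0, 0.05):
    print(f"true effect = {true_effect}")
    for n in (100, 1_000, 10_000, 100_000):
        ps = [stats.ttest_1samp(rng.normal(true_effect, 1.0, n), 0).pvalue
              for _ in range(200)]
        print(f"  n = {n:>7}: median p = {np.median(ps):.3g}")
```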

Side note: Increasing the "sample size" for a test by simply re-using the observed data is not at all helpful, because it doesn't add new information. If there is some difference or trend in the sampled data, then re-using this same data will systematically retain this difference/trend; only the "virtual" sample size the test "sees" is increased, so the p-value will be smaller. This will ALWAYS be the case, with one exception: the data show *exactly* no difference/trend, so the original p-value is *exactly* 1.0. The problem with this procedure is that the "relevant effect" and the variance are both estimated from the available data, and using these in turn to get a sample size to reliably detect this effect is circular, like Münchhausen pulling himself out of the swamp by his own hair.
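
And a short demonstration of the side note (simulated data; the sample mean is pinned at 0.2 purely for the illustration): duplicating the same observations mechanically shrinks the p-value without adding any information.

```python
# Re-using the same data: the test only "sees" a larger n, so p shrinks,
# although not a single new observation was collected.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=40)
x = x - x.mean() + 0.2                   # pin the sample mean at 0.2 for the demo
for copies in (1, 2, 4, 8):
    xx = np.tile(x, copies)
    print(copies, "copies -> p =", round(stats.ttest_1samp(xx, 0).pvalue, 4))
```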
