Question
Asked 11 July 2014

How can I interpret regression when an insignificant interaction term makes significant predictors insignificant?

I have two predictors in a linear regression: A (gender, coded 0-1) and B (continuous, centered). A and B are significant predictors. When I further introduce the interaction term (A x B), the interaction term is insignificant, yet it also makes B insignificant.
How should I interpret and handle this situation? Should I retain the interaction term in the equation? Does it mean that B is insignificant after all?
Theory-wise, the effects of A and B are hypothesized, yet I introduce the interaction to show that these effects are independent of gender differences.
Thank you RG community for your support!
 

Popular answers (1)

Jochen Wilhelm
University of Giessen
If you follow the Neymanian philosophy of hypothesis tests, you have designed the experiment to achieve a certain power for the tests. Then you follow the rules and either accept or reject a hypothesis (with a well-defined confidence as specified through 1-alpha and 1-beta), for instance based on a comparison of the p-values with the given alpha.
Then you would start with the full model A+B+A:B. The first decision (if I follow your comments) would be to accept the null that A:B=0. The accepted model is then A+B. The next step is to decide about the main effects. These are two tests, and it might be required to control the FWER instead of the TWER (family-wise or test-wise error rate). The decisions will be (again following your comments) to reject both nulls (A=0 and B=0). So the final model is A+B.
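As a sketch of this decision sequence (simulated toy data, pure Python rather than R; the effect sizes and variable names are invented for illustration), one can fit A+B+A:B and A+B by least squares and form the F statistic that the interaction test is based on:

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    n, k = len(X), len(X[0])
    M = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    for r in range(k):
        p = max(range(r, k), key=lambda i: abs(M[i][r]))
        M[r], M[p] = M[p], M[r]
        for i in range(k):
            if i != r:
                f = M[i][r] / M[r][r]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
    return [M[r][k] / M[r][r] for r in range(k)]

def rss(X, y, b):
    """Residual sum of squares of the fitted model."""
    return sum((yi - sum(x * c for x, c in zip(xi, b))) ** 2
               for xi, yi in zip(X, y))

random.seed(1)
n = 40
A = [i % 2 for i in range(n)]                      # gender, coded 0/1
B = [random.gauss(0, 1) for _ in range(n)]
B = [b - sum(B) / n for b in B]                    # center B
# true generating model has main effects only (no interaction)
y = [1.0 + 0.8 * a + 0.5 * b + random.gauss(0, 0.5) for a, b in zip(A, B)]

X_full = [[1, a, b, a * b] for a, b in zip(A, B)]  # A + B + A:B
X_red  = [[1, a, b]        for a, b in zip(A, B)]  # A + B
rss_full = rss(X_full, y, ols(X_full, y))
rss_red  = rss(X_red,  y, ols(X_red,  y))

# F test for H0: interaction coefficient = 0 (1 numerator df)
F = (rss_red - rss_full) / (rss_full / (n - 4))
print(f"RSS full={rss_full:.3f}  RSS reduced={rss_red:.3f}  F={F:.3f}")
```

Comparing F against the F(1, n-4) reference distribution gives the p-value that R's anova(modelWith, modelWithout) would report for the same pair of models; since adding a term can only reduce the residual sum of squares, F is never negative.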
The question of significance is different. Following Fisher's philosophy of significance tests, the p-value gives you a (continuous) measure of the statistical significance. This can help you decide whether it would be worth continuing to work on the hypothesis or whether you should better change your plans. It is about what you consider significant. The p-value alone is meaningless. You must consider the theory behind the hypothesis, the amount of data you have, the quality of the data, the effect size, the implications, and much more. And never confuse statistical significance with (theoretical or practical) relevance! There is no statistical tool to measure, derive, or judge relevance.
(Don't think that this is more difficult or arbitrary than doing hypothesis tests! For hypothesis tests you have to define alpha and beta, and this is either very difficult to practically impossible [at least in science!] - or completely arbitrary! Only the rule is stricter, which makes the procedure appear more objective.)
However, my advice goes in a third direction:
Justify theoretically why (or why not) the effect of A on the response should (or should not) depend on B (or vice versa). Make this decision on theoretical grounds, not based on the data. Then use the model you think (and have argued) is reasonable/sensible and interpret the estimated effects. Do not consider p-values. Instead, have a look at the confidence intervals when you interpret the effects. Also plot the data and the model prediction and use it for a wise interpretation that would convince a critical colleague.
 
PS: tests do not tell you if a model is correct (e.g. that A+B is correct but not A+B+A:B). Tests assume that the model is correct!
29 Recommendations

All Answers (33)

Bettina Braun
University of Konstanz
Dear Jochen, thanks for the very detailed outline, with which I fully agree. It reminded me of a recent paper by Cumming (2014) in Psychological Science. However, practically speaking, it is very difficult to get research published if you do not provide p-values (I just mention the efforts of various authors to estimate p-values for linear mixed-effects regression models).
Regarding Lukasz' question, I would have thought that a reasonable next step would be to compare the linear regression model with the interaction term to one with only the main effects (for instance using the R function anova(mod1.lmer, mod2.lmer); see suggestions in Cunnings 2012). Or do you see problems with that procedure, Jochen?
 
References
Cumming, Geoff. 2014. The new statistics: Why and How. Psychological Science 25(2). 7-29.
Cunnings, Ian. 2012. An overview of mixed-effects statistical models for second language researchers. Second Language Research 28(3). 369-382.
3 Recommendations
Ignacio Tobía-González
Hospital Italiano de Buenos Aires
When you use linear regression you are modelling a response based on (in this case) two independent variables. The interaction test is not mandatory unless you think that the effect of each variable could be modified depending on the value of the other.
When you do that, almost all statistics programs tell you how well the model fits your data. If the inclusion of the interaction term doesn't change the fit of the model, it isn't important to include it. If the fit is better (let's say by about 10%), you should include it and study that interaction, perhaps in terms of some kind of stratification of the variables included in the regression.
3 Recommendations
Raid Amin
University of West Florida
If the interaction term is insignificant, but it contributes information, a deeper look at your experiment is needed.
 
 
If the interaction term AB is significant, I create an interaction plot of cell means, followed by testing contrasts based on the interaction plot. The significant interaction makes main effects meaningless to me. I will only discuss the results of the interaction plot then.
When discussing main effect A, we are discussing the differences in the mean response when comparing different levels of A, averaged over all levels of B. Having a significant interaction makes "averaging over all levels of B" contradictory to the definition of a main effect.
 
Don't forget to also check all assumptions for the model used here. If data are non-normal, consider transforming the data to normal scores (similar to what Van der Waerden proposed many years ago).
 
3 Recommendations
Jochen Wilhelm
University of Giessen
Bettina, first thank you for the positive feedback :)
I know too well that it is difficult to impossible to get a story published without p-values. Some days back a paper came back from reviewers who rejected it because "no statistics were performed" (I provided confidence intervals but no p-values and discussed the effects instead of the statistical significances). That's the unpleasant truth... I wish I could review the p-value-loaded manuscripts of these reviewers!
Lukasz wrote that "the interaction term is insignificant". This result is obtained by an ANOVA, comparing the residual variance of a model with interaction and a model without interaction, and this is exactly what the R function anova(modelWith, modelWithout) does. So he already did what you suggested. This is a standard procedure in the world of "null-hypothesis significance testing". To my understanding his problem was the interpretation of this result. But this is not answered by statistics but by expert judgement - either before the test [Neyman] (often disregarded because alpha is mindlessly set to 0.05 and beta is not controlled at all!) or after the test [Fisher] (also usually disregarded as "not objective").
I would still find it more important to inspect the actually estimated interaction. What does it look like? How strong is it? Is it reasonable? Is it relevant?
2 Recommendations
Ole Kudsk Jensen
Regional Hospital Silkeborg
The answer is simple: if you run the regression analysis in men and women separately, beta may vary. However, you have now shown that the estimate of beta in men is not statistically different from beta in women. That is, there is no interaction. However, the association with B is statistically significant, also when adjusted for gender.
Kind regards Ole Kudsk Jensen
 
 
Raid Amin
University of West Florida
It should be noted that many statisticians use a much larger significance level for the AB interaction F test than what they use for the main effects. The reason is to get a higher chance to detect existing interactions. 
What were the p-values for the three F tests that you have done here?
Was the p-value>0.15, say, for the AB interaction, or was it 0.07?
 
1 Recommendation
Yolande Tra
University of Maryland, Baltimore
When adding an interaction to the model, the p-values change because it is a new model, so B becomes non-significant. It was a good idea to test for the interaction to confirm that the effects of A and B are independent. A likelihood-ratio test between the model without interaction and the model with interaction could be performed to verify that the interaction term is not needed after all. If it is not significant, the interaction term is dropped, and you are back to the model without interaction, which is the one you use for inference.
2 Recommendations
Edward Rigdon
Georgia State University
Focus on incremental R2. Including an interaction term will enlarge the standard errors of the individual slopes due to collinearity with the main-effect predictors, but that collinearity does not affect R2.
Remember also that "not significantly different from 0" does not mean "equal to 0." Nonsignificance only means that the experiment lacked the statistical power to distinguish the value from 0. Thus, a nonsignificant p-value is not an argument for fixing a parameter at 0, as opposed to leaving it at the estimated value.
Very likely, the absence of "significant" interaction effects is more often due to weak research designs than to the actual absence of interaction effects.  Thus, the conclusion to be drawn from any one experiment ought to be "lacked the power to detect the effect," not "the effect does not exist."
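A numeric sketch of this point (simulated toy data, pure Python; names and effect sizes are invented): the incremental R2 from the product term is tiny here, while the factor that scales the sampling variance of the B slope grows, because A×B is collinear with B.

```python
import random

def xtx_solve(X, v):
    """Solve (X'X) b = v by Gauss-Jordan elimination."""
    n, k = len(X), len(X[0])
    M = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)] + [v[r]]
         for r in range(k)]
    for r in range(k):
        p = max(range(r, k), key=lambda i: abs(M[i][r]))
        M[r], M[p] = M[p], M[r]
        for i in range(k):
            if i != r:
                f = M[i][r] / M[r][r]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
    return [M[r][k] / M[r][r] for r in range(k)]

def fit_stats(X, y, j):
    """Return (R^2, [(X'X)^-1]_jj); the latter scales var(beta_j)."""
    n, k = len(X), len(X[0])
    b = xtx_solve(X, [sum(X[i][r] * y[i] for i in range(n)) for r in range(k)])
    resid = [yi - sum(x * c for x, c in zip(xi, b)) for xi, yi in zip(X, y)]
    ybar = sum(y) / n
    r2 = 1 - sum(e * e for e in resid) / sum((yi - ybar) ** 2 for yi in y)
    e_j = [1.0 if r == j else 0.0 for r in range(k)]
    return r2, xtx_solve(X, e_j)[j]

random.seed(2)
n = 60
A = [i % 2 for i in range(n)]
B = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * a + 0.7 * b + random.gauss(0, 1) for a, b in zip(A, B)]

# column index 2 is the B slope in both designs
r2_red,  v_red  = fit_stats([[1, a, b]        for a, b in zip(A, B)], y, 2)
r2_full, v_full = fit_stats([[1, a, b, a * b] for a, b in zip(A, B)], y, 2)
print(f"incremental R^2 = {r2_full - r2_red:.4f}")
print(f"variance factor for B slope: {v_red:.4f} -> {v_full:.4f}")
```

The diagonal of (X'X)^-1 for an existing regressor can only grow when a correlated column is added, which is exactly the standard-error inflation described above.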
2 Recommendations
Raid Amin
University of West Florida
Have you inspected the interaction plot visually, and if so, does it make any sense to you with respect to whatever your experiment is about? I mean, is it something that might occur or is it a side product of the regression that is meaningless to you?
 Adding an interaction term will:
1. Decrease the error degrees of freedom
2. Decrease the error sum of squares.
Including the interaction (not necessarily a "significant" interaction) will result in either a smaller or a larger mean square error. The tests for the individual regression coefficients depend on the MSE value in their denominators.
If the addition of the interaction term results in a larger MSE, then the tests for the individual regression coefficients may switch from significant (based on the smaller MSE) to nonsignificant (based on the larger MSE).
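These mechanics are easy to verify numerically (simulated toy data, pure Python; invented for illustration): the error df drops by one per added term, the error SS can only decrease, and the MSE - their ratio - may move either way.

```python
import random

def fit_rss(X, y):
    """Least squares via normal equations; returns the residual sum of squares."""
    n, k = len(X), len(X[0])
    M = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    for r in range(k):
        p = max(range(r, k), key=lambda i: abs(M[i][r]))
        M[r], M[p] = M[p], M[r]
        for i in range(k):
            if i != r:
                f = M[i][r] / M[r][r]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
    b = [M[r][k] / M[r][r] for r in range(k)]
    return sum((yi - sum(x * c for x, c in zip(xi, b))) ** 2
               for xi, yi in zip(X, y))

random.seed(3)
n = 30
A = [i % 2 for i in range(n)]
B = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * a + 0.4 * b + random.gauss(0, 1) for a, b in zip(A, B)]

# error df: n minus number of fitted coefficients
sse_red,  df_red  = fit_rss([[1, a, b]        for a, b in zip(A, B)], y), n - 3
sse_full, df_full = fit_rss([[1, a, b, a * b] for a, b in zip(A, B)], y), n - 4
print(f"MSE without interaction: {sse_red / df_red:.4f}")
print(f"MSE with interaction:    {sse_full / df_full:.4f}")
```

Whether MSE rises or falls depends on whether the drop in SSE outpaces the lost degree of freedom, which is the switch described above.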
 
Just my thoughts here.
 
2 Recommendations
Dillon Chrimes
University of Victoria
Stepwise regression would be a good start to find the magnitude of the unwanted variable interactions. Stepwise will show you the progression of significant variables by adding or removing each one sequentially in the linear model. If the interaction significantly undermines your linear regression model and you need to include it, then you can try variations of its covariate in the regression (multiply the variables together, for example). Covariates can be included in the model if this is significant and is part of what your linear model is testing.
Raid Amin
University of West Florida
Stepwise regression has many flaws, and statistics programs teach why not to use it. The F-tests are not independent of each other, and it does not take multicollinearity into account. I regularly illustrate to my classes with data sets why stepwise regression sometimes works well and sometimes does not.
 
 
3 Recommendations
Adelin Barbacci
French National Institute for Agriculture, Food, and Environment (INRAE)
Perhaps you can check two things:
1) In your regression, if B is not a reference measurement, try to use orthogonal regression instead of least squares.
2) If you compute p-values with an ANOVA (in the case of least-squares regression), be sure to use a type III ANOVA (and not type I).
 
 
André Achim
University of Quebec in Montreal
Remember that the test of each predictor in multiple regression is actually a test of the unique variance accounted for by the predictor AFTER all other predictors in the equation are taken into account. You probably have unequal group sizes causing a correlation between a main effect and the interaction. The result is that you can claim neither that the interaction has no effect nor that B has no effect. Each appears to have no effect only when the other term is present in the model.
Noel Artiles-Leon
University of Puerto Rico System
I guess your situation is similar to this one (see attached file)
Raid Amin
University of West Florida
I would not trust any statistical procedure that is applied to such a small sample. Here, n=7. 
Noel Artiles-Leon
University of Puerto Rico System
@Raid:
Do you know what is the size of the sample that Lukasz Dominik Kaczmarek has? I don't.
Regression analysis is a statistical procedure that has a solid mathematical foundation. I am still struggling to understand your statement. Consider, for example, the simple statistical procedure of computing an average. If I take the average of 4 numbers, how can I say "I do not trust the procedure of adding 4 numbers and dividing their total by 4 because the averaging procedure is applied to a small sample"?
1 Recommendation
Jochen Wilhelm
University of Giessen
Raid, why not? I mean, that is why CIs get larger with smaller sample sizes. The analysis should always tell us what we can reasonably expect (given the data), and for little data the uncertainty may be too large to draw confident conclusions. But effects may also be so large that they are clear even when the uncertainty is large. The analysis estimates both the effect size and the associated uncertainty, and they are interpreted in relation to each other. This is independent of the sample size. I think you get what I want to say... so: why not?
PS: In case you argue via the assumptions and the central limit theorem: (i) you would depreciate any statistical procedure (independent of the required assumptions and the theoretical correctness of the method), and (ii) even if this might be a real concern, then - statistically - the uncertainty is overestimated, so it is unlikely to draw wrong conclusions with high confidence in such cases.
1 Recommendation
Edward Rigdon
Georgia State University
I understand Raid's concern regarding the example that he posted. While we can use statistics to describe a particular sample of data, we often use statistics to make generalizations about the larger population from which the sample was drawn. With a small n (even an n much larger than n = 7), we know that these results will not replicate in another sample taken from the same population. Simulation research (perhaps most recently, the work of Dana and Dawes 2004) shows that, with small n, simple unit weights / equal weights would do a better job of predicting values for other members of the population than regression weights.
Yes, if all we want to do is describe the small sample of data in front of us, then this regression method is probably no worse than any other approach. On the other hand, small sample sizes make collinearity problems even worse, so we do have additional reason to be cautious. Our understanding of regression's behavior does rely on asymptotic properties, and we cannot expect those properties to hold at very small n.
Jochen Wilhelm
University of Giessen
I do not understand the collinearity issue. Why is it worse for small samples?
I also doubt that "our understanding of regression's behavior relies on asymptotic properties". I already addressed the point regarding the central limit theorem. All other "behaviours" are statistical (i.e. they are about expectations or, if you want, about long-run properties). This is independent of the sample size, or it applies equally for any sample size.
2 Recommendations
Noel Artiles-Leon
University of Puerto Rico System
What can I say with a mere 7 data points from a designed experiment?
  • It is extremely likely that sex has an effect on the response Y (or it will be a cold day in hell if the differences that we observe are caused just by chance).
  • It is likely that, as X increases, Y increases (for both sexes).
  • There is not enough evidence to suggest that the rate of increase in Y as X increases is different for males and females.
1 Recommendation
M. T. Bradley
University of New Brunswick, Saint John Campus
I much admire the answer by Jochen Wilhelm. In science, if you have some but an incomplete idea of a more general population, Fisher's model is appropriate. The Neyman-Pearson model finds a home in technology, where you do not care to generalize because your comparison population is specified exactly. Theory, purpose, and quality of measurement are the important elements of science. The inferential test approach by Fisher suggests whether you wish to pursue the particular hypothesis or not. Neyman and Pearson's alpha-beta error model found good use in quality control for manufacturing, where all the input elements are controlled. For complex reasons, it ended up being applied in science, where there are too many unknown influences to talk about the parent distributions. The N-P model does not belong in science.
5 Recommendations
Felipe Corchs
University of São Paulo
Dear all, I know this thread is old, but I kind of have the opposite situation here. I'm using baseline scores (continuous) and treatment group (experimental or control) as predictors of the outcome (post-intervention scores, continuous). Before the interaction is included, the baseline score is significant (p < .001), but treatment is not (.144). After including the interaction in the model, the interaction is not significant (.372), but treatment group becomes significant (.03) and baseline remains so (.012). Adjusted R-squared is a bit higher for the interaction model. In my case, the interaction model is important for finding out whether group effects were dependent on baseline. How do I interpret this? Can I choose the model with the interaction and assume from it that both baseline and group are significant predictors? In that case, can I use the coefficients from this model? Or would it be better to use the model without the interaction to extract the effects (coefficients, p-values, etc.) of baseline and group, and use the model with the interaction only to report that no interaction was observed? Kind of lost here, any comment highly appreciated. Best, Felipe
Jochen Wilhelm
University of Giessen
"Can I ... assume from it that both ... are significant predictors?"
You have a severe misconception here. It's not about whether the predictors are significant. It's about whether the data are significant under your (restricted) model. Significance does not tell you anything about the predictor. It only tells you whether your data are sufficient to give you a sufficiently good "signal-to-noise ratio" in your analysis to interpret the sign of the coefficient(s) in the given model.
If you want to see whether you can interpret the sign of the differential effect (of the treatment, depending on the pre-score), then you must include the interaction. If you have to assume that the treatment effect might depend on the pre-score, you also must include the interaction, even when you don't need to interpret the sign of this differential effect. Note that the simple treatment effect in your model with interaction refers to a pre-score of 0 (it may be that this score value is not even in the range of possible scores, or of the scores you observed). It might thus be wise to center the pre-scores to make the coefficients more meaningful (e.g. as the treatment effect at the average pre-score value).
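A noise-free toy example (pure Python; the numbers are invented) shows what centering does to the coefficient: with the raw pre-score the treatment coefficient is the group difference at pre = 0, and after centering it is the group difference at the average pre-score.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    n, k = len(X), len(X[0])
    M = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    for r in range(k):
        p = max(range(r, k), key=lambda i: abs(M[i][r]))
        M[r], M[p] = M[p], M[r]
        for i in range(k):
            if i != r:
                f = M[i][r] / M[r][r]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
    return [M[r][k] / M[r][r] for r in range(k)]

pre   = [10, 12, 14, 16, 18] * 2            # pre-scores, far from 0
group = [0] * 5 + [1] * 5                   # control / treatment
# exact generating model: treatment effect 1.5 at pre = 0, interaction 0.2
y = [2 + 1.5 * g + 0.6 * p + 0.2 * g * p for g, p in zip(group, pre)]

m = sum(pre) / len(pre)                     # mean pre-score (= 14)
b_raw  = ols([[1, g, p,     g * p]       for g, p in zip(group, pre)], y)
b_cent = ols([[1, g, p - m, g * (p - m)] for g, p in zip(group, pre)], y)
print(f"treatment coefficient, raw pre:      {b_raw[1]:.3f}")   # -> 1.500
print(f"treatment coefficient, centered pre: {b_cent[1]:.3f}")  # -> 1.5 + 0.2*14 = 4.300
```

Both fits describe the same surface; centering only moves the point at which the "simple" treatment effect is read off.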
Felipe Corchs
University of São Paulo
Thanks @Jochen. How would you suggest I interpret these findings?
Jochen Wilhelm
University of Giessen
What model you should use depends on your interests and on what you, as the expert in the subject matter, think an appropriate model is. You should first be clear about that. If you then have your model, it should not be a big problem to interpret it (otherwise, why did you use a model that you could not interpret?).
1 Recommendation
André Achim
University of Quebec in Montreal
A problem with the model selected for the analyses is that group is irrelevant at pretest. Therefore, group differences are expected to be present only at post-test. In terms of analysis of variance (ANOVA), a real effect of the treatment is thus expected to drive both a group effect (on average of pre and post) and an interaction effect (on the post-pre difference). Unless the groups already differed at baseline (you should verify the likelihood of that), it does not make sense that only one of these effects is actually present, although it is quite possible that only one is statistically significant.
Given that the groups are defined by a treatment, the direction of its effect on the measurement must be predicted a priori. A one-tailed test is likely to apply here, as you would not declare the treatment useful if it actually degraded performance. Thus, your group effect in the analysis without the interaction would have, under the null hypothesis, a probability of .072 or .928, depending on whether the experimental group did better on average than the control group. Similarly, the p-value of the group effect could rather be .015 if the direction is as expected.
In principle, a one-tailed test can also apply for the interaction, which brings up the question of how group was coded. If you coded group as 0 or 1, the interaction term is a score of 0 for the control group and whatever the score was at baseline for the experimental group (assuming 1 for this group). Then, the sign of the weight applied to the interaction term might depend on the direction of the group trend at baseline.
This brings us to recognise that the non-significant interaction and the reduced baseline effect are a consequence of this partial redundancy between these two sources of variance (each effect is tested after the other is taken into account). They would be more independent, and you would see much better what is happening, if you scored the two groups -1 and +1 respectively. Then, assuming that the treatment is expected to increase the score and that the experimental group is coded +1, the interaction term is expected to have a positive coefficient for a proper one-tailed test.
Finally, a repeated-measures ANOVA would tell you whether the change score (i.e. the interaction effect) differs between the groups. If your groups tended to differ at baseline, this is a safer test than using baseline as a covariate in a repeated-measures design.
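The partial redundancy between the coding schemes is easy to check directly (toy numbers, pure Python, invented for illustration): with 0/1 coding the interaction column correlates strongly with the centered baseline, while with -1/+1 coding the correlation vanishes.

```python
import math

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

baseline = [-2, -1, 0, 1, 2] * 2          # centered baseline, same in both groups
g01 = [0] * 5 + [1] * 5                   # 0/1 group coding
gpm = [-1] * 5 + [1] * 5                  # -1/+1 group coding

r01 = corr(baseline, [g * b for g, b in zip(g01, baseline)])
rpm = corr(baseline, [g * b for g, b in zip(gpm, baseline)])
print(f"corr(baseline, interaction), 0/1 coding:   {r01:.3f}")   # -> 0.707
print(f"corr(baseline, interaction), -1/+1 coding: {rpm:.3f}")   # -> 0.000
```

With the -1/+1 coding, the interaction column is orthogonal to the baseline column in this balanced design, so the two effects no longer compete for the same variance.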
1 Recommendation
Felipe Corchs
University of São Paulo
Thanks so much, @Andre. I'll look into these comments! Best!
Sohaib Hayat
Institute of Business Administration Karachi
Will a linear regression always try to pass the line through the maximum number of points?
