Automatic testing of all possible multiple
regression models given a set of predictors
12. March 2015
Emil O. W. Kirkegaard
Published in The Winnower. https://thewinnower.com/papers/automatic-testing-of-all-possible-
multiple-regression-models-given-a-set-of-predictors
Abstract
Researcher choice in the reporting of regression models allows for questionable research practices. Here I
present a function for R that reports all possible regression models given a set of predictors and a
dependent variable. I illustrate this function on two artificial datasets.
Introduction
In an email to L.J. Zigerell, I wrote:
I had a look at your new paper here: rap.sagepub.com/content/2/1/2053168015570996
[Zigerell, 2015]
Generally, I agree about the problem. Pre-registration or always reporting all comparisons
are the obvious solutions. Personally, I will try to pre-register all my survey-type studies
from now on. The second is problematic in that there are sometimes quite a lot of ways to
test a given hypothesis or estimate an effect with a dataset; your paper mentions a few. In
many cases, it may not be easy for the researcher to report all of them by doing the analyses
manually. I see two solutions: 1) tools for making automatic comparisons of all test
methods, and 2) sampling of test methods. The first is preferable if it can be done, and
when it cannot, one can fall back to the second. This is not a complete fix because there
may be ways to estimate an effect using a dataset that the researcher did not even think of.
If the dataset is not open, there is no way for others to conduct such tests. Open data is
necessary.
I have one idea for how to make an automatic comparison of all possible ways to analyze
data. In your working paper, you report only two regression models (Table 1): one
controlling for 3 variables and one controlling for 6 variables in multiple regression (MR).
However, given the choice of these 6 variables, there are 2^6-1 (63) ways to run the MR
analysis (the 64th is the empty model, which I suppose could be used to estimate the
intercept but nothing else). You could have tried all of them, and reported only the ones that
gave the result you wanted (in line with the argument in your paper above). I’m not saying
you did this of course, but it is possible.
There is researcher degree of freedom in choosing which models to report. There is a reason
for this too: normally people run these models manually, and running all 63 models by hand
(hard-coding them or pointing and clicking) would take a while; the results would also take
up a lot of space if reported in the usual table format.
You can perhaps see where I’m going with this. One can make a function that takes as input
the dependent variable, the set of independent variables and the dataset, and then returns all
the results for all possible ways to include these into MRs. One can then calculate
descriptive statistics (mean, median, range, SD) for the effect sizes (std. betas) of the
different variables to see how stable they are depending on which other variables are
included. This would be a great way to combat the problem of which models to report when
using MR, I think. Better, one can plot the results in a visually informative and appealing
way.
With a nice interactive figure, one can also make it possible for users to try all of them, or
see results for only specific groups of models.
I have now written the code for this. I tested it with two cases, a simple and a complicated one.
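The actual function is in the psych2 repository (see the Source code section below). A minimal sketch of the idea, with illustrative function and argument names and assuming all columns of the data frame are numeric, could look like this:

all_models_beta = function(dependent, predictors, data) {
  # standardize so that the coefficients are standardized betas
  data = as.data.frame(scale(data))
  # all 2^n - 1 non-empty predictor subsets
  subsets = unlist(lapply(seq_along(predictors),
                          function(k) combn(predictors, k, simplify = FALSE)),
                   recursive = FALSE)
  # beta matrix: one row per model, one column per predictor, NA when excluded
  betas = matrix(NA, nrow = length(subsets), ncol = length(predictors),
                 dimnames = list(NULL, predictors))
  for (i in seq_along(subsets)) {
    fit = lm(reformulate(subsets[[i]], response = dependent), data = data)
    betas[i, subsets[[i]]] = coef(fit)[subsets[[i]]]
  }
  as.data.frame(betas)
}

Applied to the datasets below, such a function returns beta matrices like the ones reported (the exact values depend on the random draw).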
Simple case
In the simple case, we have three variables:
a = (normally distributed) noise
b = noise
y = a+b
Then I standardized the data so that betas from regressions are standardized betas. Correlation matrix:
a b y
a 1.00 0.01 0.70
b 0.01 1.00 0.72
y 0.70 0.72 1.00
The small departure from the expected values is due to sampling error (n=1000 in these simulations). The beta
matrix is:
model # a b
1 0.7 NA
2 NA 0.72
3 0.7 0.71
We see the expected results. Each variable's correlation with y alone is the same as its beta in the joint
model because a and b are nearly uncorrelated with each other (r=.01) and correlate with y at about the
same level (r=.70 and .72). The correlations and betas are around .71 because y = a+b with independent,
unit-variance components, so cor(a, y) = 1/sqrt(2) ≈ .71.
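For concreteness, the simple case can be simulated along these lines (a sketch; the seed is arbitrary, and all_models_beta is the illustrative sketch from above rather than the repository function):

set.seed(1)                      # arbitrary seed; the original seed is not stated
n = 1000
a = rnorm(n)                     # noise
b = rnorm(n)                     # noise
y = a + b
d_simple = as.data.frame(scale(data.frame(a, b, y)))   # standardize all variables
round(cor(d_simple), 2)                                 # correlation matrix
round(all_models_beta("y", c("a", "b"), d_simple), 2)   # the 3-model beta matrix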
Complicated case
In this case we have 6 predictors some of which are correlated.
a = noise
b = noise
c = noise
d = c+noise
e = a+b+noise
f = .3*a+.7*c+noise
y = .2*a+.3*b+.5*c
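The same kind of sketch generates the complicated case (again with an arbitrary seed and the illustrative all_models_beta from above):

set.seed(1)                               # arbitrary seed
n = 1000
a = rnorm(n); b = rnorm(n); c = rnorm(n)  # independent noise
d = c + rnorm(n)                          # derived from c
e = a + b + rnorm(n)                      # derived from a and b
f = .3 * a + .7 * c + rnorm(n)            # derived from a and c
y = .2 * a + .3 * b + .5 * c              # y is caused by a, b and c only
d_complex = as.data.frame(scale(data.frame(a, b, c, d, e, f, y)))
betas = all_models_beta("y", c("a", "b", "c", "d", "e", "f"), d_complex)
nrow(betas)                               # 63 models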
Correlation matrix:
a b c d e f y
a 1.00 0.00 0.00 -0.04 0.58 0.24 0.31
b 0.00 1.00 0.00 -0.05 0.57 -0.02 0.47
c 0.00 0.00 1.00 0.72 0.01 0.52 0.82
d -0.04 -0.05 0.72 1.00 -0.04 0.37 0.56
e 0.58 0.57 0.01 -0.04 1.00 0.13 0.46
f 0.24 -0.02 0.52 0.37 0.13 1.00 0.50
y 0.31 0.47 0.82 0.56 0.46 0.50 1.00
And the beta matrix is:
model # a b c d e f
1 0.31 NA NA NA NA NA
2 NA 0.47 NA NA NA NA
3 NA NA 0.82 NA NA NA
4 NA NA NA 0.56 NA NA
5 NA NA NA NA 0.46 NA
6 NA NA NA NA NA 0.50
7 0.31 0.47 NA NA NA NA
8 0.32 NA 0.83 NA NA NA
9 0.34 NA NA 0.57 NA NA
10 0.07 NA NA NA 0.41 NA
11 0.21 NA NA NA NA 0.45
12 NA 0.47 0.83 NA NA NA
13 NA 0.50 NA 0.58 NA NA
14 NA 0.31 NA NA 0.28 NA
15 NA 0.48 NA NA NA 0.51
16 NA NA 0.87 -0.07 NA NA
17 NA NA 0.82 NA 0.45 NA
18 NA NA 0.78 NA NA 0.09
19 NA NA NA 0.58 0.48 NA
20 NA NA NA 0.43 NA 0.34
21 NA NA NA NA 0.40 0.45
22 0.31 0.47 0.83 NA NA NA
23 0.34 0.49 NA 0.59 NA NA
24 0.29 0.45 NA NA 0.03 NA
25 0.20 0.47 NA NA NA 0.46
26 0.31 NA 0.86 -0.04 NA NA
27 0.09 NA 0.82 NA 0.40 NA
28 0.32 NA 0.83 NA NA -0.01
29 0.09 NA NA 0.58 0.43 NA
30 0.27 NA NA 0.47 NA 0.26
31 -0.04 NA NA NA 0.42 0.46
32 NA 0.47 0.84 -0.02 NA NA
33 NA 0.32 0.82 NA 0.27 NA
34 NA 0.47 0.77 NA NA 0.10
35 NA 0.33 NA 0.59 0.30 NA
36 NA 0.50 NA 0.45 NA 0.34
37 NA 0.37 NA NA 0.18 0.48
38 NA NA 0.84 -0.02 0.45 NA
39 NA NA 0.82 -0.07 NA 0.09
40 NA NA 0.81 NA 0.45 0.02
41 NA NA NA 0.48 0.44 0.27
42 0.31 0.47 0.83 0.00 NA NA
43 0.31 0.47 0.83 NA 0.00 NA
44 0.31 0.47 0.83 NA NA 0.00
45 0.32 0.48 NA 0.59 0.02 NA
46 0.27 0.49 NA 0.49 NA 0.26
47 0.18 0.46 NA NA 0.03 0.46
48 0.09 NA 0.83 -0.02 0.40 NA
49 0.32 NA 0.86 -0.04 NA -0.01
50 0.09 NA 0.82 NA 0.40 0.00
51 0.02 NA NA 0.48 0.43 0.26
52 NA 0.32 0.83 -0.01 0.27 NA
53 NA 0.47 0.79 -0.02 NA 0.10
54 NA 0.33 0.79 NA 0.26 0.06
55 NA 0.36 NA 0.47 0.23 0.30
56 NA NA 0.83 -0.02 0.44 0.02
57 0.31 0.47 0.83 0.00 0.00 NA
58 0.31 0.47 0.83 0.00 NA 0.00
59 0.31 0.47 0.83 NA 0.00 0.00
60 0.26 0.48 NA 0.49 0.02 0.26
61 0.09 NA 0.84 -0.02 0.40 0.00
62 NA 0.33 0.80 -0.01 0.26 0.06
63 0.31 0.47 0.83 0.00 0.00 0.00
So we see that in the full model (model 63) the betas of the true causes (a, b, c) are essentially the same
as their correlations with y, while the derived variables (d, e, f) get betas of 0 although they have
positive correlations with y. In other words, MR is telling us that these variables have zero effect when
taking into account a, b and c, and each other, which is true.
By inspecting the matrix, we can see how one can be an unscrupulous researcher by exploiting
researcher freedom (Simmons et al., 2011). If one likes the variable d, one can try all these models (or just a
few of them manually), and then selectively report the ones that give strong betas. In this case model 60
looks like a good choice, since it controls for a lot yet still produces a strong beta for the favored
variable. Or perhaps choose model 45, in which beta=.59. Then one can plausibly write something like
this in the discussion section:
Prior research and initial correlations indicated that d may be a potent explanatory factor of y, r=.56
[significant test statistic]. After controlling for a, b and e, the effect was mostly unchanged, beta=.59
[significant test statistic]. However, after controlling for f as well, the effect was somewhat attenuated
but still substantial, beta=.49 [significant test statistic]. The findings thus support theory T about the
importance of d in understanding y.
One can play the game the other way too. If one dislikes a, one might focus on the models that find a
weak effect of a, such as model 31, where the beta is -.04 [non-significant test statistic] when controlling
for e and f.
It can also be useful to examine the descriptive statistics of the beta matrix:
var # n mean sd median trimmed mad min max range skew kurtosis se
a 1 32 0.24 0.11 0.31 0.25 0.03 -0.04 0.34 0.38 -0.96 -0.59 0.02
b 2 32 0.44 0.06 0.47 0.45 0.01 0.31 0.50 0.19 -1.10 -0.57 0.01
c 3 32 0.82 0.02 0.83 0.82 0.01 0.77 0.87 0.10 -0.44 0.88 0.00
d 4 32 0.25 0.28 0.22 0.25 0.38 -0.07 0.59 0.66 0.05 -1.98 0.05
e 5 32 0.28 0.17 0.35 0.29 0.14 0.00 0.48 0.48 -0.59 -1.29 0.03
f 6 32 0.21 0.19 0.18 0.20 0.26 -0.01 0.51 0.52 0.27 -1.57 0.03
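The column layout above matches the output of describe() from the psych package, which is presumably how the table was produced; from the beta matrix of the sketches above it can be reproduced with something like:

library(psych)
# descriptive statistics of each predictor's betas across the 63 models;
# the NAs (models excluding the predictor) are dropped, leaving n = 32 per predictor
round(describe(betas), 2)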
The range and sd (standard deviation) are useful as measures of how much the effect size of a variable
varies from model to model. We see that among the true causes (a, b, c) the sd and range are smaller
than among the non-causal correlates (d, e, f). Among the true causes, the weaker causes have larger
ranges and sds. Perhaps one can find a way to adjust for this to get an effect-size-independent measure
of how much the beta varies from model to model. The mad (median absolute deviation, a robust
alternative to the sd) looks like a very promising candidate for detecting the true causal variables. It is
very low for the true causes (.03, .01 and .01) and at least 4.67 times larger for the non-causal correlates
(the smallest non-causal mad is .14, versus .03 for the largest causal mad).
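This kind of check is easy to script. For instance, one can rank the predictors by the mad of their betas across models (a quick illustration; this ranking is not part of the function itself):

# spread of each predictor's betas across the 63 models; smaller = more stable
sort(round(apply(betas, 2, mad, na.rm = TRUE), 2))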
In any case, I hope to have shown how researchers can use freedom in model choice and reporting to
inflate or deflate the effect sizes of variables they like or dislike. There are two ways to deal with this:
researchers should report the betas from all possible models with the available variables in a study (this
table can get large, because there are 2^n-1 models, where n is the number of variables), e.g. using my
function or an equivalent one; or, better, the data should be made available for reanalysis by others.
Source code
The function is available from the psych2 repository on GitHub. The R code for this paper is available
in the Open Science Framework repository.
References
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed
flexibility in data collection and analysis allows presenting anything as significant. Psychological
Science, 22(11), 1359–1366.
Zigerell, L. J. (2015). Inferential selection bias in a study of racial bias: Revisiting ‘Working twice as
hard to get half as far’. Research & Politics, 2(1), 2053168015570996.