Automatic testing of all possible multiple regression models given a set of predictors
12 March 2015
Emil O. W. Kirkegaard
Published in The Winnower. https://thewinnower.com/papers/automatic-testing-of-all-possible-multiple-regression-models-given-a-set-of-predictors
Abstract
Researcher choice in reporting of regression models allows for questionable research practices. Here I present a function for R that reports all possible regression models given a set of predictors and a dependent variable. I illustrate this function on two datasets of artificial data.
Introduction
In an email to L. J. Zigerell I wrote:
I had a look at your new paper here: rap.sagepub.com/content/2/1/2053168015570996
[Zigerell, 2015]
Generally, I agree about the problem. Pre-registration or always reporting all comparisons
are the obvious solutions. Personally, I will try to pre-register all my survey-type studies
from now on. The second is problematic in that there are sometimes quite a lot of ways to test a given hypothesis or estimate an effect with a dataset; your paper mentions a few. In
many cases, it may not be easy for the researcher to report all of them by doing the analyses
manually. I see two solutions: 1) tools for making automatic comparisons of all test
methods, and 2) sampling of test methods. The first is preferable if it can be done, and
when it cannot, one can fall back to the second. This is not a complete fix because there
may be ways to estimate an effect using a dataset that the researcher did not even think of.
If the dataset is not open, there is no way for others to conduct such tests. Open data is
necessary.
I have one idea for how to make an automatic comparison of all possible ways to analyze
data. In your working paper, you report only two regression models (Table 1): one controlling for 3 variables and one for 6 variables in MR. However, given the choice of these 6 variables, there are 2^6-1 (63) ways to run the MR analysis (the 64th is the empty model, which I guess could be used to estimate the intercept but nothing else). You could have tried
all of them, and reported only the ones that gave the result you wanted (in line with the
argument in your paper above). I’m not saying you did this of course, but it is possible.
There is researcher degree of freedom about which models to report. There is a reason for this too: normally people run these models manually, and running all 63 models would take a while to do by hand (hard-coding them or pointing and clicking), and they would also take up a lot of space to report in the usual table format.
You can perhaps see where I’m going with this. One can make a function that takes as input
the dependent variable, the set of independent variables, and the dataset, and then returns the results for all possible ways to include them in MRs. One can then calculate
descriptive statistics (mean, median, range, SD) for the effect sizes (std. betas) of the
different variables to see how stable they are depending on which other variables are
included. This would be a great way to combat the problem of which models to report when
using MR, I think. Better, one can plot the results in a visually informative and appealing
way.
With a nice interactive figure, one can also make it possible for users to try all of them, or
see results for only specific groups of models.
I have now written the code for this. I tested it with two cases, a simple and a complicated one.
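The function itself is in the psych2 repository (see Source code below). To convey the idea, here is a minimal sketch of such a function; the name all_subset_betas and its arguments are my own illustration, not the repository's interface. It loops over every non-empty subset of the predictors, fits a linear model for each, and collects the coefficients into one matrix (assuming the data are already standardized, so the coefficients are standardized betas):

# Sketch: betas from all 2^k - 1 non-empty predictor subsets
all_subset_betas <- function(data, dependent, predictors) {
  k <- length(predictors)
  # every include/exclude pattern for the k predictors; drop the empty model
  subsets <- expand.grid(rep(list(c(FALSE, TRUE)), k))[-1, , drop = FALSE]
  betas <- matrix(NA_real_, nrow = nrow(subsets), ncol = k,
                  dimnames = list(NULL, predictors))
  for (i in seq_len(nrow(subsets))) {
    vars <- predictors[unlist(subsets[i, ])]
    fit <- lm(reformulate(vars, response = dependent), data = data)
    betas[i, vars] <- coef(fit)[vars]
  }
  as.data.frame(betas)
}

Called as all_subset_betas(df, "y", c("a", "b")), it returns a beta matrix like the ones shown below, though the ordering of the models may differ.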
Simple case
In the simple case, we have three variables:
a = (normally distributed) noise
b = noise
y = a+b
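As a rough illustration, data like this can be simulated in R as follows (the exact code used for the paper is in the OSF repository linked under Source code; the seed is arbitrary, so exact values will differ slightly):

set.seed(1)
n <- 1000
a <- rnorm(n)  # noise
b <- rnorm(n)  # noise
y <- a + b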
Then I standardized the data so that betas from regressions are standardized betas. Correlation matrix:
a b y
a 1.00 0.01 0.70
b 0.01 1.00 0.72
y 0.70 0.72 1.00
The small departure from the expected values is due to sampling error (n=1000 in these simulations). The beta
matrix is:
model # a    b
1       0.70 NA
2       NA   0.72
3       0.70 0.71
We see the expected results. Each variable's beta in the joint model is essentially the same as its zero-order correlation because a and b are nearly uncorrelated (r=.01) and correlate with y at about the same strength (r=.70 and .72). Both the correlations and the betas are around .71 because, with a and b being independent unit-variance noise, y = a + b has variance 2, so cor(a, y) = cor(b, y) = 1/sqrt(2) ≈ .71.
Complicated case
In this case, we have 6 predictors, some of which are correlated:
a = noise
b = noise
c = noise
d = c+noise
e = a+b+noise
f = .3*a+.7*c+noise
y = .2*a + .3*b + .5*c
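Again, a rough sketch of a simulation along these lines (illustrative only; the paper's own code is in the OSF repository):

set.seed(1)
n <- 1000
a <- rnorm(n); b <- rnorm(n); c <- rnorm(n)  # independent noise
d <- c + rnorm(n)
e <- a + b + rnorm(n)
f <- .3 * a + .7 * c + rnorm(n)
y <- .2 * a + .3 * b + .5 * c
df2 <- as.data.frame(scale(cbind(a, b, c, d, e, f, y)))  # standardize as before
betas <- all_subset_betas(df2, "y", c("a", "b", "c", "d", "e", "f"))  # all 63 models, using the sketch above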
Correlation matrix:
a b c d e f y
a 1.00 0.00 0.00 -0.04 0.58 0.24 0.31
b 0.00 1.00 0.00 -0.05 0.57 -0.02 0.47
c 0.00 0.00 1.00 0.72 0.01 0.52 0.82
d -0.04 -0.05 0.72 1.00 -0.04 0.37 0.56
e 0.58 0.57 0.01 -0.04 1.00 0.13 0.46
f 0.24 -0.02 0.52 0.37 0.13 1.00 0.50
y 0.31 0.47 0.82 0.56 0.46 0.50 1.00
And the beta matrix is:
model # a b c d e f
1 0.31 NA NA NA NA NA
2 NA 0.47 NA NA NA NA
3 NA NA 0.82 NA NA NA
4 NA NA NA 0.56 NA NA
5 NA NA NA NA 0.46 NA
6 NA NA NA NA NA 0.50
7 0.31 0.47 NA NA NA NA
8 0.32 NA 0.83 NA NA NA
9 0.34 NA NA 0.57 NA NA
10 0.07 NA NA NA 0.41 NA
11 0.21 NA NA NA NA 0.45
12 NA 0.47 0.83 NA NA NA
13 NA 0.50 NA 0.58 NA NA
14 NA 0.31 NA NA 0.28 NA
15 NA 0.48 NA NA NA 0.51
16 NA NA 0.87 -0.07 NA NA
17 NA NA 0.82 NA 0.45 NA
18 NA NA 0.78 NA NA 0.09
19 NA NA NA 0.58 0.48 NA
20 NA NA NA 0.43 NA 0.34
21 NA NA NA NA 0.40 0.45
22 0.31 0.47 0.83 NA NA NA
23 0.34 0.49 NA 0.59 NA NA
24 0.29 0.45 NA NA 0.03 NA
25 0.20 0.47 NA NA NA 0.46
26 0.31 NA 0.86 -0.04 NA NA
27 0.09 NA 0.82 NA 0.40 NA
28 0.32 NA 0.83 NA NA -0.01
29 0.09 NA NA 0.58 0.43 NA
30 0.27 NA NA 0.47 NA 0.26
31 -0.04 NA NA NA 0.42 0.46
32 NA 0.47 0.84 -0.02 NA NA
33 NA 0.32 0.82 NA 0.27 NA
34 NA 0.47 0.77 NA NA 0.10
35 NA 0.33 NA 0.59 0.30 NA
36 NA 0.50 NA 0.45 NA 0.34
37 NA 0.37 NA NA 0.18 0.48
38 NA NA 0.84 -0.02 0.45 NA
39 NA NA 0.82 -0.07 NA 0.09
40 NA NA 0.81 NA 0.45 0.02
41 NA NA NA 0.48 0.44 0.27
42 0.31 0.47 0.83 0.00 NA NA
43 0.31 0.47 0.83 NA 0.00 NA
44 0.31 0.47 0.83 NA NA 0.00
45 0.32 0.48 NA 0.59 0.02 NA
46 0.27 0.49 NA 0.49 NA 0.26
47 0.18 0.46 NA NA 0.03 0.46
48 0.09 NA 0.83 -0.02 0.40 NA
49 0.32 NA 0.86 -0.04 NA -0.01
50 0.09 NA 0.82 NA 0.40 0.00
51 0.02 NA NA 0.48 0.43 0.26
52 NA 0.32 0.83 -0.01 0.27 NA
53 NA 0.47 0.79 -0.02 NA 0.10
54 NA 0.33 0.79 NA 0.26 0.06
55 NA 0.36 NA 0.47 0.23 0.30
56 NA NA 0.83 -0.02 0.44 0.02
57 0.31 0.47 0.83 0.00 0.00 NA
58 0.31 0.47 0.83 0.00 NA 0.00
59 0.31 0.47 0.83 NA 0.00 0.00
60 0.26 0.48 NA 0.49 0.02 0.26
61 0.09 NA 0.84 -0.02 0.40 0.00
62 NA 0.33 0.80 -0.01 0.26 0.06
63 0.31 0.47 0.83 0.00 0.00 0.00
So we see that in the full model (model 63) the true causes a, b and c get betas equal to their zero-order correlations, while the derived variables d, e and f get betas of 0 even though they have positive correlations with y. In other words, MR is telling us that these variables have zero effect once a, b, c and each other are taken into account, which is true.
By inspecting the matrix, we can see how an unscrupulous researcher could exploit researcher degrees of freedom (Simmons et al., 2011). If one likes the variable d, one can try all these models (or just a few of them manually), and then selectively report the ones that give strong betas. In this case model 60
few of them manually), and then selectively report the ones that give strong betas. In this case model 60
looks like a good choice, since it controls for a lot yet still produces a strong beta for the favored
variable. Or perhaps choose model 45, in which beta=.59. Then one can plausibly write in the
discussion section something like this:
Prior research and initial correlations indicated that d may be a potent explanatory factor of y, r=.56
[significant test statistic]. After controlling for a, b and e, the effect was mostly unchanged, beta=.59
[significant test statistic]. However, after controlling for f as well, the effect was somewhat attenuated
but still substantial, beta=.49 [significant test statistic]. The findings thus support theory T about the
importance of d in understanding y.
One can play the game the other way too. If one dislikes a, one might focus on the models that find a
weak effect of a, such as 31, where the beta is -.04 [non-significant test statistic] when controlling for e
and f.
It can also be useful to examine the descriptive statistics of the beta matrix:
var # n mean sd median trimmed mad min max range skew kurtosis se
a 1 32 0.24 0.11 0.31 0.25 0.03 -0.04 0.34 0.38 -0.96 -0.59 0.02
b 2 32 0.44 0.06 0.47 0.45 0.01 0.31 0.50 0.19 -1.10 -0.57 0.01
c 3 32 0.82 0.02 0.83 0.82 0.01 0.77 0.87 0.10 -0.44 0.88 0.00
d 4 32 0.25 0.28 0.22 0.25 0.38 -0.07 0.59 0.66 0.05 -1.98 0.05
e 5 32 0.28 0.17 0.35 0.29 0.14 0.00 0.48 0.48 -0.59 -1.29 0.03
f 6 32 0.21 0.19 0.18 0.20 0.26 -0.01 0.51 0.52 0.27 -1.57 0.03
The range and sd (standard deviation) are useful as measures of how much a variable's effect size varies from model to model. We see that among the true causes (a, b, c) the sd and range are smaller than among the non-causal correlates (d, e, f). Among the true causes, the weaker causes have larger ranges and sds. Perhaps one can find a way to adjust for this to get an effect-size-independent measure of how much the beta varies from model to model. The mad (median absolute deviation, a robust alternative to the sd) looks like a very promising candidate for detecting the true causal variables. It is very low for the true causes (.03, .01 and .01), and the smallest mad among the non-causal correlates (.14) is at least 4.67 times larger than the largest among the true causes (.03).
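The column layout of this table matches the output of psych::describe(); assuming the beta matrix is stored in a data frame called betas, as in the sketch above, it can be produced with:

library(psych)
# per-predictor statistics across all models; NAs (models omitting the
# predictor) are ignored, so n is the number of models including it (2^5 = 32)
describe(betas)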
In any case, I hope to have shown how researchers can use freedom in model selection and reporting to inflate or deflate the effect sizes of variables they like or dislike. There are two ways to deal with this: researchers must report the betas from all possible models with the available variables (this table can get large, since there are 2^n-1 models where n is the number of predictors), e.g. using my function or an equivalent one, or, better, the data must be made available for reanalysis by others.
Source code
The function is available from the psych2 repository on GitHub. The R code for this paper is available in the Open Science Framework repository.
References
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Zigerell, L. J. (2015). Inferential selection bias in a study of racial bias: Revisiting ‘Working twice as hard to get half as far’. Research & Politics, 2(1), 2053168015570996.