The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology
Nicholas J. L. Brown (*)
University Medical Center, University of Groningen, The Netherlands
James A. J. Heathers
Division of Cardiology and Intensive Therapy, Poznań University of Medical Sciences
University of Sydney
(*) Corresponding author. E-mail:
Nick Brown is a PhD candidate at the University Medical Center, University of Groningen, The Netherlands.
James Heathers conducted the bulk of the work described in the attached document while a postdoctoral fellow at the Poznań University of Medical Sciences in Poland. He is currently a postdoctoral fellow at Northeastern University.
Acknowledgements
The authors wish to thank Tim Bates and Chris Chambers for their helpful comments on an earlier draft of this article, as well as those authors of articles that we examined who kindly provided their data sets and help with the reanalysis of these.
Abstract
We present a simple mathematical technique that we call GRIM (Granularity-Related Inconsistency of Means) for verifying the summary statistics of research reports in psychology. This technique evaluates whether the reported means of integer data such as Likert-type scales are consistent with the given sample size and number of items. We tested this technique with a sample of 260 recent empirical articles in leading journals. Of the articles that we could test with the GRIM technique (N=71), around half (N=36) appeared to contain at least one inconsistent mean, and more than 20% (N=16) contained multiple such inconsistencies. We requested the data sets corresponding to 21 of these articles, receiving positive responses in nine cases. We confirmed the presence of at least one reporting error in all cases, with three articles requiring extensive corrections. The implications for the reliability and replicability of empirical psychology are discussed.
Consider the following (fictional) extract from a recent article in the Journal of Porcine Aviation Potential:
Participants (N=55) were randomly assigned to drink 200ml of water that either contained (experimental condition, N=28) or did not contain (control condition, N=27) 17g of cherry flavor Kool-Aid powder. Fifteen minutes after consuming the beverage, participants responded to the question, “To what extent do you believe that pigs can fly?” on a seven-point scale from 1 (Not at all) to 7 (Definitely). Participants in the “drank the Kool-Aid” condition reported a significantly stronger belief in the ability of pigs to fly (M=5.19, SD=1.34) than those in the control condition (M=3.86, SD=1.41), t(53)=3.59, p<.001.
These results seem superficially reasonable, but are actually mathematically impossible. The reported means represent either errors of transcription, some version of misreporting, or the deliberate manipulation of results. Specifically, the mean of the 28 participants in the experimental condition, reported as 5.19, cannot be correct. Since all responses were integers between 1 and 7, the total of the response scores across all participants must fall in the range 28–196. The two integers that give a result closest to the reported mean of 5.19 are 145 and 146. However, 145 divided by 28 is $5.17\overline{857142}$, which conventional rounding returns as 5.18. Likewise, 146 divided by 28 is $5.21\overline{428571}$, which rounds to 5.21. That is, there is no combination of responses that can give a mean of 5.19 when correctly rounded. Similar considerations apply to the reported mean of 3.86 in the control condition: Multiplying this value by the sample size (27) gives 104.22, suggesting that the total score across participants must have been either 104 or 105. But 104 divided by 27 is $3.\overline{851}$, which rounds to 3.85, and 105 divided by 27 is $3.\overline{8}$, which rounds to 3.89.
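The arithmetic above can be automated. The following is a minimal sketch, not the authors' own implementation: the function names `round_half_up` and `grim_consistent` are hypothetical, and the code assumes single-item integer responses and conventional "round half up" rounding to two decimal places.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value: Decimal, places: int = 2) -> Decimal:
    """Round a Decimal to `places` decimal places, with halves rounded up."""
    quantum = Decimal(10) ** -places  # Decimal('0.01') when places == 2
    return value.quantize(quantum, rounding=ROUND_HALF_UP)


def grim_consistent(reported_mean: str, n: int, places: int = 2) -> bool:
    """Return True if some integer total divided by n rounds to reported_mean.

    reported_mean is passed as a string so that it is parsed exactly.
    Assumption: one integer-valued item per participant.
    """
    target = Decimal(reported_mean)
    approx_total = int(target * n)  # e.g. 5.19 * 28 = 145.32 -> 145
    # Check the integer totals that bracket reported_mean * n.
    for total in range(max(approx_total - 1, 0), approx_total + 3):
        if round_half_up(Decimal(total) / n, places) == target:
            return True
    return False


print(grim_consistent("5.19", 28))  # experimental condition in the extract
print(grim_consistent("3.86", 27))  # control condition in the extract
print(grim_consistent("5.18", 28))  # 145 / 28, correctly rounded
```

Passing the mean as a string and using `Decimal` avoids binary floating-point surprises when a quotient lies close to a rounding boundary.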
In this article, we first introduce the general background to and calculation of what we term the Granularity-Related Inconsistency of Means (GRIM) test. Next, we report on the results of an analysis using the GRIM test of a number of published articles from leading psychological journals. Finally, we discuss the implications of these results for the published literature in empirical psychology.
General description of the GRIM technique for reanalyzing published data
Participant response data collected in psychology are typically ordinal in nature; that is, the recorded values have meaning in terms of their rank order, but the numbers representing them are arbitrary, such that the value corresponding to any item has no significance beyond its ability to establish a position on a continuum relative to the other numbers. For example, the seven-point scale cited in our opening example, running from 1 to 7, could equally well have been coded from 0 to 6, or from 6 to 0, or from 10 to 70 in steps of 10. However, while the limits of ordinal data in measurement have been extensively discussed for many years (e.g., Carifio & Perla, 2007; Coombs, 1960; Jamieson, 2004; Thurstone, 1927), it remains common practice to treat ordinal data composed of small integers as if they were measured on an interval scale, calculate their means and standard deviations, and apply inferential statistics to those values. Other common measures used in psychological research produce genuine interval-level data in the form of integers; for example, one might count the number of anagrams unscrambled, or the number of errors made on the Stroop test, within a given time interval. Thus, psychological data often consist of integer totals, divided by the sample size.
One often-overlooked property of data derived from such non-continuous measures, whether ordinal or interval, is their granularity, that is, the numerical separation between possible values of the summary statistics. Here, we consider the example of the mean. With typical Likert-type data, the smallest amount by which two means can differ is the reciprocal of the product of the number of participants and the number of items (questions) that make up the scale. For example, if we administer a three-item Likert-type measure to 10 people, the smallest amount by which two mean scores can differ (the granularity of the mean) is $1/(10 \times 3) = 0.03\overline{3}$. If means are reported to two decimal places, then, although there are 100 possible numbers with two decimal places in the range $1 \le X < 2$ (1.00, 1.01, 1.02, etc., up to 1.99), the possible values of the (rounded) mean are considerably fewer (1.00, 1.03, 1.07, 1.10, etc., up to 1.97). If the number of participants (N) is less than 100 and the measured quantity is an integer, then not all of the possible sequences of two digits can occur after the decimal point in correctly rounded fractions. We use the term “inconsistent” to refer to reported means of integer data whose value, appropriately rounded, cannot be reconciled with the stated sample size. (More generally, if the number of decimal places reported is D, then some combinations of digits will not be consistent if N is less than $10^D$.)
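As an illustration of granularity, the possible rounded means for the three-item, 10-person example can be enumerated directly. This is a sketch under stated assumptions: `possible_means` is a hypothetical helper, item scores are integers, means are restricted to the interval from 1 up to (but not including) 2, and rounding is "half up".

```python
from decimal import Decimal, ROUND_HALF_UP


def possible_means(n, items=1, low=1, high=2, places=2):
    """List every distinct rounded mean in [low, high) for n people and `items` items."""
    cells = n * items  # number of integer scores contributing to the mean
    quantum = Decimal(10) ** -places
    means = set()
    # A mean of total / cells lies in [low, high) when low*cells <= total < high*cells.
    for total in range(low * cells, high * cells):
        means.add((Decimal(total) / cells).quantize(quantum, rounding=ROUND_HALF_UP))
    return sorted(means)


values = possible_means(n=10, items=3)
print(len(values))            # number of attainable two-decimal means
print(values[:4], values[-1])  # smallest four values and the largest
```

With 10 participants and three items there are only 30 attainable values among the 100 two-decimal numbers in that interval, which is what makes the remaining 70 detectable as inconsistent.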
This relation is always true for integer data that are recorded as single items, such as participants’ ages in whole years, or a one-item Likert-type measure, as frequently used as a manipulation check. In particular, the number of possible responses to each item is irrelevant; that is, it makes no difference whether responses can range from 0 to 3, or from 1 to 100. When a composite measure is used, such as one with three Likert-type items where the mean of the item scores is taken as the value of the measure, this mean value will not necessarily be an integer; instead, it will be some multiple of (1/L), where L is the number of items in the measure. Similar considerations would apply to a hypothetical one-item measure where the possible responses are simple fractions instead of integers. For example, a scale with possible responses of 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0 would be equivalent to a two-item measure with integer responses in the range 0–3. Alternatively, in a money game where participants play with quarters, and the final amount won or lost is expressed in dollars, only values ending in .00, .25, .50 or .75 are possible. However, the range of possible values that such means can take is still constrained (in the example of the three-item Likert-type scale, assuming item scores starting at 1, this range will be 1.00, $1.3\overline{3}$, $1.6\overline{6}$, 2.00, $2.3\overline{3}$, etc.), and so for any given sample size, the range of possible values for the mean of all participants is also constrained. For example, with a sample size of 20 and L=3, possible values for the mean are 1.00, 1.02 [rounded from $1.01\overline{6}$], 1.03 [rounded from $1.03\overline{3}$], 1.05, 1.07, etc. More generally, the range of means for a measure with L items (or an interval scale with an implicit granularity of (1/L), where L is a small integer, such as 4 in the example of the game played with quarters) and a sample size of N is identical to the range of means for a measure with one item and a sample size of L × N. Thus, by multiplying the sample size by the number of items in the scale, composite measures can be analyzed using the GRIM technique in the same way as single items, although as the number of scale items increases, the maximum sample size for which this analysis is possible is correspondingly reduced as the granularity decreases towards 0.01. We use the term “GRIM-testable” to refer to variables whose granularity (typically, one divided by the product of the number of scale items and the number of participants) is sufficiently large that they can be tested for inconsistencies with the GRIM technique. For example, a five-item measure with 25 participants has the same granularity (0.008) as a one-item measure with 125 participants, and hence scores on this measure are not typically GRIM-testable.
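The equivalence between a composite measure and a single item with a larger sample can be expressed in a few lines. This is an illustrative sketch only: `granularity` and `is_grim_testable` are hypothetical names, and the testability rule coded here (the product of participants and items must be below 10 raised to the number of reported decimal places) is the one stated in the text for two-decimal reporting.

```python
from fractions import Fraction


def granularity(n, items=1):
    """Smallest possible difference between two means: 1 / (participants * items)."""
    return Fraction(1, n * items)


def is_grim_testable(n, items=1, places=2):
    """True when the granularity is coarser than the reporting precision."""
    return n * items < 10 ** places


# A five-item measure with 25 participants behaves like one item with 125 participants.
print(granularity(25, items=5) == granularity(125))  # same granularity
print(float(granularity(25, items=5)))               # 0.008
print(is_grim_testable(25, items=5))                 # False: 125 is not below 100
print(is_grim_testable(20, items=3))                 # True: 60 is below 100
```

Using `Fraction` keeps the granularity exact, so two designs can be compared for equality without floating-point tolerance.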
Figure 1. Plot of consistent (white dots) and inconsistent (black dots) means, reported to 2 decimal places.
Notes:
1. As the sample size increases towards 100, the number of means that are consistent with that sample size also increases, as shown by the greater number of white (versus black) dots. Thus, GRIM works better with smaller sample sizes, as the chance of any individual incorrectly-reported mean being consistent by chance is lower.
2. The Y axis represents only the fractional portion of the mean (i.e., the part after the decimal point), because the integer portion of the mean plays no role. That is, for any given sample size, if a mean of 2.49 is consistent with the sample size, then means of 0.49 or 8.49 are also consistent.
3. This figure assumes that means ending in 5 at the third decimal place (e.g., 10/80=0.125) are always rounded up; if such means are allowed to be rounded up or down, a few extra white dots will appear at sample sizes that are multiples of 8.
Figure 1 shows the distribution of consistent (shown in white) and inconsistent (shown in black) means as a function of the sample size. Note that only the two-digit fractional portion of each mean is linked to consistency; the integer portion plays no role. The overall pattern is clear: As the sample size increases, the number of means that are consistent with that sample size also increases, and so the chance that any single incorrectly-reported mean will be detected as inconsistent is reduced. However, even with quite large sample sizes, it is still possible to detect inconsistent means if an article contains multiple inconsistencies. For example, consider a study with N=75 and six reported “means” whose values have, in fact, been chosen at random: There is a 75% chance that any one random “mean” will be consistent, but only a 17.8% ($0.75^6$) chance that all six will be.
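The figure of 17.8% follows from multiplying the single-mean probability by itself six times. A short sketch, assuming (as the example does) that each random “mean” is an independent draw in which every two-digit fractional part is equally likely; `prob_all_consistent` is a hypothetical name.

```python
def prob_all_consistent(n, k, places=2):
    """Chance that k independent random 'means' are all GRIM-consistent for sample size n."""
    p_single = min(1.0, n / 10 ** places)  # about n/100 for two-decimal reporting
    return p_single ** k


p = prob_all_consistent(n=75, k=6)
print(f"{p:.3f}")      # chance that all six random means pass
print(f"{1 - p:.3f}")  # chance that at least one is flagged as inconsistent
```

The complement is the more useful quantity in practice: under these assumptions, at least one of six random values would be flagged a little over 82% of the time.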
Our general formula, then, is that when the number of participants (N) is multiplied by the number of items composing a measured quantity (L, commonly equal to 1), and the means that are based on N are reported to D decimal places, then if $L \times N < 10^D$, there exists some number of decimal fractions of length D that cannot occur if the means are reported correctly. The number of inconsistent values is generally equal to $(10^D - L \times N)$; however, in the analyses reported in the present article, we conservatively allowed numbers ending in exactly 5 at the third decimal place to be rounded either up or down without treating the resulting means as inconsistent, so that some values of N have fewer possible inconsistent means than this formula indicates.
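The count of impossible fractional parts can be checked by brute force. This sketch is an assumption-laden illustration rather than the authors' procedure: `consistent_fractions` and `count_inconsistent` are hypothetical names, `cells` stands for the number of participants multiplied by the number of items, and rounding is strictly "half up" (the authors' analyses were more lenient, accepting either rounding direction for values ending in exactly 5).

```python
from decimal import Decimal, ROUND_HALF_UP


def consistent_fractions(cells, places=2):
    """Set of fractional parts that a correctly rounded mean of `cells` integers can show."""
    quantum = Decimal(10) ** -places
    found = set()
    # Totals 0 .. cells-1 produce every possible fractional part of total / cells.
    for total in range(cells):
        mean = (Decimal(total) / cells).quantize(quantum, rounding=ROUND_HALF_UP)
        found.add(mean % 1)  # a value that rounds up to 1.00 counts as .00
    return found


def count_inconsistent(cells, places=2):
    """Number of fractional parts of length `places` that cannot occur."""
    return 10 ** places - len(consistent_fractions(cells, places))


for cells in (28, 75, 100):
    print(cells, count_inconsistent(cells))
```

For fewer than 100 cells and two decimal places, each total yields a distinct rounded value, so the brute-force count agrees with subtracting the number of cells from 100.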
Using the GRIM technique, it is possible to examine published reports of empirical research to see whether the means have been reported correctly. (We have provided a simple spreadsheet at … that automates the steps of this ….) Psychological journals typically require the reporting of means to two decimal places, in which case the sample size corresponding to each mean must be less than 100 in order for its consistency to be checked. However, since the means of interest in experimental psychology are often those for subgroups of the overall sample (for example, the numbers in each experimental condition), it can still be possible to apply the GRIM technique to studies with overall sample sizes substantially above 100. (Note that percentages reported to only one decimal place can typically be tested for consistency with a sample size of up to 1000, as they are, in effect, fractions reported to three decimal places.)
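The same idea carries over to percentages. The sketch below, with the hypothetical name `percent_consistent`, checks whether a percentage reported to one decimal place can arise from a whole-number count out of n, assuming "half up" rounding; the figures used are the article's own (35 of 71 articles, and 189 of 260).

```python
from decimal import Decimal, ROUND_HALF_UP


def percent_consistent(reported_percent: str, n: int, places: int = 1) -> bool:
    """True if count / n * 100 rounds to reported_percent for some integer count in 0..n."""
    target = Decimal(reported_percent)
    quantum = Decimal(10) ** -places
    for count in range(n + 1):
        pct = (Decimal(100 * count) / n).quantize(quantum, rounding=ROUND_HALF_UP)
        if pct == target:
            return True
    return False


print(percent_consistent("49.3", 71))   # 35 / 71
print(percent_consistent("72.7", 260))  # 189 / 260
print(percent_consistent("49.5", 71))   # no count out of 71 gives this value
```

An exhaustive loop over counts is used here for clarity; it is fast enough for the sample sizes (up to about 1000) at which this check is informative.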
We now turn to our pilot trial of the GRIM test.
We searched recently published (2011–2015) issues of Psychological Science (PS), Journal of Experimental Psychology: General (JEP:G), and Journal of Personality and Social Psychology (JPSP) for articles containing the word “Likert” anywhere in the text. This strategy was chosen because we expected to find Likert-type data reported in most of the articles containing that word (although we also checked the consistency of the means of other integer data where possible). We sorted the results with the most recent first and downloaded at most the first 100 matching articles from each journal. Thus, our sample consisted of 100 articles from PS published between January 2011 and December 2015, 60 articles from JEP:G published between January 2011 and December 2015, and 100 articles from JPSP published between October 2012 and December 2015.
We examined the Method section of each study reported in these articles to see whether GRIM-testable measures were used, and to determine the sample sizes for the study and, where appropriate, each condition. A preliminary check was performed by the first author; if he did not see evidence of either GRIM-testable measures, or any (sub)sample sizes less than 100, the article was discarded. Subsequently, each author worked independently on the retained articles. We examined the table of descriptives (if present), other result tables, and the text of the Results section, looking for means or percentages that we could check using the GRIM technique. On the basis of our tests, we assigned each article a subjective “inconsistency level” rating. A rating of 0 (no problems) meant that all the means we were able to check were consistent, even if those means represented only a small percentage of the reported data in the article. We assigned a rating of 1 (minor problems) to articles that contained only one or two inconsistent numbers, where we believed that these were most parsimoniously explained by typographical or transcription errors, and where an incorrect value would have little effect on the main conclusions of the article. Articles that had a small number of inconsistencies that might impact the principal results were rated at level 2 (moderate problems); we also gave this rating to articles in which the results seemed to be uninterpretable as described. We applied a rating of 3 (substantial problems) to articles with a larger number of inconsistencies, especially if these appeared at multiple points in the article. Finally, ratings were compared between the authors and differences resolved by discussion.
The total number of articles examined from each journal, the number retained for GRIM analysis, and the number to which we assigned each rating, are shown in Table 1. A total of 260 articles were initially examined. Of these, 189 (72.7%) were discarded, principally because either they reported no GRIM-testable data or their sample sizes were all sufficiently large that no inconsistent means were likely to be detected. Of the remaining 71 articles, 35 (49.3%) reported all GRIM-testable data consistently and were assigned an inconsistency level rating of 0. That left us with 36 articles that appeared to contain one or more inconsistencies. Of these, we assigned a rating of 1 to 15 articles (21.1% of the 71 in total for which we performed a GRIM analysis), a rating of 2 to five articles (7.0%), and a rating of 3 to 16 articles (22.5%). In some of these “level 3” articles, over half of the GRIM-testable values were inconsistent with the stated sample size.
Table 1. Journals and Articles Consulted

Journal                                    PS             JEP:G          JPSP           Total
Number of articles                         100            60             100            260
Earliest article date                      January 2011   January 2011   October 2012
Articles with GRIM-testable data           29             15             27             71
Level 0 articles (no problems detected)    16             8              11             35
Level 1 articles (minor problems)          5              3              7              15
Level 2 articles (moderate problems)       1              1              3              5
Level 3 articles (substantial problems)    7              3              6              16

Notes: PS = Psychological Science. JEP:G = Journal of Experimental Psychology: General. JPSP = Journal of Personality and Social Psychology.
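The totals and percentages quoted alongside Table 1 can be re-derived from the per-journal counts. The snippet below is a simple cross-check written for illustration; the dictionary layout and the helper names `column_totals` and `percent` are assumptions, while the counts are transcribed from the table.

```python
# Counts transcribed from Table 1: articles examined, articles with
# GRIM-testable data, and the number of articles rated at levels 0 to 3.
TABLE_1 = {
    "PS":    {"examined": 100, "testable": 29, "levels": [16, 5, 1, 7]},
    "JEP:G": {"examined": 60,  "testable": 15, "levels": [8, 3, 1, 3]},
    "JPSP":  {"examined": 100, "testable": 27, "levels": [11, 7, 3, 6]},
}


def column_totals(table):
    """Sum each row of the table across the three journals."""
    examined = sum(row["examined"] for row in table.values())
    testable = sum(row["testable"] for row in table.values())
    levels = [sum(row["levels"][i] for row in table.values()) for i in range(4)]
    return examined, testable, levels


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


examined, testable, levels = column_totals(TABLE_1)
print(examined, testable, levels)
print(percent(examined - testable, examined))   # share of articles discarded
print([percent(x, testable) for x in levels])   # share at each rating level
```

Within each journal the four rating counts also add up to the number of GRIM-testable articles, so the table is internally consistent.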
Next, we e-mailed the corresponding authors of the articles that were rated at level 2 or 3 asking for their data. (The text of our e-mails is available in the supplementary information for this article.) In response to our 21 initial requests, we received 11 replies within two weeks. At the end of that period, we sent follow-up requests to the 10 authors who had not replied to our initial e-mail. In response to either the first or second e-mail, we obtained the requested data from eight authors, while a ninth provided us with sufficient information about the data in question to enable us to check the consistency of the means. Four authors promised to send the requested data, but have not done so to date. Five authors either directly or effectively refused to share their data, even after we explained the nature of our study; interestingly, two of these refusals were identically worded. In another case, the corresponding author’s personal e-mail address had been deleted; another author informed us that the corresponding author had left academia, and that the location of the data was unknown. Finally, two of our requests went completely unanswered after the second e-mail.
Our examination of the data that we received showed that the GRIM technique identified one or more genuine problems in each case. We report the results of each analysis briefly here, in the order in which the data were received.
Data set 1. Our GRIM analysis had detected two inconsistent means in a table of descriptives, as well as eight inconsistent standard deviations. (SDs exhibit granularity in an analogous way to means, but the determination of (in)consistency for SDs is considerably more complicated. We hope to cover the topic of inconsistent SDs in a future article.) Examining the data, we found that the two inconsistent means and one of the inconsistent SDs were caused by the sample size for that cell not corresponding to the sample size for the column of data in question; five SDs had been incorrectly rounded because the default (3 decimal places) setting of SPSS had caused a value of 1.2849 to be rounded to 1.285, which the authors had subsequently rounded manually to 1.29; and two further SDs appeared to have been incorrectly transcribed, with values of 0.79 and 0.89 being reported as 0.76 and 0.86, respectively. All of these errors were minor and had no substantive effect on the published results of the article.
Data set 2. Our reading of the article in this case had detected several inconsistent means, as well as several inconsistently-reported degrees of freedom and apparent errors in the reporting of some other statistics. Examination of the data confirmed most of these problems, and indeed revealed a number of additional errors in the authors’ analysis. We subsequently discovered that the article in question had already been the subject of a correction in the journal, although that had not addressed most of the problems that we found. We will write to the authors to suggest a number of points that require (further) correction.
Data set 3. In this case, our GRIM analysis had shown a large number of inconsistent means in two tables of descriptives. The corresponding author provided us with an extensive version of the data set, including some intermediate analysis steps. We identified that most of the entries in the descriptives had been calculated using a Microsoft Excel formula that included an incorrect selection of cells; for example, this resulted in the mean and SD of the first experimental condition being included as data points in the calculation of the mean and SD of the second. The author has assured us that a correction will be issued.
Data set 4. In the e-mail accompanying their data, the authors of this article spontaneously apologized in advance (even though we had not yet told them exactly why we were asking for their data) for possible discrepancies between the sample sizes in the data and those reported in the article. They stated that, due to computer-related issues, they had only been able to retrieve an earlier version of the data set, rather than the final version on which the article was based. We adjusted the published sample sizes using the notes that the authors provided, and found that this adequately resolved the GRIM inconsistencies that we had identified.
Data set 5. The GRIM analyses in this case found some inconsistent means in the reporting of the data that were used as the input to a number of tests, as well as in the descriptives for one of the conditions in the study. Analysis revealed that the former problems were the result of the authors having reported the Ns from the output of a repeated-measures ANOVA in which some cases were missing, so that these Ns were smaller than those reported in the method section. The problems in the descriptives were caused by incorrect reporting of the number of participants who were excluded from analyses. We were unable to determine to what extent this difference affected the results of the study.
Data set 6. Here, the inconsistencies that we detected were mostly due to the misreporting by the authors of their sample size. This was not easy to explain as a typographical error, as the number was reported as a word at the start of a sentence (e.g., “Sixty undergraduates took part”). Additionally, one inconsistent standard deviation turned out to have been incorrectly copied during the drafting process.
Data set 7. This data set confirmed numerous inconsistencies, including large errors in the reported degrees of freedom for several tests, from which we had inferred the per-cell sample sizes. Furthermore, a number that was meant to be the result of subtracting one Likert-type item score from another (thus giving an integer result) had the impossible value of 1.5. We reported these inconsistencies to the corresponding author, but received no acknowledgement.
Data set 8. The corresponding author indicated that providing the full data set could be complicated, as the data were taken from a much larger longitudinal study. Instead, we provided a detailed explanation of the specific inconsistencies we had found. The author checked these and confirmed that the sample size of the study in question had been reported incorrectly, as several participants had been excluded from the analyses but not from the reported count of participants. The author thanked us for finding this minor (to us) inconsistency and described the exercise as “a good lesson.”
Data set 9. In this case, we asked for data for three studies from a multiple-study article. In the first two studies, we found some reporting problems with standard deviations in the descriptives and some other minor problems to do with the handling of missing values for some variables. For the third study, however, the corresponding author reported that, during the process of preparing the data set to send to us, an error in the analyses had been discovered that was sufficiently serious as to warrant a correction to the published article.
For completeness, we should also mention that in one of the cases above, the data that we received showed that we had failed to completely understand the original article; what we had thought were inconsistencies in the means on a Likert-type measure were due to that measure being a multiple-item composite, and we had overlooked that it was correctly reported as such. While our analysis also discovered separate problems with the article in question, this underscores how careful reading is always necessary when using the GRIM technique.
We identified a simple method for detecting discrepancies in the reporting of statistics derived from integer-based data, and applied it to a sample of empirical articles published in leading journals of psychology. Of the articles that we were able to test, around half appeared to contain one or more errors in the summary statistics. (We have no way of knowing how many inconsistencies might have been discovered in the articles with larger samples, had it been standard practice to report means to three decimal places.) Nine data sets were examined in more detail, and we confirmed the existence of reporting problems in all nine, with three articles requiring formal corrections.
We anticipate that the GRIM technique could be a useful tool for reviewers and editors. A GRIM check of the reported means of an article submitted for review ought to take only a few minutes. (Indeed, we found that even when no inconsistencies were uncovered, simply performing this check enhanced our understanding of the methods used in the articles that we read.) When GRIM errors are discovered, depending on their extent and how the reviewer feels they impact the article, actions could range from asking the authors to check a particular calculation, to informing the action editor confidentially that there appear to be severe problems with the manuscript.
When an inconsistent mean is uncovered by this method, we of course have no information about the true mean value that was obtained; that can only be determined by a reanalysis of the original data. But such an inconsistency does indicate, at a minimum, that a mistake has been made. When multiple inconsistencies are demonstrated in the same article, we feel that the reader is entitled to question what else might not have been reported accurately. Note also that not all incorrectly reported means will be detected using the GRIM technique, because such a mean can still be consistent by chance. With reporting to two decimal places, for a sample size N<100, a random “mean” value will be consistent in approximately N% of cases. Thus, the number of GRIM errors detected in an article is likely to be a conservative estimate of the true number of such errors.
A limitation of the GRIM technique is that, with the standard reporting of means to two decimal places, it cannot reveal inconsistencies with per-cell sample sizes of 100 or more, and its ability to detect such inconsistencies decreases as the sample size (or the number of items in a composite measure) increases. However, this still leaves a substantial percentage of the literature that can be tested. Recall that we selected our articles from some of the highest-impact journals in the field; it might be that other journals have a higher proportion of smaller studies. Additionally, it might be the case that smaller studies are more prone to reporting errors (for example, because they are run by laboratories that have fewer resources for professional data management).
A further potential source of false positives is the case where one or more participants are missing values for individual items in a composite measure, thus making the denominator for the mean of that measure smaller than the overall sample size. However, in our admittedly modest sample of articles, this issue only caused inconsistencies in one case. We believe that this limitation is unlikely to be a major problem in practice because the GRIM test is typically not applicable to measures with a large number of items, due to the requirement for the product of the per-cell sample size and the number of items to be less than 100.
Concluding remarks
On its own, the discovery of one or more inconsistent means in a published article need not be a cause for alarm; indeed, we discovered from our reanalysis of data sets that in many cases where such inconsistencies were present, there was a straightforward explanation, such as a minor error in the reported sample sizes, or a failure to report the exclusion of a participant. Sometimes, too, the reader performing the GRIM analysis may make errors, such as not noticing that what looks like a single Likert-type item is in fact a composite measure.
It might also be that psychologists are simply sometimes rather careless in retyping numbers from statistical software packages into their articles. However, in such cases, we think it is legitimate to ask how many other elementary mistakes might have been made in the analysis of the data, and with what effects on the reported results. It is interesting to compare our experiences with those of Wolins (1962), who asked 37 authors for their data, obtained these in usable form from seven authors, and found “gross errors” in three cases. While the numbers of studies in both Wolins’ and our cases are small, the percentage of severe problems is, at an anecdotal level, worrying. Indeed, we wonder whether some proportion of the failures to replicate published research in psychology (Open Science Collaboration, 2015) might simply be due to the initial (or, conceivably, the replication) results being the products of erroneous analyses.
Beyond inattention and poorly designed analyses, however, we cannot exclude the possibility
that, in some cases, a plausible explanation for GRIM inconsistencies is that some form of data
manipulation has taken place. For example, in the fictional extract at the start of this article, here
is what should have been written in the last sentence:
Participants in the “drank the Kool-Aid” condition did not report a significantly stronger
belief in the ability of pigs to fly ( =1.34) than those in the control condition =.16.
In the “published” extract, compared to the above version, the first mean was “adjusted” by
adding 0.40, and the second by subtracting 0.40. This transformed a non-significant p value into
a significant one, thus making the results considerably easier to publish (cf. Kühberger, Fritz, &
Scherndl, 2014).
We are particularly concerned about the eight data sets (out of the 21 we requested) that
we believe we may never see (five due to refusals to share the data, two due to repeated
non-response to our requests, and one due to the apparent disappearance of the corresponding author).
Refusing to share one’s data for reanalysis without giving a clear and relevant reason is, we feel,
professionally disrespectful at best, especially after authors have assented to such sharing as a
condition of publication, as is the case in (for example) APA journals.
We support the principle, currently being adopted by several journals, that sharing of data ought
to be the default situation, with authors having to provide strong arguments why their data cannot
be shared in any given case. When accompanied by numerical evidence that the results of a
published article may be unreliable, a refusal to share data will inevitably cause speculation
about what those data might reveal. However, throughout the present article, we have refrained
from mentioning the titles, authors, or any other identifying features of the articles in which the
GRIM analysis identified apparent inconsistencies. There are three reasons for this. First, the
GRIM technique was exploratory when we started to examine the published articles, rather than
an established method. Second, there may be an innocent explanation for any or all of the
inconsistencies that we identified in any given article. Third, it is not our purpose here to
“expose” anything or anyone; we offer our results in the hope that they will stimulate discussion
within the field. It would appear, as a minimum, that we have identified an issue worthy of
further investigation, and produced a tool that might assist reviewers of future work, as well as
those who wish to check certain results in the existing literature.
References
Carifio, J., & Perla, R. J. (2007). Ten common misunderstandings, misconceptions, persistent myths and urban legends about Likert scales and Likert response formats and their antidotes. Journal of Social Sciences, 3, 106–116.
Coombs, C. H. (1960). A theory of data. Psychological Review, 67, 143–159.
Jamieson, S. (2004). Likert scales: How to (ab)use them. Medical Education, 38, 1212–1218.
Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS ONE, 9(9), e105825.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286.
Wolins, L. (1962). Responsibility for raw data. American Psychologist, 17, 657–658.
Supplemental Information
A numerical demonstration of the GRIM technique
For readers who prefer to follow a worked example, we present here a simple method for
performing the GRIM test to check the consistency of a mean. We assume that some quantity
has been measured as an integer across a sample of participants and reported as a mean to two
decimal places. For example:
Participants (N = 52) responded to the manipulation check question, “To what extent did
you believe the research assistant’s story that the dog had eaten his homework?” on a 1–7
Likert-type scale. Results showed that they found the story convincing (M = 6.28,
SD = 1.22).
In terms of the formulae given earlier, N (the number of participants) is 52, L (the number of
Likert-type items) is 1, and D (the number of decimal places reported) is 2. Thus, $10^D$ is 100,
which is greater than $L \times N$ (52), and so the means here are GRIM-testable. The first step is to
multiply the sample size (52) by the number of items (1), giving 52. Then, take that product and
multiply it by the reported mean. In this example, that gives (6.28 × 52) = 326.56. Next, round
that product to the nearest integer (here, we round up to 327). Now, divide that integer by the
sample size, rounding the result to two decimal places, giving (327 / 52) = 6.29. Finally,
compare this result with the original mean. If they are identical, then the mean is consistent with
the sample size and integer data; if they are different, as in this case (6.28 versus 6.29), the mean
is inconsistent.
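The steps just described can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name grim_consistent and its parameters are hypothetical, the reported mean is passed as a string so that its number of decimal places can be read off directly, and half-up rounding is assumed throughout (software that rounds differently may disagree in boundary cases).

```python
from decimal import Decimal, ROUND_HALF_UP


def grim_consistent(reported_mean: str, n: int, n_items: int = 1) -> bool:
    """Apply the worked GRIM procedure to one reported mean (hypothetical helper)."""
    mean = Decimal(reported_mean)
    decimals = -mean.as_tuple().exponent      # e.g. "6.28" -> 2 decimal places
    cells = n * n_items                       # sample size times number of items
    # Multiply the mean by the product and round to the nearest integer total.
    total = (mean * cells).quantize(Decimal(1), rounding=ROUND_HALF_UP)
    # Divide back and round to the reported precision.
    unit = Decimal(1).scaleb(-decimals)       # 0.01 for two decimal places
    recomputed = (total / cells).quantize(unit, rounding=ROUND_HALF_UP)
    # The mean is consistent only if the recomputed value matches the report.
    return recomputed == mean


print(grim_consistent("6.28", 52))  # False: 327 / 52 rounds to 6.29
print(grim_consistent("6.29", 52))  # True
```

Decimal arithmetic is used here instead of binary floating point so that values such as 326.56 are represented exactly before rounding.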
When the quantity being measured is a composite Likert-type measure, or some other
simple fraction, it may still be GRIM-testable. For example:
Participants (N = 21) responded to three Likert-type items (on a 0–4 scale, with 0 = not at all)
asking them how rich, famous, and successful they felt. These items were averaged into
a single measure of fabulousness (M = 3.77, SD = 0.63).
In this case, the measured quantity (the mean score for fabulousness) can take on the values 1.00,
$1.\overline{33}$, $1.\overline{66}$, 2.00, $2.\overline{33}$, $2.\overline{66}$, 3.00, etc. The granularity of this quantity is thus finer than if it
had been reported as an integer (e.g., if the mean of the total scores for the three components,
rather than the mean of the means of the three components, had been reported). However, the
sample size is sufficiently small that we can still perform a GRIM test, by multiplying the sample
size by the number of items that were averaged to make the composite measure (i.e., three)
before performing the steps just indicated. In terms of our formulae, N is 21, L is 3, and D is
again 2. $10^D$ is 100, which is greater than $L \times N$ (63), so the means here are again
GRIM-testable. Thus, in this case, we multiply the sample size (21) by the number of items (3) to get
63; multiply this result by the reported mean (3.77), giving 237.51; round 237.51 to the nearest
integer (238); divide 238 by 63 to get $3.\overline{777}$, which rounds to 3.78; and observe that, once again,
this mean is inconsistent with the reported sample size.
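To see why 3.77 cannot occur here, it can help to list the means that are actually achievable on either side of the reported value. The following Python sketch does this for the composite example; the function name and parameters are hypothetical, and half-up rounding to two decimal places is an assumption of the sketch.

```python
from decimal import Decimal, ROUND_HALF_UP


def neighbouring_means(reported_mean: str, n: int, n_items: int, decimals: int = 2):
    """Return the rounded means for the two integer totals that bracket the
    total implied by the reported mean (hypothetical helper)."""
    cells = n * n_items                        # 21 participants x 3 items = 63
    unit = Decimal(1).scaleb(-decimals)        # 0.01 for two decimal places
    implied_total = Decimal(reported_mean) * cells
    lower = int(implied_total)                 # integer part (non-negative data assumed)
    return [
        (Decimal(total) / cells).quantize(unit, rounding=ROUND_HALF_UP)
        for total in (lower, lower + 1)
    ]


means = neighbouring_means("3.77", n=21, n_items=3)
print(means)                     # [Decimal('3.76'), Decimal('3.78')]
print(Decimal("3.77") in means)  # False: the reported mean falls between the two
```

The two bracketing totals, 237 and 238, give means of 3.76 and 3.78 respectively, so no sum of 63 integer responses can produce a mean that rounds to 3.77.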
Professor Irving Herman (personal communication, August 11, 2016) has pointed out that
another (equivalent) method to test for inconsistency is to multiply the minimum and maximum
possible unrounded mean values by the sample size, and see if the integer parts of the results are
the same; if so, the mean is inconsistent, as there is no integer that it could represent. For
example, a reported mean of 5.19 could correspond to any value between 5.185 and 5.195. With
a sample size of 28, the range of possible values is from (5.185 × 28) = 145.18 to
(5.195 × 28) = 145.46; thus, there is no possible integer that, when divided by the sample size,
could give a mean that would round to 5.19.
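This interval-based check can also be sketched in Python. The function name and parameters below are hypothetical, and the sketch follows the description literally: it compares the integer parts of the two bounds and does not treat specially the boundary case in which a bound is itself an exact integer, where the answer depends on the rounding convention assumed.

```python
from decimal import Decimal


def interval_inconsistent(reported_mean: str, n: int, n_items: int = 1) -> bool:
    """Return True if no integer total lies strictly inside the interval of
    totals implied by the reported mean (hypothetical helper)."""
    mean = Decimal(reported_mean)
    # Half of the last reported decimal place, e.g. 0.005 for a mean of "5.19".
    half_unit = Decimal(5).scaleb(mean.as_tuple().exponent - 1)
    cells = n * n_items
    low = (mean - half_unit) * cells    # 5.185 * 28 = 145.18
    high = (mean + half_unit) * cells   # 5.195 * 28 = 145.46
    # Identical integer parts mean that no integer total falls between the bounds.
    return int(low) == int(high)


print(interval_inconsistent("5.19", 28))  # True: both bounds have integer part 145
print(interval_inconsistent("5.18", 28))  # False: 145 lies between 144.90 and 145.18
```

For the two examples given earlier in this section (6.28 with a sample size of 52, and 3.77 with 21 participants and three items), this sketch reaches the same verdict of inconsistency as the multiply, round, and divide procedure.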
E-mails sent to authors to request sharing of data
We show here the e-mails that were sent to the authors of the articles in which we found apparent
problems, to request that they share their data. In some cases there were minor variations in
wording or punctuation.
The first e-mail, sent in late January 2016:
Dear Dr. <name>,
We have read with interest your article “<title>”, published in <year> in <Journal>.
We are interested in reproducing the results from this article as part of an ongoing project
concerning the nature of published data.
Accordingly, we request you to provide us with a copy of the dataset for your article, in
order to allow us to verify the substantive claims of your article through reanalysis. We
can read files in SPSS, XLS[x], RTF, TXT, and most proprietary file types (e.g., .MAT).
Thank you for your time.
Sincerely,
Nicholas J. L. Brown
PhD candidate, University Medical Center, Groningen
James Heathers
Postdoctoral fellow, Poznań University of Medical Sciences
We took the words “verify the substantive claims [of your article] through reanalysis” directly
from article 8.14, “Sharing Research Data for Verification”, of the American Psychological
Association’s ethics code. In the case of articles published in APA journals, we knew that the
corresponding author had explicitly agreed to these conditions by signing a copy of a document
entitled “Certification of Compliance With APA Ethical Principles” prior to publication.
The second e-mail, sent about 10 days after the first if we had received no reply to the first:
Dear Dr. <name>,
Not having received a reply to our first e-mail (see below), we are writing to you again.
We apologise if our first message was a little cryptic.
We are working on a technique that we hope will become part of the armoury of peer
reviewers when checking empirical papers, which we hope will allow certain kinds of
problems with the reporting of statistics to be detected from the text. Specifically, we
look at means that do not appear to be consistent with the reported sample size. From a
selection of articles that we have analysed, yours appears to be a case where our
technique might be helpful (if we have understood your method section correctly).
However, we are still refining our technique, which is why we are asking 20 or so authors
to provide us with data so that we can check that we have fully understood their methods,
and see how we should refine the description of our technique to make it as specific and
selective as possible. Comparing the results of its application with the numbers in the
dataset(s) corresponding to the articles that we have identified will hopefully enable us to
understand this process better. So if you could provide us with your data from this article,
that would be very helpful.
Kind regards,
Nick Brown
James Heathers
