
The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology

Nicholas J. L. Brown (*)
University Medical Center, University of Groningen, The Netherlands

James A. J. Heathers
Division of Cardiology and Intensive Therapy, Poznań University of Medical Sciences
University of Sydney

(*) Corresponding author. E-mail: nick.brown@free.fr

Nick Brown is a PhD candidate at the University Medical Center, University of Groningen, The Netherlands.

James Heathers conducted the bulk of the work described in the attached document while a postdoctoral fellow at the Poznań University of Medical Sciences in Poland. He is currently a postdoctoral fellow at Northeastern University.

Acknowledgements

The authors wish to thank Tim Bates and Chris Chambers for their helpful comments on an earlier draft of this article, as well as those authors of the articles that we examined who kindly provided their data sets and helped with the reanalysis of these.

Abstract

We present a simple mathematical technique that we call GRIM (Granularity-Related Inconsistency of Means) for verifying the summary statistics of research reports in psychology. This technique evaluates whether the reported means of integer data such as Likert-type scales are consistent with the given sample size and number of items. We tested this technique with a sample of 260 recent empirical articles in leading journals. Of the articles that we could test with the GRIM technique (N=71), around half (N=36) appeared to contain at least one inconsistent mean, and more than 20% (N=16) contained multiple such inconsistencies. We requested the data sets corresponding to 21 of these articles, receiving positive responses in nine cases. We confirmed the presence of at least one reporting error in all cases, with three articles requiring extensive corrections. The implications for the reliability and replicability of empirical psychology are discussed.

Consider the following (fictional) extract from a recent article in the Journal of Porcine Aviation Potential:

Participants (N=55) were randomly assigned to drink 200ml of water that either contained (experimental condition, N=28) or did not contain (control condition, N=27) 17g of cherry flavor Kool-Aid® powder. Fifteen minutes after consuming the beverage, participants responded to the question, "To what extent do you believe that pigs can fly?" on a seven-point scale from 1 (Not at all) to 7 (Definitely). Participants in the "drank the Kool-Aid" condition reported a significantly stronger belief in the ability of pigs to fly (M=5.19, SD=1.34) than those in the control condition (M=3.86, SD=1.41), t(53)=3.59, p<.001.

These results seem superficially reasonable, but are actually mathematically impossible. The reported means represent either errors of transcription, some version of misreporting, or the deliberate manipulation of results. Specifically, the mean of the 28 participants in the experimental condition, reported as 5.19, cannot be correct. Since all responses were integers between 1 and 7, the total of the response scores across all participants must fall in the range 28–196. The two integers that give a result closest to the reported mean of 5.19 are 145 and 146. However, 145 divided by 28 is 5.178571…, which conventional rounding returns as 5.18. Likewise, 146 divided by 28 is 5.214285…, which rounds to 5.21. That is, there is no combination of responses that can give a mean of 5.19 when correctly rounded. Similar considerations apply to the reported mean of 3.86 in the control condition: Multiplying this value by the sample size (27) gives 104.22, suggesting that the total score across participants must have been either 104 or 105. But 104 divided by 27 is 3.851…, which rounds to 3.85, and 105 divided by 27 is 3.888…, which rounds to 3.89.
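The check illustrated above can be automated. The following is a minimal sketch in Python; it is our own illustration (the function name and interface are not from the article, whose authors provide their own spreadsheet tool). It accepts ties at exactly .5 in the next decimal place rounded in either direction, the conservative choice described later in connection with Figure 1.

```python
from fractions import Fraction

def grim_consistent(mean_str, n, items=1):
    """Return True if a mean reported as mean_str could arise from n
    participants' integer scores on an `items`-item measure.
    Ties at exactly .5 in the next decimal place are accepted whether
    rounded up or down (the conservative choice)."""
    decimals = len(mean_str.split(".")[1]) if "." in mean_str else 0
    reported = Fraction(mean_str)              # exact decimal arithmetic
    grain = Fraction(1, n * items)             # smallest possible mean step
    half_ulp = Fraction(1, 2 * 10**decimals)   # largest admissible rounding error
    candidate = round(reported / grain)        # nearest achievable total score
    # Check the nearest candidate totals against the rounding tolerance.
    return any(abs(t * grain - reported) <= half_ulp
               for t in (candidate - 1, candidate, candidate + 1))

# The fictional extract above:
print(grim_consistent("5.19", 28))  # False: impossible with N=28
print(grim_consistent("5.18", 28))  # True:  145/28 rounds to 5.18
print(grim_consistent("3.86", 27))  # False: impossible with N=27
print(grim_consistent("3.85", 27))  # True:  104/27 rounds to 3.85
```

Exact rational arithmetic (`Fraction`) avoids the binary floating-point artifacts that a naive implementation of this comparison would introduce.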

In this article, we first introduce the general background to and calculation of what we term the Granularity-Related Inconsistent Means (GRIM) test. Next, we report on the results of an analysis using the GRIM test of a number of published articles from leading psychological journals. Finally, we discuss the implications of these results for the published literature in empirical psychology.

General description of the GRIM technique for reanalyzing published data

Participant response data collected in psychology are typically ordinal in nature—that is, the recorded values have meaning in terms of their rank order, but the numbers representing them are arbitrary, such that the value corresponding to any item has no significance beyond its ability to establish a position on a continuum relative to the other numbers. For example, the seven-point scale cited in our opening example, running from 1 to 7, could equally well have been coded from 0 to 6, or from 6 to 0, or from 10 to 70 in steps of 10. However, while the limits of ordinal data in measurement have been extensively discussed for many years (e.g., Carifio & Perla, 2007; Coombs, 1960; Jamieson, 2004; Thurstone, 1927), it remains common practice to treat ordinal data composed of small integers as if they were measured on an interval scale, calculate their means and standard deviations, and apply inferential statistics to those values. Other common measures used in psychological research produce genuine interval-level data in the form of integers; for example, one might count the number of anagrams unscrambled, or the number of errors made on the Stroop test, within a given time interval. Thus, psychological data often consist of integer totals, divided by the sample size.

One often-overlooked property of data derived from such non-continuous measures, whether ordinal or interval, is their "granularity"—that is, the numerical separation between possible values of the summary statistics. Here, we consider the example of the mean. With typical Likert-type data, the smallest amount by which two means can differ is the reciprocal of the product of the number of participants and the number of items (questions) that make up the scale. For example, if we administer a three-item Likert-type measure to 10 people, the smallest amount by which two mean scores can differ (the granularity of the mean) is 1/(10 × 3) = 0.0333…. If means are reported to two decimal places, then—although there are 100 possible numbers with two decimal places in the range 1 ≤ X < 2 (1.00, 1.01, 1.02, etc., up to 1.99)—the possible values of the (rounded) mean are considerably fewer (1.00, 1.03, 1.07, 1.10, etc., up to 1.97). If the number of participants (N) is less than 100 and the measured quantity is an integer, then not all of the possible sequences of two digits can occur after the decimal point in correctly rounded fractions. We use the term "inconsistent" to refer to reported means of integer data whose value, appropriately rounded, cannot be reconciled with the stated sample size. (More generally, if the number of decimal places reported is D, then some combinations of digits will not be consistent if N is less than 10^D.)
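The set of two-digit endings that can occur for a given sample size can be enumerated directly. The sketch below is our own illustration (not the authors' tooling); following the assumption stated in the notes to Figure 1, it rounds ties at the third decimal place half-up:

```python
from decimal import Decimal, ROUND_HALF_UP

def consistent_fractions(n, decimals=2):
    """Return the set of decimal endings (as strings, e.g. '03') that can
    appear after the decimal point in a correctly rounded mean of n integer
    scores. Ties at exactly .5 are rounded half-up here."""
    quantum = Decimal(1).scaleb(-decimals)   # e.g. 0.01 for two decimals
    endings = set()
    for total in range(n):                   # only the fractional part matters
        mean = Decimal(total) / Decimal(n)
        rounded = mean.quantize(quantum, rounding=ROUND_HALF_UP)
        endings.add(str(rounded)[-decimals:])
    return endings

# With N=28, '19' is not a possible ending, as in the opening example:
ends = consistent_fractions(28)
print('19' in ends)                    # False
print('18' in ends)                    # True
print(len(consistent_fractions(10)))   # 10 possible endings for N=10
```

Because consecutive totals differ by 1/n > 0.01 whenever n < 100, each of the n residues yields a distinct ending, matching the count of inconsistent values given later in the article.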

This relation is always true for integer data that are recorded as single items, such as participants' ages in whole years, or a one-item Likert-type measure, as frequently used as a manipulation check. In particular, the number of possible responses to each item is irrelevant; that is, it makes no difference whether responses can range from 0 to 3, or from 1 to 100. When a composite measure is used, such as one with three Likert-type items where the mean of the item scores is taken as the value of the measure, this mean value will not necessarily be an integer; instead, it will be some multiple of (1/L), where L is the number of items in the measure. Similar considerations would apply to a hypothetical one-item measure where the possible responses are simple fractions instead of integers. For example, a scale with possible responses of 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0 would be equivalent to a two-item measure with integer responses in the range 0–3. Alternatively, in a money game where participants play with quarters, and the final amount won or lost is expressed in dollars, only values ending in .00, .25, .50, or .75 are possible. However, the range of possible values that such means can take is still constrained (in the example of the three-item Likert-type scale, assuming item scores starting at 1, this range will be 1.00, 1.33…, 1.66…, 2.00, 2.33…, etc.), and so for any given sample size, the range of possible values for the mean of all participants is also constrained. For example, with a sample size of 20 and L=3, possible values for the mean are 1.00, 1.02 [rounded from 1.0166…], 1.03 [rounded from 1.0333…], 1.05, 1.07, etc. More generally, the range of means for a measure with L items (or an interval scale with an implicit granularity of (1/L), where L is a small integer, such as 4 in the example of the game played with quarters) and a sample size of N is identical to the range of means for a measure with one item and a sample size of N × L. Thus, by multiplying the sample size by the number of items in the scale, composite measures can be analyzed using the GRIM technique in the same way as single items, although as the number of scale items increases, the maximum sample size for which this analysis is possible is correspondingly reduced as the granularity decreases towards 0.01. We use the term "GRIM-testable" to refer to variables whose granularity (typically, one divided by the product of the number of scale items and the number of participants) is sufficiently large that they can be tested for inconsistencies with the GRIM technique. For example, a five-item measure with 25 participants has the same granularity (0.008) as a one-item measure with 125 participants, and hence scores on this measure are not typically GRIM-testable.
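In code terms, the N × L equivalence means a composite measure is handled simply by substituting the effective sample size wherever a single-item sample size would go. A small illustrative helper (our own sketch, not part of the article):

```python
def grim_testable(n, items=1, decimals=2):
    """A mean reported to `decimals` places is GRIM-testable only when the
    effective sample size (participants x items) is below 10**decimals."""
    return n * items < 10**decimals

print(grim_testable(28))           # True:  one item, N=28
print(grim_testable(20, items=3))  # True:  effective sample size of 60
print(grim_testable(25, items=5))  # False: effective sample size of 125
```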

Figure 1. Plot of consistent (white dots) and inconsistent (black dots) means, reported to 2 decimal places.

Notes:
1. As the sample size increases towards 100, the number of means that are consistent with that sample size also increases, as shown by the greater number of white (versus black) dots. Thus, GRIM works better with smaller sample sizes, as the chance of any individual incorrectly-reported mean being consistent by chance is lower.
2. The Y axis represents only the fractional portion of the mean (i.e., the part after the decimal point), because the integer portion of the mean plays no role. That is, for any given sample size, if a mean of 2.49 is consistent with the sample size, then means of 0.49 or 8.49 are also consistent.
3. This figure assumes that means ending in 5 at the third decimal place (e.g., 10/80=0.125) are always rounded up; if such means are allowed to be rounded up or down, a few extra white dots will appear at sample sizes that are multiples of 8.

Figure 1 shows the distribution of consistent (shown in white) and inconsistent (shown in black) means as a function of the sample size. Note that only the two-digit fractional portion of each mean is linked to consistency; the integer portion plays no role. The overall pattern is clear: As the sample size increases, the number of means that are consistent with that sample size also increases, and so the chance that any single incorrectly-reported mean will be detected as inconsistent is reduced. However, even with quite large sample sizes, it is still possible to detect inconsistent means if an article contains multiple inconsistencies. For example, consider a study with N=75 and six reported "means" whose values have, in fact, been chosen at random: There is a 75% chance that any one random "mean" will be consistent, but only a 17.8% (0.75^6) chance that all six will be.
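This compounding of evidence across multiple reported means can be computed directly. A quick illustration (our own sketch, assuming the random "means" are independent):

```python
def prob_all_consistent(n, k, decimals=2):
    """Approximate probability that k independent random 'means' for sample
    size n (with n < 10**decimals) are all GRIM-consistent by chance alone.
    A single random mean is consistent in roughly n / 10**decimals of cases."""
    p_single = n / 10**decimals
    return p_single ** k

# The example from the text: N=75 with six randomly chosen "means".
print(round(prob_all_consistent(75, 1), 3))  # 0.75
print(round(prob_all_consistent(75, 6), 3))  # 0.178
```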

Our general formula, then, is that when the number of participants (N) is multiplied by the number of items composing a measured quantity (L, commonly equal to 1), and the means that are based on N are reported to D decimal places, then if (N × L) < 10^D, there exists some number of decimal fractions of length D that cannot occur if the means are reported correctly. The number of inconsistent values is generally equal to (10^D − N); however, in the analyses reported in the present article, we conservatively allowed numbers ending in exactly 5 at the third decimal place to be rounded either up or down without treating the resulting means as inconsistent, so that some values of N have fewer possible inconsistent means than this formula indicates.

Using the GRIM technique, it is possible to examine published reports of empirical research to see whether the means have been reported correctly [1]. Psychological journals typically require the reporting of means to two decimal places, in which case the sample size corresponding to each mean must be less than 100 in order for its consistency to be checked. However, since the means of interest in experimental psychology are often those for subgroups of the overall sample (for example, the numbers in each experimental condition), it can still be possible to apply the GRIM technique to studies with overall sample sizes substantially above 100. (Note that percentages reported to only one decimal place can typically be tested for consistency with a sample size of up to 1000, as they are, in effect, fractions reported to three decimal places.)

[1] We have provided a simple spreadsheet at https://osf.io/3fcbr that automates the steps of this procedure.
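Because a percentage reported to one decimal place is in effect a three-decimal fraction, the same consistency check extends naturally to counts out of samples below 1000. A sketch of such a check (our own illustration, not the authors' spreadsheet):

```python
from fractions import Fraction

def grim_percent(pct_str, n):
    """Check whether a percentage reported to one decimal place (e.g. '34.8')
    could arise from some integer count out of n. Equivalent to a GRIM test
    on a fraction reported to three decimal places; informative for n < 1000.
    Ties rounded in either direction are accepted."""
    reported = Fraction(pct_str) / 100      # e.g. 0.348
    half_ulp = Fraction(1, 2000)            # half of 0.001 (one-decimal %)
    count = round(reported * n)             # nearest candidate count
    return any(abs(Fraction(c, n) - reported) <= half_ulp
               for c in (count - 1, count, count + 1))

# Hypothetical example with n=276 respondents:
print(grim_percent("34.8", 276))   # True:  96/276 = 34.78...% rounds to 34.8
print(grim_percent("34.9", 276))   # False: no count gives 34.9%
```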

We now turn to our pilot trial of the GRIM test.

Method

We searched recently published (2011–2015) issues of Psychological Science (PS), Journal of Experimental Psychology: General (JEP:G), and Journal of Personality and Social Psychology (JPSP) for articles containing the word "Likert" anywhere in the text. This strategy was chosen because we expected to find Likert-type data reported in most of the articles containing that word (although we also checked the consistency of the means of other integer data where possible). We sorted the results with the most recent first and downloaded at most the first 100 matching articles from each journal. Thus, our sample consisted of 100 articles from PS published between January 2011 and December 2015, 60 articles from JEP:G published between January 2011 and December 2015, and 100 articles from JPSP published between October 2012 and December 2015.

We examined the Method section of each study reported in these articles to see whether GRIM-testable measures were used, and to determine the sample sizes for the study and, where appropriate, each condition. A preliminary check was performed by the first author; if he did not see evidence of either GRIM-testable measures, or any (sub)sample sizes less than 100, the article was discarded. Subsequently, each author worked independently on the retained articles. We examined the table of descriptives (if present), other result tables, and the text of the Results section, looking for means or percentages that we could check using the GRIM technique. On the basis of our tests, we assigned each article a subjective "inconsistency level" rating. A rating of 0 (no problems) meant that all the means we were able to check were consistent, even if those means represented only a small percentage of the reported data in the article. We assigned a rating of 1 (minor problems) to articles that contained only one or two inconsistent numbers, where we believed that these were most parsimoniously explained by typographical or transcription errors, and where an incorrect value would have little effect on the main conclusions of the article. Articles that had a small number of inconsistencies that might impact the principal results were rated at level 2 (moderate problems); we also gave this rating to articles in which the results seemed to be uninterpretable as described. Finally, we applied a rating of 3 (substantial problems) to articles with a larger number of inconsistencies, especially if these appeared at multiple points in the article. Ratings were then compared between the authors and differences resolved by discussion.

Results

The total number of articles examined from each journal, the number retained for GRIM analysis, and the number to which we assigned each rating, are shown in Table 1. A total of 260 articles were initially examined. Of these, 189 (72.7%) were discarded, principally because either they reported no GRIM-testable data or their sample sizes were all sufficiently large that no inconsistent means were likely to be detected. Of the remaining 71 articles, 35 (49.3%) reported all GRIM-testable data consistently and were assigned an inconsistency level rating of 0. That left us with 36 articles that appeared to contain one or more inconsistencies. Of these, we assigned a rating of 1 to 15 articles (21.1% of the 71 in total for which we performed a GRIM analysis), a rating of 2 to five articles (7.0%), and a rating of 3 to 16 articles (22.5%). In some of these "level 3" articles, over half of the GRIM-testable values were inconsistent with the stated sample size.

Table 1. Journals and Articles Consulted

Journal                                    PS            JEP:G         JPSP          Total
Number of articles                         100           60            100           260
Earliest article date                      January 2011  January 2011  October 2012
Articles with GRIM-testable data           29            15            27            71
Level 0 articles (no problems detected)    16            8             11            35
Level 1 articles (minor problems)          5             3             7             15
Level 2 articles (moderate problems)       1             1             3             5
Level 3 articles (substantial problems)    7             3             6             16

Notes: PS = Psychological Science. JEP:G = Journal of Experimental Psychology: General. JPSP = Journal of Personality and Social Psychology.

Next, we e-mailed [2] the corresponding authors of the articles that were rated at level 2 or 3, asking for their data. In response to our 21 initial requests, we received 11 replies within two weeks. At the end of that period, we sent follow-up requests to the 10 authors who had not replied to our initial e-mail. In response to either the first or second e-mail, we obtained the requested data from eight authors, while a ninth provided us with sufficient information about the data in question to enable us to check the consistency of the means. Four authors promised to send the requested data, but have not done so to date. Five authors either directly or effectively refused to share their data, even after we explained the nature of our study; interestingly, two of these refusals were identically worded. In another case, the corresponding author's personal e-mail address had been deleted; another author informed us that the corresponding author had left academia, and that the location of the data was unknown. Finally, two of our requests went completely unanswered after the second e-mail.

[2] The text of our e-mails is available in the supplementary information for this article.

Our examination of the data that we received showed that the GRIM technique identified one or more genuine problems in each case. We report the results of each analysis briefly here, in the order in which the data were received.

Data set 1. Our GRIM analysis had detected two inconsistent means in a table of descriptives, as well as eight inconsistent standard deviations [3]. Examining the data, we found that the two inconsistent means and one of the inconsistent SDs were caused by the sample size for that cell not corresponding to the sample size for the column of data in question; five SDs had been incorrectly rounded because the default (3 decimal places) setting of SPSS had caused a value of 1.2849 to be rounded to 1.285, which the authors had subsequently rounded manually to 1.29; and two further SDs appeared to have been incorrectly transcribed, with values of 0.79 and 0.89 being reported as 0.76 and 0.86, respectively. All of these errors were minor and had no substantive effect on the published results of the article.

[3] SDs exhibit granularity in an analogous way to means, but the determination of (in)consistency for SDs is considerably more complicated. We hope to cover the topic of inconsistent SDs in a future article.

Data set 2. Our reading of the article in this case had detected several inconsistent means, as well as several inconsistently-reported degrees of freedom and apparent errors in the reporting of some other statistics. Examination of the data confirmed most of these problems, and indeed revealed a number of additional errors in the authors' analysis. We subsequently discovered that the article in question had already been the subject of a correction in the journal, although that had not addressed most of the problems that we found. We will write to the authors to suggest a number of points that require (further) correction.

Data set 3. In this case, our GRIM analysis had shown a large number of inconsistent means in two tables of descriptives. The corresponding author provided us with an extensive version of the data set, including some intermediate analysis steps. We identified that most of the entries in the descriptives had been calculated using a Microsoft Excel formula that included an incorrect selection of cells; for example, this resulted in the mean and SD of the first experimental condition being included as data points in the calculation of the mean and SD of the second. The author has assured us that a correction will be issued.

Data set 4. In the e-mail accompanying their data, the authors of this article spontaneously apologized in advance (even though we had not yet told them exactly why we were asking for their data) for possible discrepancies between the sample sizes in the data and those reported in the article. They stated that, due to computer-related issues, they had only been able to retrieve an earlier version of the data set, rather than the final version on which the article was based. We adjusted the published sample sizes using the notes that the authors provided, and found that this adequately resolved the GRIM inconsistencies that we had identified.

Data set 5. The GRIM analyses in this case found some inconsistent means in the reporting of the data that were used as the input to a number of t tests, as well as in the descriptives for one of the conditions in the study. Analysis revealed that the former problems were the result of the authors having reported the Ns from the output of a repeated-measures ANOVA in which some cases were missing, so that these Ns were smaller than those reported in the method section. The problems in the descriptives were caused by incorrect reporting of the number of participants who were excluded from analyses. We were unable to determine to what extent this difference affected the results of the study.

Data set 6. Here, the inconsistencies that we detected were mostly due to the misreporting by the authors of their sample size. This was not easy to explain as a typographical error, as the number was reported as a word at the start of a sentence (e.g., "Sixty undergraduates took part"). Additionally, one inconsistent standard deviation turned out to have been incorrectly copied during the drafting process.

Data set 7. This data set confirmed numerous inconsistencies, including large errors in the reported degrees of freedom for several F tests, from which we had inferred the per-cell sample sizes. Furthermore, a number that was meant to be the result of subtracting one Likert-type item score from another (thus giving an integer result) had the impossible value of 1.5. We reported these inconsistencies to the corresponding author, but received no acknowledgement.

Data set 8. The corresponding author indicated that providing the full data set could be complicated, as the data were taken from a much larger longitudinal study. Instead, we provided a detailed explanation of the specific inconsistencies we had found. The author checked these and confirmed that the sample size of the study in question had been reported incorrectly, as several participants had been excluded from the analyses but not from the reported count of participants. The author thanked us for finding this minor (to us) inconsistency and described the exercise as "a good lesson."

Data set 9. In this case, we asked for data for three studies from a multiple-study article. In the first two studies, we found some reporting problems with standard deviations in the descriptives and some other minor problems to do with the handling of missing values for some variables. For the third study, however, the corresponding author reported that, during the process of preparing the data set to send to us, an error in the analyses had been discovered that was sufficiently serious as to warrant a correction to the published article.

For completeness, we should also mention that in one of the cases above, the data that we received showed that we had failed to completely understand the original article; what we had thought were inconsistencies in the means on a Likert-type measure were due to that measure being a multiple-item composite, and we had overlooked that it was correctly reported as such. While our analysis also discovered separate problems with the article in question, this underscores how careful reading is always necessary when using the GRIM technique.

Discussion

We identified a simple method for detecting discrepancies in the reporting of statistics derived from integer-based data, and applied it to a sample of empirical articles published in leading journals of psychology. Of the articles that we were able to test, around half appeared to contain one or more errors in the summary statistics. (We have no way of knowing how many inconsistencies might have been discovered in the articles with larger samples, had it been standard practice to report means to three decimal places.) Nine data sets were examined in more detail, and we confirmed the existence of reporting problems in all nine, with three articles requiring formal corrections.

We anticipate that the GRIM technique could be a useful tool for reviewers and editors. A GRIM check of the reported means of an article submitted for review ought to take only a few minutes. (Indeed, we found that even when no inconsistencies were uncovered, simply performing this check enhanced our understanding of the methods used in the articles that we read.) When GRIM errors are discovered, depending on their extent and how the reviewer feels they impact the article, actions could range from asking the authors to check a particular calculation, to informing the action editor confidentially that there appear to be severe problems with the manuscript.

When an inconsistent mean is uncovered by this method, we of course have no information about the true mean value that was obtained; that can only be determined by a reanalysis of the original data. But such an inconsistency does indicate, at a minimum, that a mistake has been made. When multiple inconsistencies are demonstrated in the same article, we feel that the reader is entitled to question what else might not have been reported accurately. Note also that not all incorrectly reported means will be detected using the GRIM technique, because such a mean can still be consistent by chance. With reporting to two decimal places, for a sample size N<100, a random "mean" value will be consistent in approximately N% of cases. Thus, the number of GRIM errors detected in an article is likely to be a conservative estimate of the true number of such errors.

A limitation of the GRIM technique is that, with the standard reporting of means to two decimal places, it cannot reveal inconsistencies with per-cell sample sizes of 100 or more, and its ability to detect such inconsistencies decreases as the sample size (or the number of items in a composite measure) increases. However, this still leaves a substantial percentage of the literature that can be tested. Recall that we selected our articles from some of the highest-impact journals in the field; it might be that other journals have a higher proportion of smaller studies. Additionally, it might be the case that smaller studies are more prone to reporting errors (for example, because they are run by laboratories that have fewer resources for professional data management).

A further potential source of false positives is the case where one or more participants are missing values for individual items in a composite measure, thus making the denominator for the mean of that measure smaller than the overall sample size. However, in our admittedly modest sample of articles, this issue only caused inconsistencies in one case. We believe that this limitation is unlikely to be a major problem in practice because the GRIM test is typically not applicable to measures with a large number of items, due to the requirement for the product of the per-cell sample size and the number of items to be less than 100.

Concluding remarks

On its own, the discovery of one or more inconsistent means in a published article need not be a cause for alarm; indeed, we discovered from our reanalysis of data sets that in many cases where such inconsistencies were present, there was a straightforward explanation, such as a minor error in the reported sample sizes, or a failure to report the exclusion of a participant. Sometimes, too, the reader performing the GRIM analysis may make errors, such as not noticing that what looks like a single Likert-type item is in fact a composite measure.

It might also be that psychologists are simply sometimes rather careless in retyping numbers from statistical software packages into their articles. However, in such cases, we think it is legitimate to ask how many other elementary mistakes might have been made in the analysis of the data, and with what effects on the reported results. It is interesting to compare our experiences with those of Wolins (1962), who asked 37 authors for their data, obtained these in usable form from seven authors, and found "gross errors" in three cases. While the numbers of studies in both Wolins' and our cases are small, the percentage of severe problems is, at an anecdotal level, worrying. Indeed, we wonder whether some proportion of the failures to replicate published research in psychology (Open Science Collaboration, 2015) might simply be due to the initial (or, conceivably, the replication) results being the products of erroneous analyses.

Beyond inattention and poorly-designed analyses, however, we cannot exclude that in some cases, a plausible explanation for GRIM inconsistencies is that some form of data manipulation has taken place. For example, in the fictional extract at the start of this article, here is what should have been written in the last sentence:

Participants in the “drank the Kool-Aid” condition did not report a significantly stronger belief in the ability of pigs to fly (M = 4.79, SD = 1.34) than those in the control condition (M = 4.26, SD = 1.41), t(53) = 1.43, p = .16.

In the “published” extract, compared to the above version, the first mean was “adjusted” by adding 0.40, and the second by subtracting 0.40. This transformed a non-significant p value into a significant one, thus making the results considerably easier to publish (cf. Kühberger, Fritz, & Scherndl, 2014).

We are particularly concerned about the eight data sets (out of the 21 we requested) that we believe we may never see (five due to refusals to share the data, two due to repeated non-response to our requests, and one due to the apparent disappearance of the corresponding author). Refusing to share one’s data for reanalysis without giving a clear and relevant reason is, we feel, professionally disrespectful at best, especially after authors have assented to such sharing as a condition of publication, as is the case in (for example) APA journals such as JPSP and JEP:G.

We support the principle, currently being adopted by several journals, that sharing of data ought to be the default situation, with authors having to provide strong arguments why their data cannot be shared in any given case. When accompanied by numerical evidence that the results of a published article may be unreliable, a refusal to share data will inevitably cause speculation about what those data might reveal. However, throughout the present article, we have refrained from mentioning the titles, authors, or any other identifying features of the articles in which the GRIM analysis identified apparent inconsistencies. There are three reasons for this. First, the GRIM technique was exploratory when we started to examine the published articles, rather than an established method. Second, there may be an innocent explanation for any or all of the inconsistencies that we identified in any given article. Third, it is not our purpose here to “expose” anything or anyone; we offer our results in the hope that they will stimulate discussion within the field. It would appear, as a minimum, that we have identified an issue worthy of further investigation, and produced a tool that might assist reviewers of future work, as well as those who wish to check certain results in the existing literature.

References

Carifio, J., & Perla, R. J. (2007). Ten common misunderstandings, misconceptions, persistent myths and urban legends about Likert scales and Likert response formats and their antidotes. Journal of Social Sciences, 3, 106–116. http://dx.doi.org/10.3844/jssp.2007.106.116

Coombs, C. H. (1960). A theory of data. Psychological Review, 67, 143–159. http://dx.doi.org/10.1037/h0047773

Jamieson, S. (2004). Likert scales: How to (ab)use them. Medical Education, 38, 1212–1218. http://dx.doi.org/10.1111/j.1365-2929.2004.02012.x

Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS ONE, 9(9), e105825. http://dx.doi.org/10.1371/journal.pone.0105825

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. http://dx.doi.org/10.1126/science.aac4716

Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286. http://dx.doi.org/10.1037/h0070288

Wolins, L. (1962). Responsibility for raw data. American Psychologist, 17, 657–658. http://dx.doi.org/10.1037/h0038819

Supplemental Information

1. A numerical demonstration of the GRIM technique

For readers who prefer to follow a worked example, we present here a simple method for performing the GRIM test to check the consistency of a mean. We assume that some quantity has been measured as an integer across a sample of participants and reported as a mean to two decimal places. For example:

Participants (N = 52) responded to the manipulation check question, “To what extent did you believe the research assistant’s story that the dog had eaten his homework?” on a 1–7 Likert-type scale. Results showed that they found the story convincing (M = 6.28, SD = 1.22).

In terms of the formulae given earlier, N (the number of participants) is 52, L (the number of Likert-type items) is 1, and D (the number of decimal places reported) is 2. Thus, 10^D is 100, which is greater than N × L (52), and so the means here are GRIM-testable. The first step is to multiply the sample size (52) by the number of items (1), giving 52. Then, take that product and multiply it by the reported mean. In this example, that gives (6.28 × 52) = 326.56. Next, round that product to the nearest integer (here, we round up to 327). Now, divide that integer by the sample size, rounding the result to two decimal places, giving (327 / 52) = 6.29. Finally, compare this result with the original mean. If they are identical, then the mean is consistent with the sample size and integer data; if they are different, as in this case (6.28 versus 6.29), the mean is inconsistent.
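The steps above can be sketched as a short function. This is our own minimal sketch, not part of any analysis code from the study; the function name and interface are illustrative:

```python
def grim_consistent(reported_mean, n, n_items=1, decimals=2):
    """Check whether a reported mean is consistent with integer data.

    reported_mean: the mean as printed in the article (e.g., 6.28)
    n:             the sample size
    n_items:       number of averaged integer items (1 for a single item)
    decimals:      decimal places to which the mean was reported
    """
    denominator = n * n_items
    # Reconstruct the most plausible integer sum behind the reported mean...
    possible_sum = round(reported_mean * denominator)
    # ...then recompute the mean from that sum and round as the article did.
    recomputed = round(possible_sum / denominator, decimals)
    return recomputed == round(reported_mean, decimals)

# The worked example: M = 6.28 with N = 52 is inconsistent,
# whereas M = 6.29 with N = 52 would be consistent (327/52 = 6.29).
print(grim_consistent(6.28, 52))  # False
print(grim_consistent(6.29, 52))  # True
```

Note that a more careful implementation would also consider the difference between round-half-up and round-half-to-even at exact .5 boundaries, which can affect a small number of edge cases.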

When the quantity being measured is a composite Likert-type measure, or some other simple fraction, it may still be GRIM-testable. For example:

Participants (N = 21) responded to three Likert-type items (0 = not at all, 4 = extremely) asking them how rich, famous, and successful they felt. These items were averaged into a single measure of fabulousness (M = 3.77, SD = 0.63).

In this case, the measured quantity (the mean score for fabulousness) can take on the values 1.00, 1.33, 1.66, 2.00, 2.33, 2.66, 3.00, etc. The granularity of this quantity is thus finer than if it had been reported as an integer (e.g., if the mean of the total scores for the three components, rather than the mean of the means of the three components, had been reported). However, the sample size is sufficiently small that we can still perform a GRIM test, by multiplying the sample size by the number of items that were averaged to make the composite measure (i.e., three) before performing the steps just indicated. In terms of our formulae, N is 21, L is 3, and D is again 2. 10^D is 100, which is greater than N × L (63), so the means here are again GRIM-testable. Thus, in this case, we multiply the sample size (21) by the number of items (3) to get 63; multiply this result by the reported mean (3.77), giving 237.51; round 237.51 to the nearest integer (238); divide 238 by 63 to get 3.777, which rounds to 3.78; and observe that, once again, this mean is inconsistent with the reported sample size.

Professor Irving Herman (personal communication, August 11, 2016) has pointed out that another (equivalent) method to test for inconsistency is to multiply the minimum and maximum possible unrounded mean values by the sample size, and see if the integer parts of the results are the same; if so, the mean is inconsistent, as there is no integer that it could represent. For example, a reported mean of 5.19 could correspond to any value between 5.185 and 5.195. With a sample size of 28, the range of possible values is from (5.185 × 28) = 145.18 to (5.195 × 28) = 145.46; thus, there is no possible integer that, when divided by the sample size, could give a mean that would round to 5.19.
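This equivalent interval check can also be sketched in a few lines. Again, this is our own illustrative sketch (and it ignores exact boundary ties at the interval endpoints, which rarely arise in floating-point arithmetic):

```python
import math

def grim_consistent_interval(reported_mean, n, decimals=2):
    """Herman's variant: a mean reported to `decimals` places could arise
    from any unrounded value within half a reporting unit of it. If no
    multiple of 1/n falls in that interval, the mean is inconsistent."""
    half_unit = 0.5 / (10 ** decimals)
    low = (reported_mean - half_unit) * n
    high = (reported_mean + half_unit) * n
    # An integer sum exists in the interval when the integer parts of the
    # two endpoints differ (or the lower endpoint is itself an integer).
    return math.floor(high) > math.floor(low) or low == math.floor(low)

# The worked example: M = 5.19 with N = 28 is inconsistent, because
# 145.18 and 145.46 share the integer part 145.
print(grim_consistent_interval(5.19, 28))  # False
```

Either formulation gives the same verdicts on the examples above; the interval version makes it explicit that the test is asking whether any integer sum could have produced the reported mean.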

2. E-mails sent to authors to request sharing of data

We show here the e-mails that were sent to the authors of the articles in which we found apparent problems, to request that they share their data. In some cases there were minor variations in wording or punctuation.

The first e-mail, sent in late January 2016:

Dear Dr. <name>,

We have read with interest your article “<title>”, published in <year> in <Journal>. We are interested in reproducing the results from this article as part of an ongoing project concerning the nature of published data.

Accordingly, we request you to provide us with a copy of the dataset for your article, in order to allow us to verify the substantive claims of your article through reanalysis. We can read files in SPSS, XLS[x], RTF, TXT, and most proprietary file types (e.g., .MAT).

Thank you for your time.

Sincerely,

Nicholas J. L. Brown
PhD candidate, University Medical Center, Groningen

James Heathers
Postdoctoral fellow, Poznań University of Medical Sciences

We took the words “verify the substantive claims [of your article] through reanalysis” directly from article 8.14, “Sharing Research Data for Verification”, of the American Psychological Association’s ethics code (http://memforms.apa.org/apa/cli/interest/ethics1.cfm). In the case of articles published in JEP:G and JPSP, we knew that the corresponding author had explicitly agreed to these conditions by signing a copy of a document entitled “Certification of Compliance With APA Ethical Principles” prior to publication.

The second e-mail, sent about 10 days after the first if we had received no reply to the first:

Dear Dr. <name>,

Not having received a reply to our first e-mail (see below), we are writing to you again. We apologise if our first message was a little cryptic.

We are working on a technique that we hope will become part of the armoury of peer reviewers when checking empirical papers, which we hope will allow certain kinds of problems with the reporting of statistics to be detected from the text. Specifically, we look at means that do not appear to be consistent with the reported sample size. From a selection of articles that we have analysed, yours appears to be a case where our technique might be helpful (if we have understood your method section correctly).

However, we are still refining our technique, which is why we are asking 20 or so authors to provide us with data so that we can check that we have fully understood their methods, and see how we should refine the description of our technique to make it as specific and selective as possible. Comparing the results of its application with the numbers in the dataset(s) corresponding to the articles that we have identified will hopefully enable us to understand this process better. So if you could provide us with your data from this article, that would be very helpful.

Kind regards,

Nick Brown
James Heathers