All content in this area was uploaded by Michael Chmielewski on Oct 15, 2019
Running head: AN MTURK CRISIS? 1
Accepted for publication (August, 2019) in Social Psychological and Personality Science
An MTurk crisis? Shifts in data quality and the impact on study results.
Michael Chmielewski; Southern Methodist University
Michael Chmielewski is an Associate Professor of Psychology at Southern Methodist University.
His research focuses on measurement, construct validity, and quantitative models of individual
differences in personality and psychopathology.
Sarah C. Kucker; Oklahoma State University
Sarah Kucker is an Assistant Professor of Psychology at Oklahoma State University. Her
research examines individual differences in the cognitive mechanisms of language and category
development.
Acknowledgements: This research was supported by a Postdoctoral Fellowship from the Callier
Center for Communication Disorders at the University of Texas Dallas and a Faculty Small
Grant at the University of Wisconsin Oshkosh, both to SCK. We thank Jennifer Burson and
Aaron Bagley.
Manuscript Word Count:4950
Abstract
Amazon’s Mechanical Turk (MTurk) is arguably one of the most important research tools of the
past decade. The ability to rapidly collect large amounts of high-quality human subjects’ data has
advanced multiple fields, including personality and social psychology. Beginning in summer
2018, concerns arose regarding MTurk data quality leading to questions about the utility of
MTurk for psychological research. We present empirical evidence of a substantial decrease in
data quality, using a four-wave naturalistic experimental design: pre-, during, and post-summer
2018. During, and to some extent post-summer 2018, we find significant increases in participants
failing response validity indicators, decreases in reliability and validity of a widely used
personality measure, and failures to replicate well-established findings. However, these
detrimental effects can be mitigated by using response validity indicators and screening the data.
We discuss implications and offer suggestions to ensure data quality.
Abstract word count: 142
Keywords: Amazon Mechanical Turk, TurkGate, data quality, replication, online samples
An MTurk crisis? Shifts in data quality and the impact on study results
Amazon’s Mechanical Turk (MTurk) launched in 2005 as a crowdsourcing marketplace
allowing individuals (Turkers) to complete human intelligence tasks (HITs). Recent years have
seen an exponential increase in psychological studies being conducted on MTurk (Buhrmester,
Talaifar, & Gosling, 2018), likely because MTurk is an efficient method for collecting large
amounts of data (Buhrmester, Kwang, & Gosling, 2011; Buhrmester et al., 2018; Stewart et al.,
2015). Moreover, there has been considerable evidence that MTurk data are equivalent or
superior in quality to data collected in the lab, from professional online panels, and using
marketing research companies (Behrend, Sharek, Meade, & Wiebe, 2011; Buhrmester et al.,
2011; Casler, Bickel, & Hackett, 2013; Kees, Berry, Burton, & Sheehan, 2017; Paolacci &
Chandler, 2014). This has been demonstrated across a variety of study designs and numerous
types of data (Behrend et al., 2011; Buhrmester et al., 2011; Casler et al., 2013; Eriksson &
Simpson, 2010; Goodman, Cryder, & Cheema, 2013; Horton, Rand, & Zeckhauser, 2011; Kees
et al., 2017; Mason & Watts, 2009; Paolacci & Chandler, 2014; Shapiro, Chandler, & Mueller,
2013; Suri & Watts, 2011). MTurk samples are also more representative of the general
population than student samples (Buhrmester et al., 2011; Goodman et al., 2013). Furthermore,
compensation level appears to have no impact on data quality (Buhrmester et al., 2011; Marge,
Banerjee, & Rudnicky, 2010; Mason & Watts, 2009). Indeed, the cost effectiveness of MTurk
allows for a more diverse range of researchers, including those from smaller less-funded
universities, to provide meaningful contributions to psychological research.
There are longstanding recommendations to include response validity indicators when
using MTurk to remove invalid or low quality data (Barger, Behrend, Sharek, & Sinar, 2011;
Kittur, Chi, & Suh, 2008; Mason & Suri, 2011; Zhu & Carterette, 2010). However, such data
screening attempts are rarely reported in the literature (Wood, Harms, Lowman, & DeSimone,
2017). This may be because prior studies led researchers to believe data screening is
unnecessary and that any MTurk data will be high quality. In fact, Sheehan (2018, p. 8) recently
concluded, “there is minimal cheating on MTurk”.
However, starting in summer 2018, social media and online discussions emerged
expressing concerns about “bots” (computer programs that automatically complete HITs) and/or
“farmers” (individuals using server farms to bypass MTurk location restrictions) on MTurk.
Bots had received minimal attention in the literature (McCreadie, Macdonald, & Ounis, 2010)
and farmers were previously unheard of, with little evidence that either constituted a problem.
These new concerns raised alarms about fundamental shifts in MTurk data quality (Stokel-
Walker, 2018) and declarations of a “bot panic” (Dreyfuss, 2018). However, others suggested
the apparent shift in data quality is illusory and the result of poorly designed studies (Dreyfuss,
2018). Although potentially informative, social media or blog posts are not systematic empirical
evaluations. To date, there have been no peer-reviewed empirical studies testing whether there has
been an increase in invalid or poor-quality responses and no examination of whether the results
of studies conducted on MTurk have been negatively affected.
Current Study
The aims of the current research are to 1) examine whether rates of participants providing
low quality data increased in summer 2018 and, if so, whether it persists, 2) test if results using
MTurk are negatively affected, and 3) determine the degree to which including response validity
indicators and screening data improves data quality. A naturalistic experiment, with the exact
same study conducted before and after concerns about MTurk data quality emerged, removes
new study flaws as the primary driver of purported issues. Including a widely used measure with
strong psychometric properties enables comparisons with established psychometric baselines,
testing of potential differences between data collected prior to and following concerns, and
examination of the impact of data screening. Finally, testing well-established findings is another
way to determine if there has been a change in MTurk data quality. In fact, these approaches are
how researchers originally established MTurk’s high quality data (Behrend et al., 2011;
Buhrmester et al., 2011; Casler et al., 2013; Goodman et al., 2013; Kees et al., 2017; Mason &
Watts, 2009; Shapiro et al., 2013).
In the current research we re-analyzed data collected for a study on parental personality
and its role in child development (redacted, submitted for publication). Data for that study were
collected in three waves coinciding with funding awards to the second author. Waves 1 and 2 were
conducted prior to, and wave 3 during, the period when concerns regarding MTurk surfaced
(summer 2018). The original exploratory analysis with waves 1-3 revealed differences in data
quality. Therefore, we collected a new 4th wave of data in spring 2019, hypothesizing that wave 4
would fall between the first two waves and wave 3 in terms of data quality (specific hypotheses,
analyses, and how results would be interpreted were preregistered at
https://osf.io/4qgz6/?view_only=5df995d863714c0d841a4793a9306a96). Critically, the exact
same study was conducted at each wave.
First, we report on the frequency of failed validity indicators in a pre-screen survey and in
each wave of the primary study. Then, in each wave, we examine the reliability and the internal
validity of the Big Five Inventory (BFI; John, Donahue, & Kentle, 1991) and test whether well-
established associations with psychopathology replicate (Bagby et al., 1996; Clark & Watson,
1991; Kotov, Gamez, Schmidt, & Watson, 2010; Krueger & Tackett, 2006; Malouff,
Thorsteinsson, & Schutte, 2005; Widiger & Trull, 2007). Given most studies do not report
screening MTurk data (Wood et al., 2017), analyses were first conducted using all participants.
Then, in order to test the impact of screening, we remove participants who failed response
validity indicators and reanalyze the data.
Methods
Procedure
Waves 1 (December 2015 - January 2016) and 2 (March 2017 - May 2017) were
conducted prior to MTurk data quality concerns. Wave 3 (July 2018 - September 2018) was
conducted during, and Wave 4 (April 2019) after, the period when concerns emerged. The exact same
study was conducted at each wave (i.e., identical Qualtrics surveys: screening questions,
demographic questions, measures; and recruitment: HIT posting, Turker requirements,
compensation, researcher account). The only differences were the dates and the consent wording
(pursuant to the second author’s affiliation at the time). In order to recruit enough parents of 8- to
42-month-olds for the original study, HIT requirements were set to an approval rate of ≥ 85%, ≥ 50
approved HITs, and within the US. Although it is common for studies to not report their HIT
requirements, current requirements were slightly lower than those that are reported. Therefore,
to examine the potential impact of lower HIT requirements, we collected an additional wave
(Wave 4a; N=110) immediately after wave 4 using requirements often used in the literature
(≥ 90% approval rate, ≥ 100 approved HITs). Results were very similar to wave 4.¹ We examine
data from waves 1-4 separately to allow for comparisons between waves and to determine
whether established psychometric properties and empirical findings from the broader literature
replicate.
¹ 34% of participants failed validity checks in wave 4a compared to 38% in wave 4, χ2(1, N=411)=.722,
p=.396; comparisons of wave 4a with all other waves were the same as they were for wave 4. Other
results in wave 4a were similar to wave 4; results for wave 4a are reported in online supplemental
materials (https://osf.io/t957e/?view_only=843912cfc81f4f34a678d75c7551014f).
Participants
Prescreen. 5,266 participants (wave 1=1128; wave 2=1899; wave 3=1006; wave
4=1233) responded to the prescreen. Eligible participants were automatically forwarded to the
primary study.
Primary study. 1,338 participants completed the primary study; however, 107 within-wave
duplicate MTurk IDs, which occurred despite Qualtrics Ballot Box stuffing prevention being
enabled, were dropped. Turkers were not restricted from participating across waves, although fewer
than 2% of any wave had participated in prior waves. The final sample consisted of 1,231
participants (M age=31.7 years old, SD=6.17 years; 57% female; 79.0% White, 13.8% Black,
5.0% Asian, 2.2% mixed/other; median education, 4-year college degree). This sample size was
appropriate for the original study, which sought to get equal numbers of parents of children in
three age groups. Parents completed demographic and family history questions, a child
vocabulary inventory, a child temperament measure, and self-report personality measures. We
focus on the latter (BFI) because it is the most well validated measure and has strong
psychometric properties in MTurk samples.² A total of 260 individuals participated in wave 1, 370 in
wave 2, 300 in wave 3, and 301 in wave 4. This equates to 80% power to detect a correlation of
r=.16 in the smallest wave.
² A subset of participants also completed the IPIP-NEO; although results were similar, the sample size
was too small to justify inclusion. The child measures are not as established, rarely used on MTurk, and
the specific measure varied depending on the child’s age, resulting in small samples for each measure at
each wave.
Measures
Prescreen. Turkers who indicated they were English-speaking parents were asked a third
question, used as a validity indicator here, where they entered their number of children in each of
nine age ranges, in roughly two-year increments. Responses that were highly improbable or
statistically illogical (i.e., >10 children, >4 children of the same age) were flagged because only
1.3% of the U.S. population meets such criteria (U.S. Census Bureau, 2018).
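The prescreen flagging rule described above can be sketched in a few lines. This is an illustrative sketch rather than the code used in the study: the function name and the list-of-counts input format are assumptions, and ">4 children of the same age" is interpreted here as more than four children reported within a single age range.

```python
def is_improbable(children_per_age_range, max_total=10, max_per_range=4):
    """Return True if a prescreen response exceeds either threshold in the text.

    children_per_age_range: hypothetical input format, one count per age
    range (the prescreen used nine ranges).
    """
    total = sum(children_per_age_range)
    too_many_total = total > max_total
    too_many_in_one_range = any(n > max_per_range for n in children_per_age_range)
    return too_many_total or too_many_in_one_range


# Two children in different age ranges: plausible, so not flagged.
print(is_improbable([1, 1, 0, 0, 0, 0, 0, 0, 0]))
# Five children reported in a single age range: flagged.
print(is_improbable([5, 0, 0, 0, 0, 0, 0, 0, 0]))
# Eleven children in total, never more than two per range: flagged.
print(is_improbable([2, 2, 2, 2, 2, 1, 0, 0, 0]))
```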
Primary study. Demographic and psychopathology questions. Individuals completed
demographic questions, questions about their children (e.g. number, ages, birthdates), self/family
history of language disorders, and self/family history of mood disorders (“Do you or any of your
immediate family members currently have or have a history of any of the following?:
Depressive Disorder [e.g., Depression, Dysphoria, Dysthymia]; other mood disorder”). These
mood disorder items are particularly relevant because of their well-documented links with
personality.
Response validity indicators. We examined three traditional validity indicators and one
exploratory validity indicator based on recommendations from the literature and MTurk forums.
Traditional.
Response inconsistency. Participants were flagged if they failed to consistently report
their child’s age (i.e., age in years/months, child’s birthdate, and selection of the child’s age
range from a drop-down menu) on at least two of three different Qualtrics pages approximately
10 minutes apart.
Statistically improbable responses. Participants entered the number of children they had
within several different age brackets (like the prescreen) and selected disorders their child had
from a list of ten. Statistically improbable numbers of children (>10 total children, >4 of a single
age) or improbable combinations of disorders (e.g. both Williams and Down syndrome; >4
disorders) were flagged.
Disqualified. Participants who indicated that they were bilingual or did not have children
in the required age range were flagged because they must have provided different prescreen
answers to these questions, indicating random responding or repeated prescreen attempts.
Exploratory.
Unusual comments. Social media posts specifically noted an increase in unusual
responses to open-ended questions (Bai, 2018). Therefore, we examined two free response
questions: “What are the longest 2-3 sentences your child has produced?” and “Do you have any
final comments/questions?” Responses written in all capital letters, single words that did not
align with the question (e.g. “nice”, “good”), noticeably ungrammatical or nonsense phrases,
and/or phrases that repeated portions of the question (e.g. “yes longest phrase”) have been
reported as potential “bot-based/farmer” responding and were flagged (Dennis, Goodson, &
Pearson, 2018). Responses to the child’s longest sentence that were developmentally
inappropriate for a child under 42 months old to have said (e.g. “Most (but not all) toddlers can
say about 20 words by 18 months”)³ were also flagged. Two undergraduate RAs with knowledge
of child development (blind to wave and hypothesis) and the second author (blind for the first
three waves) coded all responses. A response was flagged if at least 2 of the 3 coders flagged it;
inter-coder reliability was almost perfect (Fleiss’s kappa = .82).
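The coding rule and the agreement statistic above can be illustrated as follows. This is a minimal sketch on made-up ratings, not the study's data or code; the function names are hypothetical, and each row holds the three coders' 0/1 judgments (1 = unusual comment) for one response.

```python
def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for items that were each rated by the same number of coders."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of coders who assigned item i to category j
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement for each item, then averaged over items
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance, from the marginal category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


def majority_flag(row):
    """Flag a response when at least 2 of the 3 coders flagged it."""
    return sum(row) >= 2


# Toy ratings (assumed for illustration): six responses, three coders each.
toy_ratings = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
flags = [majority_flag(row) for row in toy_ratings]
kappa = fleiss_kappa(toy_ratings)
print(flags)
print(round(kappa, 2))
```

With real data, each row would hold the three coders' judgments for one participant's free-text response, and the majority flag would feed into the "unusual comments" indicator.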
Big Five Inventory (BFI) (John et al., 1991; John & Srivastava, 1999). The BFI is a
widely used, factor-analytically derived measure of the Big Five personality traits. Participants rate
44 items on a five-point scale (1=strongly disagree; 5=strongly agree). The validity and
reliability of the BFI have been demonstrated in over 5,000 studies using a variety of samples
including MTurk. The internal consistency reliability of the BFI in MTurk samples is generally in
the mid .80s (e.g., mean α=.83-.87: Chmielewski, Sala, Tang, & Baldwin, 2016; Litman,
Robinson, & Rosenzweig, 2015).
³ See supplement for a complete list.
Analyses. To examine whether there has been an increase in low quality data (Aim 1) we
compared the frequency of participants failing validity indicators, in the prescreen and primary
data, across each wave using a chi-square test. Next, to test whether the quality of data and
results obtained using MTurk have been adversely affected (Aim 2) we evaluated the basic
psychometric properties (i.e., Cronbach’s alpha for internal consistency reliability; correlations
between scales for internal validity) of the BFI in each wave. We also tested established
associations of personality (particularly neuroticism) with psychopathology by examining the
bivariate correlations of the BFI scales with self/family history of depressive disorder and with
self/family history of other mood disorders. Finally, to determine the degree to which response
validity indicators and screening improve data quality (Aim 3) we removed participants who
failed any validity indicator, recalculated the BFI’s psychometric properties, and retested the
association with mood disorders.
Results
Prescreen Data
The wave 3 (8.6%) and 4 (7.3%) prescreens had significantly greater rates of flagged
responses than waves 1 (0.98%) and 2 (.26%), χ2 (3, N=191)=198.66, p <.001.
Primary Data
Response validity indicators. A significantly higher proportion of individuals were
flagged in the “post-concern” (wave 3=62%, wave 4=38.2%; wave 3 > 4) than “pre-concern”
waves (wave 1=10.4%, wave 2=13.8%; wave 1=2), χ2(3, N=1,231)=245.93, p<.001. Regarding
specific validity indicators, there were higher rates of response inconsistency in wave 3 (42.3%)
than waves 1 (6.92%), 2 (11.08%), or 4 (10.3%), χ2 (3, N=1231)=168.68, p <.001. No
statistically improbable responses were flagged in waves 1 and 2; significantly more were
flagged in waves 3 (12.3%) and 4 (18.6%), χ2 (3, N=1231)=113.92, p <.001. The percentage of
disqualified participants in waves 3 (34.7%) and 4 (27.2%) was also significantly greater than in
waves 1 (3.08%) and 2 (2.97%), χ2 (3, N=1231)=178.87, p<.001. More unusual comments were
flagged during waves 3 (45.3%) and 4 (29.9%) than either of the prior waves (wave 1: 2.3%;
wave 2: 1.6%), χ2 (3, N=1231)=274.33, p<.001. Taken together, these data demonstrate a
significant increase in the number of MTurk participants providing invalid or low-quality data
despite the exact same study and HIT requirements being used. It is also worth noting that the
response validity indicators moderately correlated with each other (Table 1).
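The omnibus comparison of overall flag rates across waves can be sketched as a Pearson chi-square test on a 2 x 4 table (flagged vs. not flagged, by wave). This is an illustrative sketch, not the authors' analysis script. The flagged counts below are inferred from the reported percentages and the unscreened and screened sample sizes given in the table notes (e.g., 260 - 233 = 27 for wave 1), so the statistic should land close to, but need not exactly equal, the reported value. The closed-form p-value helper is valid for 3 degrees of freedom only.

```python
import math


def chi_square_by_wave(flagged, totals):
    """Pearson chi-square for a 2 x k table of flagged vs. not flagged by wave."""
    n = sum(totals)
    p_flag = sum(flagged) / n  # pooled flag rate under the null hypothesis
    stat = 0.0
    for f, t in zip(flagged, totals):
        cells = ((f, t * p_flag), (t - f, t * (1 - p_flag)))
        for observed, expected in cells:
            stat += (observed - expected) ** 2 / expected
    return stat, len(totals) - 1


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


totals = [260, 370, 300, 301]   # unscreened N per wave
flagged = [27, 51, 186, 115]    # inferred counts failing any indicator
stat, df = chi_square_by_wave(flagged, totals)
p_value = chi2_sf_df3(stat)
print(round(stat, 2), df, p_value < .001)
```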
Table 1.
Correlation between Response Validity Indicators
Flag Reason                  1                    2                    3                    4
1. Response Inconsistency    -
2. Improbable Responses      .30*** (.23,.38)     -
3. Disqualified              .43*** (.37,.50)     .43*** (.35,.50)     -
4. Unusual Comments          .42*** (.36,.49)     .45*** (.38,.52)     .53*** (.46,.59)     -
Note: N=1231. ***p<.001. Parentheses = bootstrapped 95% confidence intervals (10,000
permutations).
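For readers unfamiliar with bootstrapped intervals like those in Table 1, the following is a minimal sketch of a percentile bootstrap for a correlation between two binary flags. The manuscript does not state which bootstrap variant was used, so the percentile method is an assumption, as are the function names and the simulated data; the demonstration uses 2,000 resamples for speed, whereas the table note reports 10,000.

```python
import random


def pearson_r(x, y):
    """Pearson correlation (equals the phi coefficient for two 0/1 variables)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_ci(x, y, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the correlation of x and y."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample participants
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # correlation undefined for a constant column; redraw
        estimates.append(pearson_r(xs, ys))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Simulated flags (assumed data): flag_b copies flag_a 80% of the time.
rng = random.Random(1)
flag_a = [int(rng.random() < 0.3) for _ in range(300)]
flag_b = [a if rng.random() < 0.8 else int(rng.random() < 0.3) for a in flag_a]
r = pearson_r(flag_a, flag_b)
lo, hi = bootstrap_ci(flag_a, flag_b, n_boot=2000)
print(round(r, 2), round(lo, 2), round(hi, 2))
```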
Impact on psychometric properties and replicating established findings. In the
unscreened data, Cronbach’s alphas (Table 2) in wave 1 (mean=.85, range=.81-.87) and wave 2
(mean=.85, range=.81-.88) were in line with the literature and past MTurk studies. However, this
was not the case in wave 3 (mean=.74, range=.71-.76) and wave 4 (mean=.75, range=.71-.80). In fact,
with the exception of openness, all alphas were significantly lower in waves 3 and 4 than they
were in waves 1 and 2;⁴ there were no significant differences between waves 3 and 4 or between
waves 1 and 2. Because Cronbach’s alpha is an indicator of how consistently participants
respond to items assessing similar content, random responding lowers alpha levels.
⁴ All p-values reported in supplement.
Next, we removed all participants who failed any validity indicator and recalculated
alphas (Table 2).⁴ For waves 1 and 2 there were no significant differences between the screened
and unscreened data. However, all alphas in the screened wave 3 data were significantly higher
than in the unscreened data. In wave 4, 60% were higher in the screened versus the unscreened
data. Moreover, only 2 of the 30 cross-wave comparisons were significant in the screened data.
These results suggest that 1) data collected during 2017 and earlier tended to be high quality
even if the data were not screened, 2) data collected from summer 2018 through spring 2019 are
less reliable, and 3) data screening may help ameliorate these issues.
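The mechanism described above, in which random responding lowers Cronbach's alpha and screening restores it, can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not a reanalysis of the study data: the 8-item scale, the response model, the 50/50 mix of attentive and random responders, and all function names are hypothetical.

```python
import random
import statistics


def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    item_vars = [statistics.variance([r[i] for r in rows]) for i in range(k)]
    total_var = statistics.variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


rng = random.Random(42)


def attentive_responder(k=8):
    """Responses driven by one latent trait plus noise, clipped to the 1-5 scale."""
    trait = rng.gauss(0, 1)
    return [min(5, max(1, round(3 + trait + rng.gauss(0, 0.7)))) for _ in range(k)]


def random_responder(k=8):
    """Responses chosen uniformly at random, ignoring item content."""
    return [rng.randint(1, 5) for _ in range(k)]


valid = [attentive_responder() for _ in range(300)]
invalid = [random_responder() for _ in range(300)]

alpha_unscreened = cronbach_alpha(valid + invalid)  # everyone retained
alpha_screened = cronbach_alpha(valid)              # random responders removed
print(round(alpha_unscreened, 2), round(alpha_screened, 2))
```

In this toy setup the random responders add item variance without adding shared variance, which is why alpha drops when they are retained.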
Table 2
Cronbach’s Alpha
Unscreened           Wave 1           Wave 2           Wave 3           Wave 4
Neuroticism          .87 (.85,.90)    .88 (.87,.90)    .76 (.71,.80)    .76 (.71,.80)
Extraversion         .88 (.85,.90)    .88 (.86,.90)    .71 (.65,.75)    .72 (.67,.77)
Openness             .81 (.77,.84)    .81 (.78,.83)    .76 (.72,.80)    .76 (.72,.80)
Agreeableness        .84 (.81,.87)    .83 (.81,.86)    .72 (.66,.76)    .71 (.66,.76)
Conscientiousness    .87 (.85,.89)    .85 (.82,.87)    .75 (.71,.79)    .80 (.76,.83)
Mean                 .85              .85              .74              .75
Screened             Wave 1           Wave 2           Wave 3           Wave 4
Neuroticism          .88 (.85,.90)    .89 (.87,.91)    .88 (.84,.91)    .85 (.82,.88)
Extraversion         .88 (.85,.90)    .88 (.86,.90)    .86 (.82,.90)    .82 (.78,.86)
Openness             .82 (.78,.85)    .81 (.77,.84)    .85 (.80,.89)    .83 (.79,.87)
Agreeableness        .84 (.81,.87)    .83 (.80,.86)    .85 (.81,.89)    .78 (.73,.83)
Conscientiousness    .87 (.84,.89)    .84 (.81,.87)    .87 (.82,.90)    .84 (.80,.87)
Mean                 .86              .85              .86              .83
Note: Parentheses = 95% confidence interval. Unscreened Wave 1 N = 260, Unscreened Wave 2
N = 370, Unscreened Wave 3 N = 300, Unscreened Wave 4 N = 301; Screened Wave 1 N = 224,
Screened Wave 2 N = 288, Screened Wave 3 N = 102, Screened Wave 4 N = 167.
Next, we evaluated the BFI inter-correlations in the unscreened data (Table 3). It is
worth noting that inter-correlations across all waves were higher than anticipated based on the
existing literature.⁵ In particular, agreeableness and conscientiousness were very highly
correlated; associations in wave 3 (r=.72) were significantly higher than waves 1 (r=.56, p<.01)
and 2 (r=.58, p<.01) but not 4 (r=.66, p=.19). Considering correlations between all BFI scales,⁶
there were no significant differences between waves 1 and 2. However, 70% of correlations
were significantly different in wave 3 compared to either wave 1 or 2; 60% differed between
waves 1 and 4, 40% between waves 2 and 4, and 30% between waves 3 and 4. We then removed
participants who failed any validity indicator and reran the analyses. Results in the screened
wave 1 and wave 2 data were virtually identical to the unscreened data. However, the screened
waves 3 and 4 data aligned more closely with waves 1 and 2. There were no significant
differences between waves 1 and 2, between waves 1 and 3, or between waves 2 and 3. 30% of
correlations were different between waves 1 and 4, 20% between waves 2 and 4, and 10%
between waves 3 and 4.⁶
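The cross-wave comparisons of correlations reported above can be sketched with Fisher's r-to-z transformation for independent samples. The text does not state which test was used, so this choice is an assumption, and the function name is hypothetical. The example uses the unscreened agreeableness-conscientiousness correlations and wave sample sizes; because the inputs are rounded to two decimals, the resulting p-values need not match the reported ones exactly.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Two-tailed z-test for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)        # Fisher r-to-z transform
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))    # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))           # two-tailed normal p-value
    return z, p


# Wave 3 (r=.72, N=300) vs. wave 1 (r=.56, N=260)
z31, p31 = compare_independent_correlations(.72, 300, .56, 260)
# Wave 3 (r=.72, N=300) vs. wave 4 (r=.66, N=301)
z34, p34 = compare_independent_correlations(.72, 300, .66, 301)
print(round(z31, 2), p31 < .01)
print(round(z34, 2), p34 > .05)
```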
Table 3
Internal validity: Correlation between BFI scales
Wave 1
    N                      E                      O                      A                      C
N   -                      -.48**₄ (-.59,-.36)    -.23** (-.36,-.10)     -.48** (-.58,-.37)     -.58**₄ (-.66,-.49)
E   -.47**₃₄ (-.58,-.35)   -                      .29** (.17,.42)        .30** (.17,.42)        .30**₄ (.17,.43)
O   -.23**₄ (-.35,-.10)    .29** (.16,.40)        -                      .38** (.26,.49)        .28** (.15,.40)
A   -.46**₃₄ (-.56,-.36)   .27**₃ (.15,.38)       .40**₃₄ (.29,.51)      -                      .53** (.42,.63)
C   -.57** (-.65,-.48)     .28**₃₄ (.16,.39)      .28**₃ (.15,.40)       .56**₃₄ (.45,.65)      -
Wave 2
    N                      E                      O                      A                      C
N   -                      -.50**₄ (-.59,-.39)    -.25** (-.37,-.13)     -.48** (-.57,-.38)     -.48** (-.56,-.40)
E   -.51**₃₄ (-.60,-.41)   -                      .31** (.21,.41)        .22** (.11,.33)        .29**₄ (.18,.38)
O   -.24**₄ (-.35,-.13)    .33** (.24,.42)        -                      .34** (.22,.45)        .33** (.22,.43)
A   -.47**₃ (-.56,-.38)    .23** (.13,.33)        .37**₃ (.26,.47)       -                      .54** (.44,.63)
C   -.47**₃ (-.54,-.39)    .29**₃₄ (.20,.38)      .36**₃₄ (.26,.46)      .58**₃ (.50,.67)       -
Wave 3
    N                      E                      O                      A                      C
N   -                      -.43** (-.61,-.23)     -.33**₄ (-.51,-.12)    -.53** (-.67,-.38)     -.56** (-.68,-.42)
E   -.28**₁₂ (-.43,-.11)   -                      .45** (.26,.61)        .18 (-.02,.36)         .21* (.02,.38)
O   -.11 (-.25,.05)        .43**₄ (.29,.55)       -                      .35** (.14,.54)        .22* (.04,.40)
A   -.60**₁₂₄ (-.70,-.49)  .09₁ (-.07,.24)        .21**₁₂ (.06,.35)      -                      .63** (.48,.76)
C   -.61**₂₄ (-.71,-.51)   .09₁₂ (-.07,.25)       .07₁₂ (-.08,.22)       .72**₁₂ (.62,.80)      -
Wave 4
    N                      E                      O                      A                      C
N   -                      -.26**₁₂ (-.42,-.10)   -.10₃ (-.26,.07)       -.43** (-.55,-.29)     -.41**₁ (-.53,-.29)
E   -.21**₁₂ (-.35,-.06)   -                      .25** (.08,.41)        .23** (.08,.37)        .06₁₂ (-.09,.21)
O   -.05₁₂ (-.18,.09)      .27**₃ (.13,.40)       -                      .37** (.24,.50)        .26** (.11,.41)
A   -.47**₁₃ (-.57,-.36)   .13* (-.01,.27)        .25**₁ (.13,.38)       -                      .59** (.47,.69)
C   -.48**₃ (-.58,-.37)    .01₁₂ (-.13,.16)       .17**₂ (.04,.30)       .66**₁ (.57,.74)       -
Note. *p<.05, **p<.01. Correlations in unscreened data are below the diagonal; correlations in
screened data are above the diagonal. Wave 1 N = 260 unscreened, 233 screened; Wave 2 N =
370 unscreened, 319 screened; Wave 3 N = 300 unscreened, 114 screened; Wave 4, N = 301
unscreened, 186 screened. Parentheses = bootstrapped 95% confidence intervals (10,000
permutations). Subscripts indicate the specific waves in which the same association is
significantly different within screened or within unscreened data, p<.05, two-tailed.
⁵ Acquiescent responding can increase correlations between BFI scales (Rammstedt & Farmer, 2013;
Soto, John, Gosling, & Potter, 2008). Controlling for acquiescence tended to decrease BFI
inter-correlations to expected levels (see supplemental data), associations with psychopathology were
generally unchanged.
⁶ All tests two-tailed, p values for all comparisons reported in supplement, results in screened data should
be interpreted considering sample sizes.
Next, we tested the well-established associations of personality with psychopathology
(Table 4). Specifically, decades of research and multiple meta-analyses (Kotov et al., 2010;
Malouff et al., 2005) demonstrate that neuroticism has substantial links to the mood and anxiety
disorders; whereas extraversion, conscientiousness, and agreeableness tend to demonstrate
weaker (negative) associations. As expected, neuroticism was significantly correlated with
depressive and other mood disorders at anticipated magnitudes⁷ in unscreened wave 1 (r=.45,
p<.01; r=.19, p<.01, respectively) and wave 2 (r=.33, p<.01; r=.15, p<.01) data. It was also
significantly associated in wave 4 (r=.22, p<.01; r=.17, p<.01). However, neuroticism was not
significantly associated with depressive (r=.10, p=.086) or other mood (r=.06, p=.274) disorders
in the unscreened wave 3 data. The association between neuroticism and depressive disorders
was significantly stronger in wave 1 than waves 3 or 4 (p<.01, two-tailed) and stronger in wave 2
than wave 3 (p<.01, two-tailed); no other differences were significant.
⁷ See preregistration.
The other Big Five traits demonstrated associations that generally aligned with the
literature in waves 1 and 2. However, in wave 3 agreeableness and conscientiousness were
significantly positively associated with depression, a finding that goes against the general
literature and is different from waves 1 (p<.01, p<.001, respectively) and 2 (p=.08, p<.001,
respectively). In wave 4 openness was significantly positively correlated with depression, which
goes against the literature and the other waves (p’s<.05). In other words, multiple findings in the
unscreened wave 3 data, and to some extent wave 4, are in contrast to decades of research on
links between the Big Five and psychopathology.
Results in the screened wave 1 and 2 data (Table 4) were nearly identical to the
unscreened data. In the screened wave 3 data, neuroticism was significantly correlated with
depressive disorders; agreeableness and conscientiousness were not. In the screened wave 4
data, extraversion was significantly negatively correlated with depressive disorders; however,
openness was unexpectedly significantly positively associated with depressive disorders. In
sum, screened waves 3 and 4 data, with the exception of wave 4 openness, were no longer
egregiously out of line with the established knowledge base.
Table 4
Correlations between BFI and mood disorders
    Unscreened                                    Screened
    Depressive Disorder    Other mood disorder    Depressive Disorder    Other mood disorder
Wave 1
N   .45**₃₄ (.35,.44)      .18** (.04,.31)        .48**₂₄ (.37,.58)      .20** (.05,.33)
E   -.22** (-.33,-.12)     -.18** (-.30,-.03)     -.24** (-.35,-.12)     -.19** (-.32,-.05)
O   -.07₄ (-.19,.05)       -.02 (-.15,.10)        -.09₄ (-.22,.05)       -.02 (-.15,.11)
A   -.13*₃ (-.25,-.03)     -.05 (-.19,.07)        -.16* (-.29,-.04)      -.06 (-.20,.07)
C   -.22**₂₃₄ (-.33,-.11)  -.03 (-.12,.06)        -.24**₃₄ (-.36,-.11)   -.04 (-.14,.06)
Wave 2
N   .33**₃ (.23,.45)       .14** (.03,.25)        .33**₁ (.22,.43)       .13* (.01,.23)
E   -.15** (-.25,-.04)     -.15** (-.27,-.04)     -.15** (-.25,-.04)     -.15** (-.27,-.03)
O   -.03₄ (-.12,.07)       .01 (-.10,.11)         -.08₄ (-.18,.03)       -.01 (-.12,.11)
A   -.02 (-.13,.08)        -.01 (-.12,.10)        -.04 (-.14,.06)        -.03 (-.14,.08)
C   -.05₁₃ (-.16,.06)      -.04 (-.16,.07)        -.07 (-.18,.03)        -.05 (-.17,.07)
Wave 3
N   .10₁₂ (-.03,.22)       .06 (-.06,.18)         .30* (.10,.48)         .11 (-.14,.34)
E   -.19** (-.32,-.04)     -.08 (-.19,.02)        -.20* (-.38,-.00)      .15 (-.31,-.00)
O   -.02₄ (-.13,.09)       .08 (-.03,.18)         .07 (-.10,.23)         .07 (-.14,.26)
A   .12**₁ (-.00,.24)      .00 (-.11,.12)         .02 (-.17,.20)         .06 (-.17,.27)
C   .15**₁₂ (.03,.28)      -.02 (-.11,.10)        .04₁ (-.16,.23)        .08₄ (-.08,.24)
Wave 4
N   .22**₁ (.10,.33)       .17** (.05,.27)        .31**₁ (.16,.45)       .22** (.07,.36)
E   -.13* (-.26,.00)       -.05 (-.18,.08)        -.15* (-.31,.01)       .07 (-.23,.10)
O   .15*₁₂₃ (.03,.26)      .04 (-.09,.16)         .18*₁₂ (.02,.32)       .01 (-.15,.17)
A   .00 (-.12,.12)         -.08 (-.19,.04)        -.02 (-.17,.13)        -.10 (-.25,.07)
C   .00₁ (-.12,.12)        -.12* (-.22,-.03)      -.03₁ (-.19,.13)       -.18*₃ (-.31,-.04)
Note. *p<.05, **p<.01. Parentheses = bootstrapped 95% confidence intervals (10,000
permutations). Wave 1 N=260 unscreened, 233 screened; Wave 2 N=370 unscreened, 319
screened; Wave 3 N=300 unscreened, 114 screened; Wave 4 N=301 unscreened, 186 screened.
Subscripts indicate the specific waves in which the same association is significantly different
within screened or within unscreened data at p<.05 two-tailed, see supplement for all p values.
Discussion
The current naturalistic experiment, in which the exact same study was conducted
multiple times over four years, supports informal concerns regarding MTurk data quality. It
provides empirical evidence of an increase in the percentage of MTurkers providing low quality
data, of a substantial negative impact on MTurk study results, and of failures to replicate well-
established findings. However, these can be mitigated, to some degree, by including validity
indicators and screening MTurk data. We discuss these findings below, providing
recommendations for MTurk research.
Low Quality Data
The percentage of participants failing at least one validity indicator in waves 3 (62%) and
4 (38.2%) is concerning, especially compared to waves 1 (10.4%) and 2 (13.8%). At a
minimum, this indicates researcher time, funds, and other resources are wasted. Our results also
suggest that the negative impact on data quality noted in summer 2018 appears to have persisted
to some degree into spring of 2019. In line with our hypothesis, wave 4 generally demonstrated
worse data quality than waves 1 and 2, but better than wave 3. Perhaps awareness of MTurk data
issues has led to Turkers being banned or more HITs being rejected, thereby disincentivizing
farmers/bots somewhat; alternatively, farmers, bots, and other invalid responders may have
become more sophisticated at avoiding detection (Sylaska & Mayer, 2019).
Impact on Study Results
Given the tendency for researchers to not report using validity indicators and the
prevalence of studies using unscreened data (Wood et al., 2017), results in the unscreened wave
3 and 4 data are alarming. It is particularly concerning that well-established associations
between personality and psychopathology failed to replicate in the unscreened wave 3 data and
anomalies existed in unscreened wave 4. Moreover, the current results are important for
measurement research which often uses MTurk to achieve the necessary sample sizes. Clearly,
the use of unscreened data could lead to improper scale development decisions or inaccurate
conclusions about performance of existing measures. This is critical as optimal measurement is
essential for the continued advancement of science and poor measurement has been suggested as
one cause of the replication crisis (Chmielewski et al., 2016; Flake, Pek, & Hehman, 2017).
Taken together, our results suggest that starting sometime after spring 2017 and continuing
through at least April 2019, the use of unscreened MTurk data may have detrimental impacts on
study outcomes and conclusions.
Data collection. We echo past calls for including validity indicators (Aust et al., 2013;
Barger et al., 2011; Kennedy, Clifford, Burleigh, Jewell, & Waggoner, 2018; Kittur et al., 2008;
Mason & Suri, 2011; Wood et al., 2017; Zhu & Carterette, 2010) and recommend screening all
responses before approving HITs so likely bots, farmers, and other invalid responders are
rejected. This will require constant management of studies to pay legitimate Turkers in a timely
manner; however, considerable funds are wasted when paying invalid responders, and rejecting
invalid responders will reduce their rating, making it less likely they qualify for other studies and
preventing reinforcement. Researchers can also create qualification block lists to prevent
Turkers who fail validity indicators from participating in their future studies. However, it is
critical that Turkers who provide valid data do not have their HIT rejected unjustly; doing so is
unethical and unfair. One solution is a two-tier screening approach: obvious bots and farmers are
rejected; less obvious cases are approved though removed from the final dataset.
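
The two-tier approach described above can be sketched as a simple decision rule. This is a minimal illustration only: the `Response` record, the `screen` function, and the `reject_at` cut point are hypothetical, and defining "obvious" cases by the number of failed validity indicators is an assumption rather than a rule taken from the current research.

```python
from dataclasses import dataclass


@dataclass
class Response:
    worker_id: str
    failed_indicators: int  # number of validity indicators this response failed


# Hypothetical cut point for "obvious" invalid responding (an assumption).
REJECT_AT = 3


def screen(response: Response, reject_at: int = REJECT_AT) -> tuple[bool, bool]:
    """Return (approve_hit, keep_in_dataset) for one response."""
    if response.failed_indicators >= reject_at:
        # Tier 1: obvious bot/farmer -> reject the HIT and drop the data.
        return (False, False)
    if response.failed_indicators >= 1:
        # Tier 2: less obvious case -> pay the worker but drop the data.
        return (True, False)
    # No indicators failed -> pay the worker and keep the data.
    return (True, True)
```

Under this rule a worker is never left unpaid on ambiguous evidence; only the analytic dataset is affected in the second tier.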
We also recommend using multiple types of validity indicators; the included indicators
were moderately correlated with each other, suggesting they all tap into related aspects of
low-quality responding. The current results also provide initial support for the “unusual comments”
validity indicator. Interestingly, many of the unusual comments were phrases that appear when
one googles the question asked. In addition to the validity indicators included, recent research
indicates that response time in seconds per item (SPI) and profile correlations are important
validity indicators (Wood et al., 2017); SPI in particular is easily implemented and places no
burden on participants. Indeed, the lack of SPI is one limitation of the current research as it
began prior to SPI’s publication/validation. Although not reported, SPI was included in waves 4
and 4a; total study completion time (which lacks precision) was available in the other waves. All
timing data significantly correlated with the other validity indicators (see supplemental
materials). In addition, captchas and honey pots (computer code invisible to humans) may help
eliminate bots, although care should be taken to not overwhelm participants.
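
As a rough illustration of how SPI might be computed and used for screening, the sketch below divides the time spent on a questionnaire by its number of items and flags very fast responders. The function names and the `min_spi` threshold of 1.0 second are hypothetical placeholders, not validated cut points; an appropriate threshold should be taken from the validation literature or established empirically.

```python
def seconds_per_item(total_seconds: float, n_items: int) -> float:
    """Average response time per item for one participant on one questionnaire."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return total_seconds / n_items


def flag_fast_responders(spi_by_worker: dict[str, float],
                         min_spi: float = 1.0) -> list[str]:
    """Return worker IDs whose SPI falls below min_spi (a hypothetical cut point)."""
    return sorted(w for w, spi in spi_by_worker.items() if spi < min_spi)
```

For example, a participant who completes a 60-item measure in 120 seconds has an SPI of 2.0 and would not be flagged at this placeholder threshold.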
It is worth noting that traditional attention check items (e.g., select “agree” for this item),
which were not included in the current research, may no longer catch participants who provide invalid
data (Sylaska & Mayer, 2019). Moreover, Qualtrics recently recommended against using
attention checks due to evidence that they result in participants providing lower quality data
(Vannette, 2017). Similarly, instructional manipulation checks (IMCs), which were previously
recommended as a way to ensure high quality data (see Oppenheimer, Meyvis, & Davidenko,
2009), have been found to alter participants’ responses in potentially problematic ways (see
Hauser & Schwarz, 2015). IMCs also remove actual participants with certain characteristics
(e.g., those lower in conscientiousness, individuals with lower cognitive ability, particular age
groups, specific demographic groups; see Berinsky, Margolis, & Sances, 2014; Vannette, 2016)
thereby biasing samples; this has led Qualtrics to recommend against their use (Vannette,
2017). Relatedly, it is essential to recognize that overly stringent criteria (e.g., dropping
participants who only miss 10% of attention checks) or other burdensome validity tasks may also
bias samples. Such bias becomes particularly problematic when these individual
differences are the constructs of interest, as screening then restricts their range and can remove the exact
participants researchers are interested in studying. As such, it is essential to balance screening
with the potential for creating biased/unrepresentative samples.
Clearly, more research on the performance of specific validity indicators, what screening
“cut points” or algorithms should be used, and how to ensure validity tasks or HIT requirements
do not bias the sample is necessary. In addition, the development of validity scales for online
samples could prove useful. Other options, such as third-party MTurk services claiming to
eliminate low quality data, tracking geolocations/IP addresses, incentivizing participants
providing high quality data (Barger et al., 2011), programs tracking whether participants are “on
task” (Permut, Fisher, & Oppenheimer, 2019), and other online sample sources (e.g., Prolific)
may offer additional tools for researchers.
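
One way geolocation tracking could be operationalized is to flag responses that share identical coordinates. The sketch below is illustrative only: the function name, the coordinate format, and the `max_repeats` parameter are assumptions, and because shared coordinates may have legitimate causes, flagged cases are best treated as candidates for closer review rather than as automatic exclusions.

```python
from collections import Counter


def flag_repeated_locations(coords_by_worker: dict[str, tuple[float, float]],
                            max_repeats: int = 1) -> list[str]:
    """Return worker IDs whose (latitude, longitude) pair occurs more than
    max_repeats times in the sample. max_repeats is a hypothetical setting."""
    counts = Counter(coords_by_worker.values())
    return sorted(worker for worker, coords in coords_by_worker.items()
                  if counts[coords] > max_repeats)
```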
Publishing and evaluating research. In line with open science, we strongly recommend
authors report, and reviewers request, detailed information for MTurk studies, such as dates the
data were collected, HIT requirements, validity indicators, screening decisions, and number of
participants dropped. Researchers should also report the psychometric properties of the
measures in the studied sample and compare them to previous research, when available, as this
provides valuable information about the performance of the measure and the quality of data itself
(Chmielewski, Clark, Bagby, & Watson, 2015).
Limitations. Although researchers often do not report HIT requirements, it is
important to note that the current research used HIT qualifications that were less stringent than
previous recommendations (Peer, Vosgerau, & Acquisti, 2014), which could have impacted the
generalizability of our results. However, the percentage of participants failing validity indicators
in waves 3 and 4 is similar to informal reports and recent publications (Aruguete et al., 2019;
Courrégé, Skeel, Feder, & Boress, 2019; Dreyfuss, 2018). Moreover, results were nearly
identical in an additional wave (4a) of data collected using HIT requirements commonly reported
in the literature. As such, higher HIT qualifications across all waves may have slightly reduced
percentages of low-quality data, but the general pattern and findings would likely remain.
Nevertheless, replicating the current pattern of results in studies using higher HIT requirements
(Peer et al., 2014) is important.
Conclusion
MTurk has been an important resource for psychological science. Nevertheless, there is
compelling evidence of a decrease in MTurk data quality, which can have a substantial negative
impact on study results and conclusions. Even if the current crisis passes, similar issues may
arise again (Kennedy et al., 2018). Therefore, to ensure the continued advancement of science
and integrity of online studies, thoughtful data screening and detailed reporting of screening and
study designs must be the standard operating procedure.
References
Aruguete, M. S., Huynh, H., Browne, B. L., Jurs, B., Flint, E., & McCutcheon, L. E. (2019).
How serious is the ‘carelessness’ problem on Mechanical Turk? International Journal of
Social Research Methodology, 22(5), 441–449.
https://doi.org/10.1080/13645579.2018.1563966
Aust, F., Diedenhofen, B., Ullrich, S., & Musch, J. (2013). Seriousness checks are useful to
improve data validity in online research. Behavior Research Methods, 45(2), 527–535.
Bagby, R. M., Young, L. T., Schuller, D. R., Bindseil, K. D., Cooke, R. G., Dickens, S. E., …
Joffe, R. T. (1996). Bipolar disorder, unipolar depression and the Five-Factor Model of
personality. Journal of Affective Disorders, 41(1), 25–32.
Bai, H. (2018). Evidence that a large amount of low quality responses on MTurk can be detected
with repeated GPS coordinates. Retrieved February 4, 2019, from Sights + Sounds
website: http://www.maxhuibai.com/1/post/2018/08/evidence-that-responses-from-
repeating-gps-are-random.html
Barger, P., Behrend, T. S., Sharek, D. J., & Sinar, E. F. (2011). IO and the crowd: Frequently
asked questions about using Mechanical Turk for research. The Industrial-Organizational
Psychologist, 11.
Behrend, T. S., Sharek, D. J., Meade, A. W., & Wiebe, E. N. (2011). The viability of
crowdsourcing for survey research. Behavior Research Methods, 43(3), 800.
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source
of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1), 3–5.
Buhrmester, M., Talaifar, S., & Gosling, S. D. (2018). An evaluation of Amazon’s Mechanical
Turk, its rapid rise, and its effective use. Perspectives on Psychological Science, 13(2),
149–154.
Casler, K., Bickel, L., & Hackett, E. (2013). Separate but equal? A comparison of participants
and data gathered via Amazon’s MTurk, social media, and face-to-face behavioral
testing. Computers in Human Behavior, 29(6), 2156–2160.
Chmielewski, M., Clark, L. A., Bagby, R. M., & Watson, D. (2015). Method matters:
Understanding diagnostic reliability in DSM-IV and DSM-5. Journal of Abnormal
Psychology, 124(3), 764.
Chmielewski, M., Sala, M., Tang, R., & Baldwin, A. (2016). Examining the construct validity of
affective judgments of physical activity measures. Psychological Assessment, 28(9),
1128–1141. https://doi.org/10.1037/pas0000322
Clark, L. A., & Watson, D. (1991). Tripartite model of anxiety and depression: Psychometric
evidence and taxonomic implications. Journal of Abnormal Psychology, 100(3), 316–
336. https://doi.org/10.1037/0021-843X.100.3.316
Courrégé, S. C., Skeel, R. L., Feder, A. H., & Boress, K. S. (2019). The ADHD Symptom
Infrequency Scale (ASIS): A novel measure designed to detect adult ADHD simulators.
Psychological Assessment, 31(7), 851–860.
Dennis, S. A., Goodson, B. M., & Pearson, C. (2018). MTurk Workers’ Use of Low-Cost
“Virtual Private Servers” to Circumvent Screening Methods: A Research Note (SSRN
Scholarly Paper No. ID 3233954). Retrieved from Social Science Research Network
website: https://papers.ssrn.com/abstract=3233954
Dreyfuss, E. (2018, August 17). A bot panic hits Amazon’s Mechanical Turk. Wired. Retrieved
from https://www.wired.com/story/amazon-mechanical-turk-bot-panic/
Eriksson, K., & Simpson, B. (2010). Emotional reactions to losing explain gender differences in
entering a risky lottery. Judgment and Decision Making, 5(3), 159–163.
Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality
research: Current practice and recommendations. Social Psychological and Personality
Science, 8(4), 370–378.
Goodman, J. K., Cryder, C. E., & Cheema, A. (2013). Data collection in a flat world: The
strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision
Making, 26(3), 213–224. https://doi.org/10.1002/bdm.1753
Hauser, D. J., & Schwarz, N. (2015). It's a trap! Instructional manipulation checks prompt
systematic thinking on “tricky” tasks. Sage Open, 5(2), 2158244015584617.
Horton, J. J., Rand, D. G., & Zeckhauser, R. J. (2011). The online laboratory: Conducting
experiments in a real labor market. Experimental Economics, 14(3), 399–425.
John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The big five inventory—versions 4a and 54.
Berkeley, CA: Berkeley Institute of Personality and Social Research, University of
California.
John, O. P., & Srivastava, S. (1999). The Big Five Trait taxonomy: History, measurement, and
theoretical perspectives. In Handbook of personality: Theory and research (2nd ed.). (pp.
102–138). New York, NY, US: Guilford Press.
Kees, J., Berry, C., Burton, S., & Sheehan, K. (2017). An analysis of data quality: Professional
panels, student subject pools, and Amazon’s Mechanical Turk. Journal of Advertising,
46(1), 141–155.
Kennedy, R., Clifford, S., Burleigh, T., Jewell, R., & Waggoner, P. (2018). The Shape of and
Solutions to the MTurk Quality Crisis (SSRN Scholarly Paper No. ID 3272468).
Retrieved from Social Science Research Network website:
https://papers.ssrn.com/abstract=3272468
Kittur, A., Chi, E. H., & Suh, B. (2008). Crowdsourcing user studies with Mechanical Turk.
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
Florence, Italy. 453–456. DOI: 10.1145/1357054.1357127
Kotov, R., Gamez, W., Schmidt, F., & Watson, D. (2010). Linking “big” personality traits to
anxiety, depressive, and substance use disorders: A meta-analysis. Psychological
Bulletin, 136(5), 768–821. https://doi.org/10.1037/a0020327
Krueger, R. F., & Tackett, J. L. (Eds.). (2006). Personality and psychopathology. New York:
The Guilford Press.
Malouff, J. M., Thorsteinsson, E. B., & Schutte, N. S. (2005). The relationship between the Five-
Factor Model of personality and symptoms of clinical disorders: A meta-analysis.
Journal of Psychopathology and Behavioral Assessment, 27(2), 101–114.
https://doi.org/10.1007/s10862-005-5384-y
Marge, M., Banerjee, S., & Rudnicky, A. I. (2010). Using the Amazon Mechanical Turk for
transcription of spoken language. IEEE International Conference on Acoustics, Speech
and Signal Processing, Dallas, TX, 5270–5273. DOI: 10.1109/ICASSP.2010.5494979
Mason, W., & Suri, S. (2011). Conducting behavioral research on Amazon’s Mechanical Turk.
Behavior Research Methods, 44(1), 1–23. https://doi.org/10.3758/s13428-011-0124-6
Mason, W., & Watts, D. J. (2009). Financial incentives and the performance of crowds.
Proceedings of the ACM SIGKDD Workshop on Human Computation, Paris, France,
77–85. DOI: 10.1145/1600150.1600175
McCreadie, R. M., Macdonald, C., & Ounis, I. (2010). Crowdsourcing a news query
classification dataset. CSE, Geneva, Switzerland, 31–38.
Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks:
Detecting satisficing to increase statistical power. Journal of Experimental Social
Psychology, 45(4), 867–872.
Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a
participant pool. Current Directions in Psychological Science, 23(3), 184–188.
Peer, E., Vosgerau, J., & Acquisti, A. (2014). Reputation as a sufficient condition for data quality
on Amazon Mechanical Turk. Behavior Research Methods, 46(4), 1023–1031.
Permut, S., Fisher, M., & Oppenheimer, D. M. (2019). TaskMaster: A tool for determining when
subjects are on task. Advances in Methods and Practices in Psychological Science, 2(2),
188–196. https://doi.org/10.1177/2515245919838479
Rammstedt, B., & Farmer, R. F. (2013). The impact of acquiescence on the evaluation of
personality structure. Psychological Assessment, 25(4), 1137.
Shapiro, D. N., Chandler, J., & Mueller, P. A. (2013). Using Mechanical Turk to study clinical
populations. Clinical Psychological Science, 2167702612469015.
Sheehan, K. B. (2018). Crowdsourcing research: Data collection with Amazon’s Mechanical
Turk. Communication Monographs, 85(1), 140–156.
Soto, C. J., John, O. P., Gosling, S. D., & Potter, J. (2008). The developmental psychometrics of
big five self-reports: Acquiescence, factor structure, coherence, and differentiation from
ages 10 to 20. Journal of Personality and Social Psychology, 94(4), 718.
Stewart, N., Ungemach, C., Harris, A. J., Bartels, D. M., Newell, B. R., Paolacci, G., &
Chandler, J. (2015). The average laboratory samples a population of 7,300 Amazon
Mechanical Turk workers. Judgment and Decision Making, 10(5), 479–491.
Stokel-Walker, C. (2018, October 1). Bots on Amazon’s Mechanical Turk are ruining
psychology studies. Retrieved February 4, 2019, from New Scientist website:
https://www.newscientist.com/article/2176436-bots-on-amazons-mechanical-turk-are-
ruining-psychology-studies/
Suri, S., & Watts, D. J. (2011). Cooperation and contagion in web-based, networked public
goods experiments. PloS One, 6(3), e16836.
Sylaska, K., & Mayer, J. D. (2019, June 28). It’s 2019: Do We Need Super Attention Check Items
to Conduct Web-Based Survey Research? The Evolution of MTurk Survey Respondents.
Presented at the Association for Research in Personality, Grand Rapids MI.
U.S. Census Bureau. (2018). Historical Households Tables, Households by size. Retrieved
February 4, 2019, from https://www.census.gov/data/tables/time-
series/demo/families/households.html
Vannette, D. (2017, June 29). Using attention checks in your surveys may harm data quality.
Retrieved July 18, 2019, from Qualtrics website: https://www.qualtrics.com/blog/using-
attention-checks-in-your-surveys-may-harm-data-quality/
Widiger, T. A., & Trull, T. J. (2007). Plate tectonics in the classification of personality disorder:
Shifting to a dimensional model. American Psychologist, 62(2), 71–83.
https://doi.org/10.1037/0003-066X.62.2.71
Wood, D., Harms, P. D., Lowman, G. H., & DeSimone, J. A. (2017). Response speed and
response consistency as mutually validating indicators of data quality in online samples.
Social Psychological and Personality Science, 8(4), 454–464.
https://doi.org/10.1177/1948550617703168
Zhu, D., & Carterette, B. (2010). An analysis of assessor behavior in crowdsourced preference
judgments. Presented at the SIGIR 2010 Workshop on Crowdsourcing for Search
Evaluations, Geneva, Switzerland, 17–20.