Accepted for publication (August, 2019) in Social Psychological and Personality Science
An MTurk crisis? Shifts in data quality and the impact on study results.
Michael Chmielewski; Southern Methodist University
Michael Chmielewski is an Associate Professor of Psychology at Southern Methodist University.
His research focuses on measurement, construct validity, and quantitative models of individual
differences in personality and psychopathology.
Sarah C. Kucker; Oklahoma State University
Sarah Kucker is an Assistant Professor of Psychology at Oklahoma State University. Her
research examines individual differences in the cognitive mechanisms of language and category
development.
Acknowledgements: This research was supported by a Postdoctoral Fellowship from the Callier
Center for Communication Disorders at the University of Texas Dallas and a Faculty Small
Grant at the University of Wisconsin Oshkosh, both to SCK. We thank Jennifer Burson and
Aaron Bagley.
Manuscript Word Count: 4950
Abstract
Amazon’s Mechanical Turk (MTurk) is arguably one of the most important research tools of the
past decade. The ability to rapidly collect large amounts of high-quality human subjects data has
advanced multiple fields, including personality and social psychology. Beginning in summer
2018, concerns arose regarding MTurk data quality leading to questions about the utility of
MTurk for psychological research. We present empirical evidence of a substantial decrease in
data quality, using a four-wave naturalistic experimental design: pre-, during, and post- summer
2018. During, and to some extent post-summer 2018, we find significant increases in participants
failing response validity indicators, decreases in reliability and validity of a widely used
personality measure, and failures to replicate well-established findings. However, these
detrimental effects can be mitigated by using response validity indicators and screening the data.
We discuss implications and offer suggestions to ensure data quality.
Abstract word count: 142
Keywords: Amazon Mechanical Turk, TurkGate, data quality, replication, online samples
An MTurk crisis? Shifts in data quality and the impact on study results
Amazon’s Mechanical Turk (MTurk) launched in 2005 as a crowdsourcing marketplace
allowing individuals (Turkers) to complete human intelligence tasks (HITs). Recent years have
seen an exponential increase in psychological studies being conducted on MTurk (Buhrmester,
Talaifar, & Gosling, 2018), likely because MTurk is an efficient method for collecting large
amounts of data (Buhrmester, Kwang, & Gosling, 2011; Buhrmester et al., 2018; Stewart et al.,
2015). Moreover, there has been considerable evidence that MTurk data are equivalent or
superior in quality to data collected in the lab, from professional online panels, and using
marketing research companies (Behrend, Sharek, Meade, & Wiebe, 2011; Buhrmester et al.,
2011; Casler, Bickel, & Hackett, 2013; Kees, Berry, Burton, & Sheehan, 2017; Paolacci &
Chandler, 2014). This has been demonstrated across a variety of study designs and numerous
types of data (Behrend et al., 2011; Buhrmester et al., 2011; Casler et al., 2013; Eriksson &
Simpson, 2010; Goodman, Cryder, & Cheema, 2013; Horton, Rand, & Zeckhauser, 2011; Kees
et al., 2017; Mason & Watts, 2009; Paolacci & Chandler, 2014; Shapiro, Chandler, & Mueller,
2013; Suri & Watts, 2011). MTurk samples are also more representative of the general
population than student samples (Buhrmester et al., 2011; Goodman et al., 2013). Furthermore,
compensation level appears to have no impact on data quality (Buhrmester et al., 2011; Marge,
Banerjee, & Rudnicky, 2010; Mason & Watts, 2009). Indeed, the cost effectiveness of MTurk
allows for a more diverse range of researchers, including those from smaller, less-funded
universities, to provide meaningful contributions to psychological research.
There are longstanding recommendations to include response validity indicators when
using MTurk to remove invalid or low quality data (Barger, Behrend, Sharek, & Sinar, 2011;
Kittur, Chi, & Suh, 2008; Mason & Suri, 2011; Zhu & Carterette, 2010). However, such data
screening attempts are rarely reported in the literature (Wood, Harms, Lowman, & DeSimone,
2017). This may be because prior studies led researchers to believe data screening is
unnecessary and that any MTurk data will be high quality. In fact, Sheehan (2018, p. 8) recently
concluded, “there is minimal cheating on MTurk.”
However, starting in summer 2018, social media and online discussions emerged
expressing concerns about “bots” (computer programs that automatically complete HITs) and/or
“farmers” (individuals using server farms to bypass MTurk location restrictions) on MTurk.
Bots had received minimal attention in the literature (McCreadie, Macdonald, & Ounis, 2010)
and farmers were previously unheard of with little evidence that either constituted a problem.
These new concerns raised alarms about fundamental shifts in MTurk data quality (Stokel-
Walker, 2018) and declarations of a “bot panic” (Dreyfuss, 2018). However, others suggested
the apparent shift in data quality is illusory and the result of poorly designed studies (Dreyfuss,
2018). Although potentially informative, social media or blog posts are not systematic empirical
evaluations. To date, there have been no peer-reviewed empirical studies testing whether there has
been an increase in invalid or poor-quality responses, and no examination of whether the results
of studies conducted on MTurk have been negatively affected.
Current Study
The aims of the current research are to 1) examine whether the rate of participants providing
low-quality data increased in summer 2018 and, if so, whether the increase has persisted, 2) test if results using
MTurk are negatively affected, and 3) determine the degree to which including response validity
indicators and screening data improves data quality. A naturalistic experiment, with the exact
same study conducted before and after concerns about MTurk data quality emerged, removes
new study flaws as the primary driver of purported issues. Including a widely used measure with
strong psychometric properties enables comparisons with established psychometric baselines,
testing of potential differences between data collected prior to and following concerns, and
examination of the impact of data screening. Finally, testing well-established findings is another
way to determine if there has been a change in MTurk data quality. In fact, these approaches are
how researchers originally established the high quality of MTurk data (Behrend et al., 2011;
Buhrmester et al., 2011; Casler et al., 2013; Goodman et al., 2013; Kees et al., 2017; Mason &
Watts, 2009; Shapiro et al., 2013).
In the current research we re-analyzed data collected for a study on parental personality
and its role in child development (redacted, submitted for publication). That study was collected
in three waves coinciding with funding awards to the second author. Waves 1 and 2 were
conducted prior to, and wave 3 during, the period when concerns regarding MTurk surfaced
(summer 2018). The original exploratory analysis with waves 1-3 revealed differences in data
quality. Therefore, we collected a new 4th wave of data in spring 2019, hypothesizing that wave 4
would fall between the first two waves and wave 3 in terms of data quality (the specific hypothesis,
analyses, and interpretation of results were preregistered at
https://osf.io/4qgz6/?view_only=5df995d863714c0d841a4793a9306a96). Critically, the exact
same study was conducted at each wave.
First, we report on the frequency of failed validity indicators in a pre-screen survey and in
each wave of the primary study. Then, in each wave, we examine the reliability and the internal
validity of the Big Five Inventory (BFI; John, Donahue, & Kentle, 1991) and test whether well-
established associations with psychopathology replicate (Bagby et al., 1996; Clark & Watson,
1991; Kotov, Gamez, Schmidt, & Watson, 2010; Krueger & Tackett, 2006; Malouff,
Thorsteinsson, & Schutte, 2005; Widiger & Trull, 2007). Given most studies do not report
screening MTurk data (Wood et al., 2017), analyses were first conducted using all participants.
Then, in order to test the impact of screening, we remove participants who failed response
validity indicators and reanalyze the data.
Methods
Procedure
Waves 1 (December 2015 - January 2016) and 2 (March 2017 - May 2017) were
conducted prior to MTurk data quality concerns. Wave 3 (July 2018 - September 2018) was
conducted during, and Wave 4 (April 2019) following, the period when concerns emerged. The exact same
study was conducted at each wave (i.e., identical Qualtrics surveys: screening questions,
demographic questions, measures; and recruitment: HIT posting, Turker requirements,
compensation, researcher account). The only differences were the dates and the consent wording
(pursuant to the second author’s affiliation at the time). In order to recruit enough parents of 8- to
42-month-olds for the original study, HIT requirements were set to an approval rate of 85%, 50
approved HITs, and a US location. Although it is common for studies to not report their HIT
requirements, the current requirements were slightly lower than those that typically are reported. Therefore,
to examine the potential impact of lower HIT requirements, we collected an additional wave
(Wave 4a; N=110) immediately after wave 4 using requirements often used in the literature
(90% approval rate, ≥100 approved HITs). Results were very similar to wave 4 (see footnote 1). We examine
data from waves 1-4 separately to allow for comparisons between waves and to determine
whether established psychometric properties and empirical findings from the broader literature
replicate.
Footnote 1: 34% of participants failed validity checks in wave 4a compared to 38% in wave 4, χ2(1, N=411)=.722,
p=.396; comparisons of 4a to all other waves were the same as they were for wave 4. Other results in wave
4a were similar to wave 4; results for wave 4a are reported in the online supplemental materials
(https://osf.io/t957e/?view_only=843912cfc81f4f34a678d75c7551014f).
Participants
Prescreen. 5,266 participants (wave 1=1128; wave 2=1899; wave 3=1006; wave
4=1233) responded to the prescreen. Eligible participants were automatically forwarded to the
primary study.
Primary study. 1,338 participants completed the primary study; however, 107 within-
wave duplicate MTurk IDs were dropped, which occurred despite Qualtrics’ ballot box stuffing
prevention being enabled. Turkers were not restricted from participating across waves, although less
than 2% of any wave had participated in prior waves. The final sample consisted of 1,231
participants (M age=31.7 years, SD=6.17; 57% female; 79.0% White, 13.8% Black,
5.0% Asian, 2.2% mixed/other; median education, 4-year college degree). This sample size was
appropriate for the original study, which sought to recruit equal numbers of parents of children in
three age groups. Parents completed demographic and family history questions, a child
vocabulary inventory, a child temperament measure, and self-report personality measures. We
focus on the latter (the BFI) because it is the most well-validated measure and has strong
psychometric properties in MTurk samples (see footnote 2). 260 individuals participated during wave 1, 370 in
wave 2, 300 in wave 3, and 301 in wave 4. This equates to 80% power to detect a correlation of
r=.16 in the smallest wave.
Footnote 2: A subset of participants also completed the IPIP-NEO; although results were similar, the sample size
was too small to justify inclusion. The child measures are not as established, are rarely used on MTurk, and
the specific measure varied depending on the child’s age, resulting in small samples for each measure at
each wave.
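To make the duplicate-handling step concrete, the sketch below shows one way within-wave duplicate MTurk IDs could be identified with pandas. It is illustrative only: the file name and column names (worker_id, wave) are assumptions, not the variables used in the original study.

```python
import pandas as pd

# Hypothetical input: one row per submission, with the worker's MTurk ID and the wave number.
responses = pd.read_csv("primary_study_responses.csv")

# Keep each worker's first submission within a wave; later within-wave duplicates are dropped.
deduped = responses.drop_duplicates(subset=["wave", "worker_id"], keep="first")

n_dropped = len(responses) - len(deduped)
print(f"Dropped {n_dropped} within-wave duplicate MTurk IDs; {len(deduped)} participants remain.")
```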
Measures
Prescreen. Turkers who indicated they were English-speaking parents were asked a third
question, used as a validity indicator here, where they entered their number of children in each of
nine age ranges, in roughly two-year increments. Responses that were highly improbable or
statistically illogical (i.e., >10 children, >4 children of the same age) were flagged because only
1.3% of the U.S. population meets such criteria (U.S. Census Bureau, 2018).
Primary study. Demographic and psychopathology questions. Individuals completed
demographic questions, questions about their children (e.g. number, ages, birthdates), self/family
history of language disorders, and self/family history of mood disorders (“Do you or any of your
immediate family members currently have or have a history of any of the following?:
Depressive Disorder [e.g., Depression, Dysphoria, Dysthymia]; other mood disorder”). These
mood disorder items are particularly relevant because of their well-documented links with
personality.
Response validity indicators. We examined three traditional validity indicators and one exploratory
indicator based on recommendations from the literature and MTurk forums.
Traditional.
Response inconsistency. Participants were flagged if they failed to consistently report
their child’s age (i.e., age in years/months, child’s birthdate, and selection of the child’s age
range from a drop-down menu) on at least two of three different Qualtrics pages approximately
10 minutes apart.
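A minimal sketch of this inconsistency check is given below, assuming the three age reports have been converted to months; the three-month tolerance and the function names are illustrative assumptions rather than the authors' exact scoring rules.

```python
from datetime import date

TOLERANCE_MONTHS = 3  # illustrative tolerance; the paper does not report the exact criterion

def months_between(earlier: date, later: date) -> int:
    """Whole months elapsed between two dates."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def flag_inconsistent_age(age_in_months: int, birthdate: date, survey_date: date,
                          dropdown_range: tuple) -> bool:
    """Flag a participant whose three age reports (direct entry, birthdate,
    drop-down age range) do not agree within the tolerance."""
    age_from_birthdate = months_between(birthdate, survey_date)
    low, high = dropdown_range  # e.g., (24, 30) months

    direct_vs_birthdate = abs(age_in_months - age_from_birthdate) > TOLERANCE_MONTHS
    direct_vs_dropdown = not (low - TOLERANCE_MONTHS <= age_in_months <= high + TOLERANCE_MONTHS)
    birthdate_vs_dropdown = not (low - TOLERANCE_MONTHS <= age_from_birthdate <= high + TOLERANCE_MONTHS)

    # Flag when the reports disagree on any of the three pairwise checks.
    return direct_vs_birthdate or direct_vs_dropdown or birthdate_vs_dropdown

# Example: a reported 11-month-old whose birthdate implies roughly 26 months is flagged.
print(flag_inconsistent_age(11, date(2016, 2, 1), date(2018, 4, 15), (24, 30)))  # True
```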
Statistically improbable responses. Participants entered the number of children they had
within several different age brackets (like the prescreen) and selected disorders their child had
from a list of ten. Statistically improbable numbers of children (>10 total children, >4 of a single
age) or improbable combinations of disorders (e.g., both Williams and Down syndrome; >4
disorders) were flagged.
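These rules are simple to express in code. The sketch below applies the thresholds described above to hypothetical inputs; the example contradictory diagnosis pair mirrors the one in the text, and the variable names are assumptions.

```python
# Thresholds follow the rules described above (>10 total children, >4 of a single age,
# >4 disorders, or a contradictory diagnosis pair); variable names are illustrative.
IMPOSSIBLE_PAIRS = [{"Williams syndrome", "Down syndrome"}]  # example contradictory combination

def flag_improbable(children_per_age_bracket: list, reported_disorders: set) -> bool:
    too_many_children = sum(children_per_age_bracket) > 10
    too_many_same_age = any(count > 4 for count in children_per_age_bracket)
    too_many_disorders = len(reported_disorders) > 4
    contradictory = any(pair <= reported_disorders for pair in IMPOSSIBLE_PAIRS)
    return too_many_children or too_many_same_age or too_many_disorders or contradictory

# A respondent reporting 14 children across the age brackets would be flagged.
print(flag_improbable([2, 5, 3, 4, 0, 0, 0, 0, 0], set()))  # True
```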
Disqualified. Participants who indicated that they were bilingual or did not have children
in the required age range were flagged because they must have provided different prescreen
answers to these questions, indicating random responding or repeated prescreen attempts.
Exploratory.
Unusual comments. Social media posts specifically noted an increase in unusual
responses to open-ended questions (Bai, 2018). Therefore, we examined two free response
questions: “What are the longest 2-3 sentences your child has produced?” and “Do you have any
final comments/questions?” Responses written in all capital letters, single words that did not
align with the question (e.g. “nice”, “good”), noticeably ungrammatical or nonsense phrases,
and/or phrases that repeated portions of the question (e.g. “yes longest phrase”) have been
reported as potential bot-based/“farmer” responding and were flagged (Dennis, Goodson, &
Pearson, 2018). Responses to the child’s longest sentence that were developmentally
inappropriate for a child under 42 months old to have said (e.g. “Most (but not all) toddlers can
say about 20 words by 18 months”; see footnote 3) were also flagged. Two undergraduate RAs with knowledge
of child development (blind to wave and hypothesis) and the second author (blind for the first
three waves) coded all responses. Responses were flagged if 2 out of 3 coders flagged them; inter-
coder reliability was almost perfect (Fleiss’s kappa = .82).
Footnote 3: See the supplement for a complete list.
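For readers who want to reproduce this kind of agreement statistic, Fleiss's kappa for three coders making binary flag judgments can be computed with statsmodels, as sketched below on toy codes (the real data would have one row per free-response answer).

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical 0/1 codes (flag vs. no flag) from the three coders for a handful of responses.
codes = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
])

# Convert the response-by-coder codes into the response-by-category count table kappa expects.
table, _ = aggregate_raters(codes)
print(round(fleiss_kappa(table, method="fleiss"), 2))

# Majority rule used here: a response is flagged if at least 2 of the 3 coders flagged it.
flags = codes.sum(axis=1) >= 2
```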
Big Five Inventory (BFI) (John et al., 1991; John & Srivastava, 1999). The BFI is a
widely used, factor-analytically derived measure of the Big Five personality traits. Participants rate
44 items on a five-point scale (1=strongly disagree; 5=strongly agree). The validity and
reliability of the BFI have been demonstrated in over 5,000 studies using a variety of samples,
including MTurk. The internal consistency reliability of the BFI in MTurk samples is generally in
the mid .80s (e.g., mean α=.83-.87: Chmielewski, Sala, Tang, & Baldwin, 2016; Litman,
Robinson, & Rosenzweig, 2015).
Analyses. To examine whether there has been an increase in low-quality data (Aim 1), we
compared the frequency of participants failing validity indicators, in the prescreen and primary
data, across each wave using a chi-square test. Next, to test whether the quality of data and
results obtained using MTurk have been adversely affected (Aim 2), we evaluated the basic
psychometric properties (i.e., Cronbach’s alpha for internal consistency reliability; correlations
between scales for internal validity) of the BFI in each wave. We also tested established
associations of personality (particularly neuroticism) with psychopathology by examining the
bivariate correlations of the BFI scales with self/family history of depressive disorder and with
self/family history of other mood disorders. Finally, to determine the degree to which response
validity indicators and screening improve data quality (Aim 3), we removed participants who
failed any validity indicator, recalculated the BFI’s psychometric properties, and retested the
association with mood disorders.
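The sketch below illustrates the general shape of these analyses in Python (pandas and scipy). The data frame, column names, and the neuroticism item numbers are assumptions for illustration (reverse-keyed BFI items would also need recoding before scoring); it is not the authors' analysis code.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a participants-by-items matrix."""
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    k = items.shape[1]
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical frame with a 'wave' column, a boolean 'failed_any_flag' column,
# and BFI item columns 'bfi_1' ... 'bfi_44'.
df = pd.read_csv("primary_study_scored.csv")

# Aim 1: chi-square test of flag rates across the four waves.
crosstab = pd.crosstab(df["wave"], df["failed_any_flag"])
chi2, p, dof, _ = chi2_contingency(crosstab)

# Aim 2: reliability of one BFI scale (illustrative item numbers) within a single wave.
neuroticism_items = [f"bfi_{i}" for i in (4, 9, 14, 19, 24, 29, 34, 39)]
alpha_wave3 = cronbach_alpha(df.loc[df["wave"] == 3, neuroticism_items])

# Aim 3: repeat after screening out participants who failed any validity indicator.
screened = df[~df["failed_any_flag"]]
alpha_wave3_screened = cronbach_alpha(screened.loc[screened["wave"] == 3, neuroticism_items])
print(chi2, p, alpha_wave3, alpha_wave3_screened)
```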
Results
Prescreen Data
The wave 3 (8.6%) and 4 (7.3%) prescreens had significantly greater rates of flagged
responses than waves 1 (0.98%) and 2 (.26%), χ2 (3, N=191)=198.66, p <.001.
Primary Data
Response validity indicators. A significantly higher proportion of individuals were
flagged in the “post-concern” (wave 3=62%, wave 4=38.2%; wave 3 > 4) than the “pre-concern”
waves (wave 1=10.4%, wave 2=13.8%; wave 1 = wave 2), χ2(3, N=1231)=245.93, p<.001. Regarding
specific validity indicators, there were higher rates of response inconsistency in wave 3 (42.3%)
than waves 1 (6.92%), 2 (11.08%), or 4 (10.3%), χ2(3, N=1231)=168.68, p<.001. No
statistically improbable responses were flagged in waves 1 and 2; significantly more were
flagged in waves 3 (12.3%) and 4 (18.6%), χ2(3, N=1231)=113.92, p<.001. The percentage of
disqualified participants in waves 3 (34.7%) and 4 (27.2%) was also significantly greater than in
waves 1 (3.08%) and 2 (2.97%), χ2(3, N=1231)=178.87, p<.001. More unusual comments were
flagged during waves 3 (45.3%) and 4 (29.9%) than either of the prior waves (wave 1: 2.3%;
wave 2: 1.6%), χ2(3, N=1231)=274.33, p<.001. Taken together, these data demonstrate a
significant increase in the number of MTurk participants providing invalid or low-quality data
despite the exact same study and HIT requirements being used. It is also worth noting that the
response validity indicators were moderately correlated with each other (Table 1).
Table 1.
Correlation between Response Validity Indicators

Flag Reason                | 1                 | 2                 | 3                 | 4
1. Response Inconsistency  |                   |                   |                   |
2. Improbable Responses    | .30*** (.23, .38) |                   |                   |
3. Disqualified            | .43*** (.37, .50) | .43*** (.35, .50) |                   |
4. Unusual Comments        | .42*** (.36, .49) | .45*** (.38, .52) | .53*** (.46, .59) |

Note: N=1231. ***p<.001. Parentheses = bootstrapped 95% confidence intervals (10,000 permutations).
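The bootstrapped confidence intervals reported in Table 1 (and in the later tables) can be approximated with a generic percentile bootstrap over participants. The sketch below, using toy 0/1 flag vectors, shows the idea and is not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2019)

def bootstrap_corr_ci(x: np.ndarray, y: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Point estimate and percentile-bootstrap CI for a Pearson correlation
    (used here between 0/1 validity-flag variables)."""
    n = len(x)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                # resample participants with replacement
        estimates[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return np.corrcoef(x, y)[0, 1], (lower, upper)

# Toy 0/1 flag vectors standing in for two validity indicators.
inconsistency = rng.integers(0, 2, 500)
unusual_comment = (inconsistency + rng.integers(0, 2, 500) > 1).astype(int)
print(bootstrap_corr_ci(inconsistency, unusual_comment))
```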
Impact on psychometric properties and replicating established findings. In the
unscreened data, Cronbach’s alphas (Table 2) in wave 1 (mean=.85, range=.81-.87) and wave 2
(mean=.85, range=.81-.88) were in line with the literature and past MTurk studies. However, this
was not the case in waves 3 (mean=.74, range=.71-.76) and 4 (mean=.75, range=.71-.80). In fact,
with the exception of openness, all alphas were significantly lower in waves 3 and 4 than they
were in waves 1 and 2 (see footnote 4); there were no significant differences between waves 3 and 4 or between
waves 1 and 2. Because Cronbach’s alpha is an indicator of how consistently participants
respond to items assessing similar content, random responding lowers alpha levels.
Footnote 4: All p-values are reported in the supplement.
Next, we removed all participants who failed any validity indicator and recalculated
alphas (Table 2; see footnote 4). For waves 1 and 2, there were no significant differences between the screened
and unscreened data. However, all alphas in the screened wave 3 data were significantly higher
than in the unscreened data. In wave 4, 60% of alphas were higher in the screened than in the unscreened
data. Moreover, only 2 of the 30 cross-wave comparisons were significant in the screened data.
These results suggest that 1) data collected during 2017 and earlier tended to be high quality
even if the data were not screened, 2) data collected from summer 2018 through spring 2019 are
less reliable, and 3) data screening may help ameliorate these issues.
Table 2
Cronbach’s Alpha

Unscreened         | Wave 1         | Wave 2         | Wave 3         | Wave 4
Neuroticism        | .87 (.85, .90) | .88 (.87, .90) | .76 (.71, .80) | .76 (.71, .80)
Extraversion       | .88 (.85, .90) | .88 (.86, .90) | .71 (.65, .75) | .72 (.67, .77)
Openness           | .81 (.77, .84) | .81 (.78, .83) | .76 (.72, .80) | .76 (.72, .80)
Agreeableness      | .84 (.81, .87) | .83 (.81, .86) | .72 (.66, .76) | .71 (.66, .76)
Conscientiousness  | .87 (.85, .89) | .85 (.82, .87) | .75 (.71, .79) | .80 (.76, .83)
Mean               | .85            | .85            | .74            | .75

Screened           | Wave 1         | Wave 2         | Wave 3         | Wave 4
Neuroticism        | .88 (.85, .90) | .89 (.87, .91) | .88 (.84, .91) | .85 (.82, .88)
Extraversion       | .88 (.85, .90) | .88 (.86, .90) | .86 (.82, .90) | .82 (.78, .86)
Openness           | .82 (.78, .85) | .81 (.77, .84) | .85 (.80, .89) | .83 (.79, .87)
Agreeableness      | .84 (.81, .87) | .83 (.80, .86) | .85 (.81, .89) | .78 (.73, .83)
Conscientiousness  | .87 (.84, .89) | .84 (.81, .87) | .87 (.82, .90) | .84 (.80, .87)
Mean               | .86            | .85            | .86            | .83

Note: Parentheses = 95% confidence interval. Unscreened Wave 1 N = 260, Unscreened Wave 2
N = 370, Unscreened Wave 3 N = 300, Unscreened Wave 4 N = 301; Screened Wave 1 N = 224,
Screened Wave 2 N = 288, Screened Wave 3 N = 102, Screened Wave 4 N = 167.
Next, we evaluated the BFI inter-correlations in the unscreened data (Table 3). It is
worth noting that inter-correlations across all waves were higher than anticipated based on the
existing literature (see footnote 5). In particular, agreeableness and conscientiousness were very highly
correlated; associations in wave 3 (r=.72) were significantly higher than waves 1 (r=.56, p<.01)
and 2 (r=.58, p<.01) but not 4 (r=.66, p=.19). Considering correlations between all BFI scales (see footnote 6),
there were no significant differences between waves 1 and 2. However, 70% of correlations
were significantly different in wave 3 compared to either wave 1 or 2; 60% differed between
waves 1 and 4, 40% between waves 2 and 4, and 30% between waves 3 and 4. We then removed
participants who failed any validity indicator and reran the analyses. Results in the screened
wave 1 and wave 2 data were virtually identical to the unscreened data. However, the screened
waves 3 and 4 data aligned more closely with waves 1 and 2. There were no significant
differences between waves 1 and 2, between waves 1 and 3, or between waves 2 and 3. 30% of
correlations were different between waves 1 and 4, 20% between waves 2 and 4, and 10%
between waves 3 and 4 (see footnote 6).
Footnote 5: Acquiescent responding can increase correlations between BFI scales (Rammstedt & Farmer, 2013;
Soto, John, Gosling, & Potter, 2008). Controlling for acquiescence tended to decrease BFI inter-correlations
to expected levels (see supplemental data); associations with psychopathology were generally unchanged.
Footnote 6: All tests are two-tailed; p values for all comparisons are reported in the supplement; results in the
screened data should be interpreted considering the sample sizes.
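The article does not state which test was used for the cross-wave comparisons of correlations; one standard option for correlations from independent samples is the Fisher r-to-z test, sketched below for the wave 3 versus wave 1 agreeableness-conscientiousness comparison.

```python
import math
from scipy.stats import norm

def compare_independent_correlations(r1: float, n1: int, r2: float, n2: int) -> float:
    """Two-tailed p value for the difference between correlations from two independent
    samples, using the Fisher r-to-z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    return 2 * norm.sf(abs(z))

# Agreeableness-conscientiousness correlation: wave 3 (r=.72, n=300) vs. wave 1 (r=.56, n=260).
print(compare_independent_correlations(0.72, 300, 0.56, 260))
```

With these inputs the test returns p of roughly .001, which is consistent with the p<.01 reported above, although the bootstrapped comparisons in the paper may differ slightly.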
Table 3
Internal validity: Correlation between BFI scales

Wave 1 | N                     | E                    | O                   | A                   | C
N      |                       | -.48**4 (-.59,-.36)  | -.23** (-.36,-.10)  | -.48** (-.58,-.37)  | -.58**4 (-.66,-.49)
E      | -.47**34 (-.58,-.35)  |                      | .29** (.17,.42)     | .30** (.17,.42)     | .30**4 (.17,.43)
O      | -.23**4 (-.35,-.10)   | .29** (.16,.40)      |                     | .38** (.26,.49)     | .28** (.15,.40)
A      | -.46**34 (-.56,-.36)  | .27**3 (.15,.38)     | .40**34 (.29,.51)   |                     | .53** (.42,.63)
C      | -.57** (-.65,-.48)    | .28**34 (.16,.39)    | .28**3 (.15,.40)    | .56**34 (.45,.65)   |

Wave 2 | N                     | E                    | O                   | A                   | C
N      |                       | -.50**4 (-.59,-.39)  | -.25** (-.37,-.13)  | -.48** (-.57,-.38)  | -.48** (-.56,-.40)
E      | -.51**34 (-.60,-.41)  |                      | .31** (.21,.41)     | .22** (.11,.33)     | .29**4 (.18,.38)
O      | -.24**4 (-.35,-.13)   | .33** (.24,.42)      |                     | .34** (.22,.45)     | .33** (.22,.43)
A      | -.47**3 (-.56,-.38)   | .23** (.13,.33)      | .37**3 (.26,.47)    |                     | .54** (.44,.63)
C      | -.47**3 (-.54,-.39)   | .29**34 (.20,.38)    | .36**34 (.26,.46)   | .58**3 (.50,.67)    |

Wave 3 | N                     | E                    | O                   | A                   | C
N      |                       | -.43** (-.61,-.23)   | -.33**4 (-.51,-.12) | -.53** (-.67,-.38)  | -.56** (-.68,-.42)
E      | -.28**12 (-.43,-.11)  |                      | .45** (.26,.61)     | .18 (-.02,.36)      | .21* (.02,.38)
O      | -.11 (-.25,.05)       | .43**4 (.29,.55)     |                     | .35** (.14,.54)     | .22* (.04,.40)
A      | -.60**124 (-.70,-.49) | .091 (-.07,.24)      | .21**12 (.06,.35)   |                     | .63** (.48,.76)
C      | -.61**24 (-.71,-.51)  | .0912 (-.07,.25)     | .0712 (-.08,.22)    | .72**12 (.62,.80)   |

Wave 4 | N                     | E                    | O                   | A                   | C
N      |                       | -.26**12 (-.42,-.10) | -.103 (-.26,.07)    | -.43** (-.55,-.29)  | -.41**1 (-.53,-.29)
E      | -.21**12 (-.35,-.06)  |                      | .25** (.08,.41)     | .23** (.08,.37)     | .0612 (-.09,.21)
O      | -.0512 (-.18,.09)     | .27**3 (.13,.40)     |                     | .37** (.24,.50)     | .26** (.11,.41)
A      | -.47**13 (-.57,-.36)  | .13* (-.01,.27)      | .25**1 (.13,.38)    |                     | .59** (.47,.69)
C      | -.48**3 (-.58,-.37)   | .0112 (-.13,.16)     | .17**2 (.04,.30)    | .66**1 (.57,.74)    |

Note. *p<.05, **p<.01. Correlations in unscreened data are below the diagonal; correlations in
screened data are above the diagonal. Wave 1 N = 260 unscreened, 233 screened; Wave 2 N =
370 unscreened, 319 screened; Wave 3 N = 300 unscreened, 114 screened; Wave 4 N = 301
unscreened, 186 screened. Parentheses = bootstrapped 95% confidence intervals (10,000
permutations). Subscripts indicate the specific waves in which the same association is
significantly different within screened or within unscreened data, p<.05, two-tailed.
Next, we tested the well-established associations of personality with psychopathology
(Table 4). Specifically, decades of research and multiple meta-analyses (Kotov et al., 2010;
Malouff et al., 2005) demonstrate that neuroticism has substantial links to the mood and anxiety
disorders, whereas extraversion, conscientiousness, and agreeableness tend to demonstrate
weaker (negative) associations. As expected, neuroticism was significantly correlated with
depressive and other mood disorders at anticipated magnitudes (see footnote 7) in the unscreened wave 1 (r=.45,
p<.01; r=.19, p<.01, respectively) and wave 2 (r=.33, p<.01; r=.15, p<.01) data. It was also
significantly associated in wave 4 (r=.22, p<.01; r=.17, p<.01). However, neuroticism was not
significantly associated with depressive (r=.10, p=.086) or other mood (r=.06, p=.274) disorders
in the unscreened wave 3 data. The association between neuroticism and depressive disorders
was significantly stronger in wave 1 than waves 3 or 4 (p<.01, two-tailed) and stronger in wave 2
than wave 3 (p<.01, two-tailed); no other differences were significant.
Footnote 7: See the preregistration.
The other Big Five traits demonstrated associations that generally aligned with the
literature in waves 1 and 2. However, in wave 3 agreeableness and conscientiousness were
significantly positively associated with depression, a finding that goes against the general
literature and is different from waves 1 (p<.01, p<.001, respectively) and 2 (p=.08, p<.001,
respectively). In wave 4, openness was significantly positively correlated with depression, which
goes against the literature and the other waves (p’s<.05). In other words, multiple findings in the
unscreened wave 3 data, and to some extent wave 4, are in contrast to decades of research on
links between the Big Five and psychopathology.
Results in the screened wave 1 and 2 data (Table 4) were nearly identical to the
unscreened data. In the screened wave 3 data, neuroticism was significantly correlated with
depressive disorders; agreeableness and conscientiousness were not. In the screened wave 4
data, extraversion was significantly negatively correlated with depressive disorders; however,
openness was, unexpectedly, significantly positively associated with depressive disorders. In
sum, the screened wave 3 and 4 data, with the exception of wave 4 openness, were no longer
egregiously out of line with the established knowledge base.
Table 4
Correlations between BFI and mood disorders

   | Unscreened: Depressive Disorder | Unscreened: Other mood disorder | Screened: Depressive Disorder | Screened: Other mood disorder

Wave 1
N  | .45**34 (.35, .44)    | .18** (.04, .31)    | .48**24 (.37, .58)  | .20** (.05, .33)
E  | -.22** (-.33, -.12)   | -.18** (-.30, -.03) | -.24** (-.35, -.12) | -.19** (-.32, -.05)
O  | -.074 (-.19, .05)     | -.02 (-.15, .10)    | -.094 (-.22, .05)   | -.02 (-.15, .11)
A  | -.13*3 (-.25, -.03)   | -.05 (-.19, .07)    | -.16* (-.29, -.04)  | -.06 (-.20, .07)
C  | -.22**234 (-.33, -.11)| -.03 (-.12, .06)    | -.24**34 (-.36, -.11)| -.04 (-.14, .06)

Wave 2
N  | .33**3 (.23, .45)     | .14** (.03, .25)    | .33**1 (.22, .43)   | .13* (.01, .23)
E  | -.15** (-.25, -.04)   | -.15** (-.27, -.04) | -.15** (-.25, -.04) | -.15** (-.27, -.03)
O  | -.034 (-.12, .07)     | .01 (-.10, .11)     | -.084 (-.18, .03)   | -.01 (-.12, .11)
A  | -.02 (-.13, .08)      | -.01 (-.12, .10)    | -.04 (-.14, .06)    | -.03 (-.14, .08)
C  | -.0513 (-.16, .06)    | -.04 (-.16, .07)    | -.07 (-.18, .03)    | -.05 (-.17, .07)

Wave 3
N  | .1012 (-.03, .22)     | .06 (-.06, .18)     | .30* (.10, .48)     | .11 (-.14, .34)
E  | -.19** (-.32, -.04)   | -.08 (-.19, .02)    | -.20* (-.38, -.00)  | -.15 (-.31, -.00)
O  | -.024 (-.13, .09)     | .08 (-.03, .18)     | .07 (-.10, .23)     | .07 (-.14, .26)
A  | .12**1 (-.00, .24)    | .00 (-.11, .12)     | .02 (-.17, .20)     | .06 (-.17, .27)
C  | .15**12 (.03, .28)    | -.02 (-.11, .10)    | .041 (-.16, .23)    | .084 (-.08, .24)

Wave 4
N  | .22**1 (.10, .33)     | .17** (.05, .27)    | .31**1 (.16, .45)   | .22** (.07, .36)
E  | -.13* (-.26, .00)     | -.05 (-.18, .08)    | -.15* (-.31, .01)   | .07 (-.23, .10)
O  | .15*123 (.03, .26)    | .04 (-.09, .16)     | .18*12 (.02, .32)   | .01 (-.15, .17)
A  | .00 (-.12, .12)       | -.08 (-.19, .04)    | -.02 (-.17, .13)    | -.10 (-.25, .07)
C  | .001 (-.12, .12)      | -.12* (-.22, -.03)  | -.031 (-.19, .13)   | -.18*3 (-.31, -.04)

Note. *p<.05, **p<.01. Parentheses = bootstrapped 95% confidence intervals (10,000
permutations). Wave 1 N=260 unscreened, 233 screened; Wave 2 N=370 unscreened, 319
screened; Wave 3 N=300 unscreened, 114 screened; Wave 4 N=301 unscreened, 186 screened.
Subscripts indicate the specific waves in which the same association is significantly different
within screened or within unscreened data at p<.05, two-tailed; see the supplement for all p values.
Discussion
The current naturalistic experiment, in which the exact same study was conducted
multiple times over four years, supports informal concerns regarding MTurk data quality. It
provides empirical evidence of an increase in the percentage of MTurkers providing low-quality
data, of a substantial negative impact on MTurk study results, and of failures to replicate well-
established findings. However, these effects can be mitigated, to some degree, by including validity
indicators and screening MTurk data. We discuss these findings below, providing
recommendations for MTurk research.
Low Quality Data
The percentage of participants failing at least one validity indicator in waves 3 (62%) and
4 (38.2%) is concerning, especially compared to waves 1 (10.4%) and 2 (13.8%). At a
minimum, this indicates researcher time, funds, and other resources are wasted. Our results also
suggest that the negative impact on data quality noted in summer 2018 appears to have persisted
to some degree into spring of 2019. In line with our hypothesis, wave 4 generally demonstrated
worse data quality than waves 1 and 2, but better than wave 3. Perhaps awareness of MTurk data
issues has led to Turkers being banned or more HITs being rejected, thereby disincentivizing
farmers/bots somewhat; alternatively, farmers, bots, and other invalid responders may have
become more sophisticated at avoiding detection (Sylaska & Mayer, 2019).
Impact on Study Results
Given the tendency for researchers to not report using validity indicators and the
prevalence of studies using unscreened data (Wood et al., 2017), results in the unscreened wave
3 and 4 data are alarming. It is particularly concerning that well-established associations
between personality and psychopathology failed to replicate in the unscreened wave 3 data and
that anomalies existed in the unscreened wave 4 data. Moreover, the current results are important for
measurement research, which often uses MTurk to achieve the necessary sample sizes. Clearly,
the use of unscreened data could lead to improper scale development decisions or inaccurate
conclusions about the performance of existing measures. This is critical, as optimal measurement is
essential for the continued advancement of science and poor measurement has been suggested as
one cause of the replication crisis (Chmielewski et al., 2016; Flake, Pek, & Hehman, 2017).
Taken together, our results suggest that, starting sometime after spring 2017 and continuing
through at least April 2019, the use of unscreened MTurk data may have detrimental impacts on
study outcomes and conclusions.
Data collection. We echo past calls for including validity indicators (Aust et al., 2013;
Barger et al., 2011; Kennedy, Clifford, Burleigh, Jewell, & Waggoner, 2018; Kittur et al., 2008;
Mason & Suri, 2011; Wood et al., 2017; Zhu & Carterette, 2010) and recommend screening all
responses before approving HITS so likely bots, farmers and other invalid responders are
rejected. This will require constant management of studies to pay legitimate Turkers in a timely
AN MTURK CRISIS? 18
manner; however, considerable funds are wasted when paying invalid responders and rejecting
invalid responders will reduce their rating, making it less likely they qualify for other studies and
preventing reinforcement. Researchers can also create qualification block lists to prevent
Turkers who fail validity indicators from participating in their future studies. However, it is
critical that Turkers who provide valid data do not have their HIT rejected unjustly; doing so is
unethical and unfair. One solution is a two-tier screening approach: obvious bots and farmers are
rejected; less obvious cases are approved though removed from the final dataset.
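Qualification-based block lists of the kind recommended above can be managed programmatically through the MTurk API. The boto3 sketch below is illustrative only: the qualification name, worker ID, and region are placeholders, and requesters would need to point the client at the sandbox or production endpoint as appropriate.

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# One-time setup: create a qualification that marks workers who failed validity indicators.
qual = mturk.create_qualification_type(
    Name="Failed validity screening",          # illustrative name
    Description="Assigned to workers whose submissions failed response validity indicators.",
    QualificationTypeStatus="Active",
)
qual_id = qual["QualificationType"]["QualificationTypeId"]

# Assign the qualification to a flagged worker (placeholder ID), without notifying them.
mturk.associate_qualification_with_worker(
    QualificationTypeId=qual_id,
    WorkerId="A1EXAMPLEWORKERID",
    IntegerValue=1,
    SendNotification=False,
)

# When posting future HITs, exclude anyone holding the qualification.
block_requirement = {
    "QualificationTypeId": qual_id,
    "Comparator": "DoesNotExist",
    "ActionsGuarded": "DiscoverPreviewAndAccept",
}
# ...pass [block_requirement] as QualificationRequirements in create_hit().
```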
We also recommend using multiple types of validity indicators; the included indicators
were moderately correlated with each other, suggesting they all tap into related aspects of low-
quality responding. The current results also provide initial support for the “unusual comments”
validity indicator. Interestingly, many of the unusual comments were phrases that appear when
one googles the question asked. In addition to the validity indicators included, recent research
indicates that response time in seconds per item (SPI) and profile correlations are important
validity indicators (Wood et al., 2017); SPI in particular is easily implemented and places no
burden on participants. Indeed, the lack of SPI is one limitation of the current research as it
began prior to SPI’s publication/validation. Although not reported, SPI was included in waves 4
and 4a; total study completion time (which lacks precision) was available in the other waves. All
timing data significantly correlated with the other validity indicators (see supplemental
materials). In addition, captchas and honey pots (computer code invisible to humans) may help
eliminate bots, although care should be taken to not overwhelm participants.
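Computing SPI from timing data is straightforward, as the sketch below illustrates; the column names and the two-seconds-per-item cutoff are assumptions for illustration (Wood et al., 2017, discuss empirically grounded screening rules), not validated thresholds.

```python
import pandas as pd

SPI_THRESHOLD = 2.0  # illustrative cutoff in seconds per item; not a validated value

# Hypothetical columns: total time on the questionnaire pages (seconds) and number of items shown.
df = pd.read_csv("timing_data.csv")
df["spi"] = df["questionnaire_seconds"] / df["n_items"]

# Flag implausibly fast responders for closer inspection alongside the other validity indicators.
df["fast_responder_flag"] = df["spi"] < SPI_THRESHOLD
print(df["fast_responder_flag"].mean())  # proportion flagged
```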
It is worth noting that traditional attention check items (e.g., select “agree” for this item),
which were not included in the current research, may no longer catch participants who provide invalid
data (Sylaska & Mayer, 2019). Moreover, Qualtrics recently recommended against using
attention checks due to evidence that they result in participants providing lower quality data
(Vannette, 2017). Similarly, instructional manipulation checks (IMCs) which were previously
recommended as a way to ensure high quality data (see Oppenheimer, Meyvis, & Davidenko,
2009) have been found to alter participants’ responses in potentially problematic ways (see
Hauser & Schwarz, 2015). IMCs also remove actual participants with certain characteristics
(e.g., those lower in conscientiousness, individuals with lower cognitive ability, particular age
groups, specific demographic groups; see Berinsky, Margolis, & Sances, 2014; Vannette, 2016)
thereby biasing samples; this has led Qualtrics to recommend against their use (Vannette,
2017). Relatedly, it is essential to recognize that overly stringent criteria (e.g., dropping
participants who miss only 10% of attention checks) or other burdensome validity tasks may also
bias samples. Such biased samples become particularly problematic when these individual
differences are the constructs of interest, as screening on them restricts their range and can remove the exact
participants researchers are interested in studying. As such, it is essential to balance screening
with the potential for creating biased/unrepresentative samples.
Clearly, more research on the performance of specific validity indicators, what screening
“cut points” or algorithms should be used, and how to ensure validity tasks or HIT requirements
do not bias the sample is necessary. In addition, the development of validity scales for online
samples could prove useful. Other options, such as third-party MTurk services claiming to
eliminate low-quality data, tracking geolocations/IP addresses, incentivizing participants who
provide high-quality data (Barger et al., 2011), programs tracking whether participants are “on
task” (Permut, Fisher, & Oppenheimer, 2019), and other online sample sources (e.g., Prolific),
may offer additional tools for researchers.
Publishing and evaluating research. In line with open science, we strongly recommend
that authors report, and reviewers request, detailed information for MTurk studies, such as the dates the
data were collected, HIT requirements, validity indicators, screening decisions, and the number of
participants dropped. Researchers should also report the psychometric properties of the
measures in the studied sample and compare them to previous research, when available, as this
provides valuable information about the performance of the measure and the quality of data itself
(Chmielewski, Clark, Bagby, & Watson, 2015).
Limitations. Although researchers often do not report HIT requirements, it is
important to note that the current research used HIT qualifications that were less stringent than
previous recommendations (Peer, Vosgerau, & Acquisti, 2014), which could have impacted the
generalizability of our results. However, the percentages of participants failing validity indicators
in waves 3 and 4 are similar to informal reports and recent publications (Aruguete et al., 2019;
Courrégé, Skeel, Feder, & Boress, 2019; Dreyfuss, 2018). Moreover, results were nearly
identical in an additional wave (4a) of data collected using HIT requirements commonly reported
in the literature. As such, higher HIT qualifications across all waves may have slightly reduced
percentages of low-quality data, but the general pattern and findings would likely remain.
Nevertheless, replicating the current pattern of results in studies using higher HIT requirements
(Peer et al., 2014) is important.
Conclusion
MTurk has been an important resource for psychological science. Nevertheless, there is
compelling evidence of a decrease in MTurk data quality, which can have a substantial negative
impact on study results and conclusions. Even if the current crisis passes, similar issues may
arise again (Kennedy et al., 2018). Therefore, to ensure the continued advancement of science
AN MTURK CRISIS? 21
and integrity of online studies, thoughtful data screening and detailed reporting of screening and
study designs must be the standard operating procedure.
AN MTURK CRISIS? 22
References
Aruguete, M. S., Huynh, H., Browne, B. L., Jurs, B., Flint, E., & McCutcheon, L. E. (2019).
How serious is the ‘carelessness’ problem on Mechanical Turk? International Journal of
Social Research Methodology, 22(5), 441–449.
https://doi.org/10.1080/13645579.2018.1563966
Aust, F., Diedenhofen, B., Ullrich, S., & Musch, J. (2013). Seriousness checks are useful to
improve data validity in online research. Behavior Research Methods, 45(2), 527–535.
Bagby, R. M., Young, L. T., Schuller, D. R., Bindseil, K. D., Cooke, R. G., Dickens, S. E., …
Joffe, R. T. (1996). Bipolar disorder, unipolar depression and the Five-Factor Model of
personality. Journal of Affective Disorders, 41(1), 25–32.
Bai, H. (2018). Evidence that a large amount of low quality responses on MTurk can be detected
with repeated GPS coordinates. Retrieved February 4, 2019, from Sights + Sounds
website: http://www.maxhuibai.com/1/post/2018/08/evidence-that-responses-from-repeating-gps-are-random.html
Barger, P., Behrend, T. S., Sharek, D. J., & Sinar, E. F. (2011). IO and the crowd: Frequently
asked questions about using Mechanical Turk for research. The Industrial-Organizational
Psychologist, 11.
Behrend, T. S., Sharek, D. J., Meade, A. W., & Wiebe, E. N. (2011). The viability of
crowdsourcing for survey research. Behavior Research Methods, 43(3), 800.
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source
of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1), 3–5.
Buhrmester, M., Talaifar, S., & Gosling, S. D. (2018). An evaluation of Amazon’s Mechanical
Turk, its rapid rise, and its effective use. Perspectives on Psychological Science, 13(2),
149–154.
Casler, K., Bickel, L., & Hackett, E. (2013). Separate but equal? A comparison of participants
and data gathered via Amazon’s MTurk, social media, and face-to-face behavioral
testing. Computers in Human Behavior, 29(6), 2156–2160.
Chmielewski, M., Clark, L. A., Bagby, R. M., & Watson, D. (2015). Method matters:
Understanding diagnostic reliability in DSM-IV and DSM-5. Journal of Abnormal
Psychology, 124(3), 764.
Chmielewski, M., Sala, M., Tang, R., & Baldwin, A. (2016). Examining the construct validity of
affective judgments of physical activity measures. Psychological Assessment, 28(9),
1128–1141. https://doi.org/10.1037/pas0000322
Clark, L. A., & Watson, D. (1991). Tripartite model of anxiety and depression: Psychometric
evidence and taxonomic implications. Journal of Abnormal Psychology, 100(3), 316–336.
https://doi.org/10.1037/0021-843X.100.3.316
Courrégé, S. C., Skeel, R. L., Feder, A. H., & Boress, K. S. (2019). The ADHD Symptom
Infrequency Scale (ASIS): A novel measure designed to detect adult ADHD simulators.
Psychological Assessment, 31(7), 851–860.
Dennis, S. A., Goodson, B. M., & Pearson, C. (2018). MTurk Workers’ Use of Low-Cost
“Virtual Private Servers” to Circumvent Screening Methods: A Research Note (SSRN
Scholarly Paper No. ID 3233954). Retrieved from Social Science Research Network
website: https://papers.ssrn.com/abstract=3233954
Dreyfuss, E. (2018, August 17). A bot panic hits Amazon’s Mechanical Turk. Wired. Retrieved
from https://www.wired.com/story/amazon-mechanical-turk-bot-panic/
Eriksson, K., & Simpson, B. (2010). Emotional reactions to losing explain gender differences in
entering a risky lottery. Judgment and Decision Making, 5(3), 159–163.
Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality
research: Current practice and recommendations. Social Psychological and Personality
Science, 8(4), 370–378.
Goodman, J. K., Cryder, C. E., & Cheema, A. (2013). Data collection in a flat world: The
strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision
Making, 26(3), 213–224. https://doi.org/10.1002/bdm.1753
Hauser, D. J., & Schwarz, N. (2015). It’s a trap! Instructional manipulation checks prompt
systematic thinking on “tricky” tasks. Sage Open, 5(2), 2158244015584617.
Horton, J. J., Rand, D. G., & Zeckhauser, R. J. (2011). The online laboratory: Conducting
experiments in a real labor market. Experimental Economics, 14(3), 399–425.
John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The Big Five Inventory–Versions 4a and 54.
Berkeley, CA: Berkeley Institute of Personality and Social Research, University of
California.
John, O. P., & Srivastava, S. (1999). The Big Five Trait taxonomy: History, measurement, and
theoretical perspectives. In Handbook of personality: Theory and research (2nd ed., pp.
102–138). New York, NY, US: Guilford Press.
Kees, J., Berry, C., Burton, S., & Sheehan, K. (2017). An analysis of data quality: Professional
panels, student subject pools, and Amazon’s Mechanical Turk. Journal of Advertising,
46(1), 141–155.
Kennedy, R., Clifford, S., Burleigh, T., Jewell, R., & Waggoner, P. (2018). The Shape of and
Solutions to the MTurk Quality Crisis (SSRN Scholarly Paper No. ID 3272468).
Retrieved from Social Science Research Network website:
https://papers.ssrn.com/abstract=3272468
Kittur, A., Chi, E. H., & Suh, B. (2008). Crowdsourcing user studies with Mechanical Turk.
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
Florence, Italy, 453–456. https://doi.org/10.1145/1357054.1357127
Kotov, R., Gamez, W., Schmidt, F., & Watson, D. (2010). Linking “big” personality traits to
anxiety, depressive, and substance use disorders: A meta-analysis. Psychological
Bulletin, 136(5), 768–821. https://doi.org/10.1037/a0020327
Krueger, R. F., & Tackett, J. L. (Eds.). (2006). Personality and psychopathology. New York:
The Guilford Press.
Malouff, J. M., Thorsteinsson, E. B., & Schutte, N. S. (2005). The relationship between the Five-
Factor Model of personality and symptoms of clinical disorders: A meta-analysis.
Journal of Psychopathology and Behavioral Assessment, 27(2), 101–114.
https://doi.org/10.1007/s10862-005-5384-y
Marge, M., Banerjee, S., & Rudnicky, A. I. (2010). Using the Amazon Mechanical Turk for
transcription of spoken language. IEEE International Conference on Acoustics, Speech
and Signal Processing, Dallas, TX, 5270–5273. https://doi.org/10.1109/ICASSP.2010.5494979
Mason, W., & Suri, S. (2011). Conducting behavioral research on Amazon’s Mechanical Turk.
Behavior Research Methods, 44(1), 1–23. https://doi.org/10.3758/s13428-011-0124-6
Mason, W., & Watts, D. J. (2009). Financial incentives and the performance of crowds.
Proceedings of the ACM SIGKDD Workshop on Human Computation, Paris, France,
7785. DOI: 10.1145/1600150.1600175
McCreadie, R. M., Macdonald, C., & Ounis, I. (2010). Crowdsourcing a news query
classification dataset. CSE, Geneva, Switzerland, 3138..
Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks:
Detecting satisficing to increase statistical power. Journal of Experimental Social
Psychology, 45(4), 867-872.
Paolacci, G., & Chandler, J.(2014). Inside the Turk: Understanding Mechanical Turk as a
participant pool. Current Directions in Psychological Science, 23(3), 184-188.
Peer, E., Vosgerau, J., & Acquisti, A. (2014). Reputation as a sufficient condition for data quality
on Amazon Mechanical Turk. Behavior Research Methods, 46(4), 10231031.
Permut, S., Fisher, M., & Oppenheimer, D. M. (2019). TaskMaster: A tool for determining when
subjects are on task. Advances in Methods and Practices in Psychological Science, 2(2),
188196. https://doi.org/10.1177/2515245919838479
Rammstedt, B., & Farmer, R. F. (2013). The impact of acquiescence on the evaluation of
personality structure. Psychological Assessment, 25(4), 1137.
Shapiro, D. N., Chandler, J., & Mueller, P. A. (2013). Using Mechanical Turk to study clinical
populations. Clinical Psychological Science, 2167702612469015.
Sheehan, K. B. (2018). Crowdsourcing research: Data collection with Amazon’s Mechanical
Turk. Communication Monographs, 85(1), 140156.
AN MTURK CRISIS? 27
Soto, C. J., John, O. P., Gosling, S. D., & Potter, J. (2008). The developmental psychometrics of
big five self-reports: Acquiescence, factor structure, coherence, and differentiation from
ages 10 to 20. Journal of Personality and Social Psychology, 94(4), 718.
Stewart, N., Ungemach, C., Harris, A. J., Bartels, D. M., Newell, B. R., Paolacci, G., &
Chandler, J. (2015). The average laboratory samples a population of 7,300 Amazon
Mechanical Turk workers. Judgment and Decision Making, 10(5), 479–491.
Stokel-Walker, C. (2018, October 1). Bots on Amazon’s Mechanical Turk are ruining
psychology studies. Retrieved February 4, 2019, from New Scientist website:
https://www.newscientist.com/article/2176436-bots-on-amazons-mechanical-turk-are-ruining-psychology-studies/
Suri, S., & Watts, D. J. (2011). Cooperation and contagion in web-based, networked public
goods experiments. PloS One, 6(3), e16836.
Sylaska, K., & Mayer, J. D. (2019, June 28). It’s 2019: Do We Need Super Attention Check Items
to Conduct Web-Based Survey Research? The Evolution of MTurk Survey Respondents.
Presented at the Association for Research in Personality, Grand Rapids, MI.
U.S. Census Bureau. (2018). Historical Households Tables, Households by size. Retrieved
February 4, 2019, from https://www.census.gov/data/tables/time-series/demo/families/households.html
Vannette, D. (2017, June 29). Using attention checks in your surveys may harm data quality.
Retrieved July 18, 2019, from Qualtrics website: https://www.qualtrics.com/blog/using-attention-checks-in-your-surveys-may-harm-data-quality/
Widiger, T. A., & Trull, T. J. (2007). Plate tectonics in the classification of personality disorder:
Shifting to a dimensional model. American Psychologist, 62(2), 7183.
https://doi.org/10.1037/0003-066X.62.2.71
Wood, D., Harms, P. D., Lowman, G. H., & DeSimone, J. A. (2017). Response speed and
response consistency as mutually validating indicators of data quality in online samples.
Social Psychological and Personality Science, 8(4), 454464.
https://doi.org/10.1177/1948550617703168
Zhu, D., & Carterette, B. (2010). An analysis of assessor behavior in crowdsourced preference
judgments.Presented at the SIGIR 2010 Workshop on Crowdsourcing for Search
Evaluations, Geneva, Switzerland, 1720.
... Existing research also indicates that mental health disorders can be found in MTurk samples (Arditte et al., 2016;Kim & Hodgins, 2017) and that the psychometric properties of some clinical measures are adequate when completed by MTurk participants (Arditte et al., 2016;Shapiro et al., 2013). However, some have questioned the recent data quality of psychological research obtained through MTurk (Chmielewski & Kucker, 2020). ...
... Can high-quality psychotherapy research data currently be collected through MTurk using the strategies suggested by Chmielewski and Kucker (2020)? ...
... The specific steps that were taken in the current study included the use of a screening instrument, including cognizant responding items, setting a minimum response time and screening the data for completeness and erroneous response patterns (e.g., answering all items at the highest value). In recent years, the quality of data collected on MTurk has been called into question (Chmielewski & Kucker, 2020). Specifically, Chmielewski and Kucker presented evidence that MTurk data quality decreased significantly beginning in summer 2018. ...
Article
Objective Although researchers have begun to use MTurk for psychotherapy‐related studies, little is known about the quality of psychotherapy research data that can be obtained through this platform. The fourfold purpose of this study was to (1) compare data collection times between clients recruited through MTurk and those recruited through traditional clinic sources, (2) test for differences in demographic and treatment use variables between these two samples, (3) examine whether psychotherapy research data gathered through these different strategies shows comparable properties and (4) test whether recruitment decisions within MTurk can impact the quality of the data obtained. Method Clients (828 recruited through MTurk and 62 recruited through outpatient clinics) completed an online survey that included treatment history questions and six measures that are frequently used in psychotherapy research. Results Highly similar demographics, treatment history and properties for the measures (i.e., means, standard deviations, internal consistency, factor structure and convergent validity) were found between the samples. For the most part, the quality of the MTurk data did not differ depending on the compensation amount or participant restrictions. Conclusion These results suggest that MTurk can be a quick, reliable and valid resource for participant recruitment when conducting some types of psychotherapy research.
... Researchers have most frequently noted these data quality issues when using social media platforms, such as Twitter, Facebook, and Reddit to recruit survey participants (Chung et al. 2019;Dermody 2022;Levi et al. 2021;Pozzar et al. 2020;Storozuk et al. 2020;Vu et al. 2021). Amazon's Mechanical Turk platform put a qualifications system in place after an influx of bots and survey scammers infiltrated it (Chmielewski and Kucker 2019;Amazon Mechanical Turk 2021), but most other online survey platforms have not implemented adequate measures to protect against fraudulent data. Running online scams has become a lucrative business, and step-by-step directions for how to make money through online scams have permeated social media sites such as Tik Tok (Greenwood 2021). ...
... When cleaning data to search for fraudulent entries, researchers must set criteria to determine which responses will be considered valid. Some exclusion criteria noted in previous studies include the quality of response to open-ended questions, IP address or location, and survey duration or time at completion (Chmielewski and Kucker 2019;Griffin et al. 2021;Storozuk et al. 2020;Pozzar et al. 2020;Sterzing et al. 2018;Levi et al. 2021;Vu et al. 2021). Additionally, some studies screen participants' email addresses for those that match the patterns of purchased, bulk email accounts (Griffin et al. 2021;Storozuk et al. 2020). ...
... Other studies have used unlikely responses as an indicator of fraudulent data, such as reporting having greater than 10 children, or reporting having immigrated in a year prior to the reported birth year (Chmielewski and Kucker, 2019;Vu et al. 2021). Although helpful for ensuring better data quality, using these strategies requires time to screen for potential scammers and systems for compensating only those deemed "real" participants (Storozuk et al. 2020;Pozzar et al. 2020;Sterzing et al. 2018;Vu et al. 2021). ...
Article
Full-text available
Online recruitment methods for survey-based studies have become increasingly common in social science research. However, they are susceptible to a high rate of participation by fraudulent research subjects. The current study identified fraudulent (i.e., “fake”) participants in an online research study of parents of 13 to 18-year-old adolescents, and compared demographic, anthropometric, and subjective health data between “fake” (N = 1084) and “real” (N = 197) participants. Of 1,281 subjects who started the eligibility survey, 84.6% were coded as “fake.” “Fake” participants were less diverse in race/ethnicity and more diverse in gender. Their depression symptoms were inflated, but ratings of perceived health were comparable to “real” participants. Well-established correlations, such as that between BMI and perceived health, were not replicated with “fake” participants. Online surveys are highly vulnerable to fraudulent research subjects whose participation compromises the validity and interpretability of results. The discussion provides a guide and recommendations for improving data quality in online survey research.
... Generally, MTurk respondents have been found to be better representatives of the general population than participants from convenient samples, such as college student samples (Buhrmester et al., 2011;Berinsky et al., 2012) or professional panels (Kees et al., 2017;Zhang & Gearhart, 2020;Berry et al., 2022). However, recent studies have found that data quality from respondents on MTurk and other crowdsourcing platforms has decreased since 2015 (Chmielewski & Kucker, 2020;Kennedy et al., 2020;Peer et al., 2022). While MTurk participants are more attentive to instructions and in answering survey questions than college students , it is therefore crucial to vet respondents and to include screening criteria to improve data quality (Kees et al., 2017;Chmielewski & Kucker, 2020;Agley et al., 2022;Berry et al., 2022). ...
... However, recent studies have found that data quality from respondents on MTurk and other crowdsourcing platforms has decreased since 2015 (Chmielewski & Kucker, 2020; Kennedy et al., 2020; Peer et al., 2022). Even though MTurk participants are more attentive to instructions and survey questions than college students, it is crucial to vet respondents and to include screening criteria to improve data quality (Kees et al., 2017; Chmielewski & Kucker, 2020; Agley et al., 2022; Berry et al., 2022). We consequently required our study's participants to respond to an informed consent question, to answer an attention check question, and to provide a unique code at the end of the survey (Aguinis et al., 2021; Agley et al., 2022). ...
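As a concrete illustration of these three checks, the hedged TypeScript sketch below verifies the consent response, the attention check, and the unique end-of-survey code against the set of codes the survey actually issued. All identifiers (Submission, passesBasicChecks, the sample values) are invented for the example and are not drawn from the cited study.

```typescript
// Hypothetical sketch: verify consent, the attention check, and the unique
// end-of-survey code before approving a respondent.

interface Submission {
  workerId: string;
  consentGiven: boolean;
  attentionCheckAnswer: string;
  completionCode: string;
}

function passesBasicChecks(
  s: Submission,
  expectedAttentionAnswer: string,
  issuedCodes: Set<string>,
): boolean {
  return (
    s.consentGiven &&
    s.attentionCheckAnswer === expectedAttentionAnswer &&
    issuedCodes.has(s.completionCode) // code must match one the survey actually generated
  );
}

// Example usage with made-up data:
const issued = new Set(["KX93-20A1", "QM47-77B2"]);
const demo: Submission = {
  workerId: "A1B2C3",
  consentGiven: true,
  attentionCheckAnswer: "strongly agree",
  completionCode: "KX93-20A1",
};
console.log(passesBasicChecks(demo, "strongly agree", issued)); // true
```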
Article
Full-text available
This study first examines the influence of educational factors on a consumer’s willingness to buy green products and on building a brand’s green image. Second, it explores the effects of environmental concern and perceived consumer effectiveness in mediating the relationships between educational factors and green buying behavior. Third, it takes a cross-country perspective by investigating green buying behavior under distinct cultural contexts (collectivism versus individualism). The hypothesized model was tested with data collected in the United States and Brazil and analyzed using structural equation modeling. Findings reveal that sustainability education, whether initiated by the consumer or by the organization, contributes positively to promoting a brand’s green image. Environmental concern and perceived consumer effectiveness both mediate the relationships between educational factors and green buying behavior. Lastly, the moderating effects of culture highlight the importance of environmental concern in a collectivist country and perceived consumer effectiveness in an individualist country.
... The characteristics, risks, and benefits of using MTurk for academic research have been reported on extensively.[14][15][16][17][18][19] MTurk was used in the present study to recruit adult respondents to complete a general health survey that included demographics, health history, and the Patient-Reported Outcomes Measurement Information System (PROMIS®) Global-10 measure.[20] This study focuses on respondents who reported having LBP at baseline and 3-month follow-up based on the following question: “Do you currently have back pain?” ...
... Several quality control measures were implemented to address potential data quality concerns.[17] Studies have shown that including MTurk participants with a 95% completion rate on at least 500 previous jobs improves response quality and sample representativeness, so this threshold was included as a requirement for respondents in the present study.[18] Additional quality control measures included: (1) participants were not told that this study was targeting individuals with LBP; (2) small batches of surveys were deployed hourly over several weeks to reduce selection bias; and (3) two fake conditions (Syndomitis and Chekalism) were inserted in the health conditions checklist, and those endorsing either condition were excluded from the study. ...
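The two screens described here, a worker-experience threshold set when the task is posted and a decoy-condition check applied to the resulting data, can be expressed as simple rules. The TypeScript sketch below is only illustrative: in practice the 95%/500-job thresholds are configured as worker qualifications on the recruitment platform rather than filtered after the fact, and the record shape shown is an assumption, not the cited study's data format.

```typescript
// Hedged sketch of the two screens described above: experience thresholds
// and exclusion of anyone endorsing a fabricated health condition.

const MIN_APPROVAL_RATE = 95;   // percent of previously approved jobs
const MIN_PRIOR_JOBS = 500;     // number of previously completed jobs

const FAKE_CONDITIONS = ["Syndomitis", "Chekalism"]; // decoy checklist items

interface HealthSurveyRow {
  workerId: string;
  approvalRate: number;         // worker's approval rate, if available in the data
  priorJobs: number;
  endorsedConditions: string[]; // conditions checked on the health checklist
}

function meetsExperienceThreshold(row: HealthSurveyRow): boolean {
  // Normally enforced as a platform qualification at recruitment time;
  // shown here as a filter purely for illustration.
  return row.approvalRate >= MIN_APPROVAL_RATE && row.priorJobs >= MIN_PRIOR_JOBS;
}

function endorsedFakeCondition(row: HealthSurveyRow): boolean {
  return row.endorsedConditions.some((c) => FAKE_CONDITIONS.includes(c));
}

function retainRow(row: HealthSurveyRow): boolean {
  return meetsExperienceThreshold(row) && !endorsedFakeCondition(row);
}
```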
Article
Full-text available
Objective: To evaluate the associations of baseline demographics, health conditions, pain management strategies, and health-related quality-of-life (HRQoL) measures with pain management strategies at 3-month follow-up in respondents reporting current low-back pain (LBP). Study design: Cohort study of survey data collected from adults with LBP sampled from the Amazon Mechanical Turk crowdsourcing panel. Methods: Demographics, health conditions, and the Patient-Reported Outcomes Measurement Information System (PROMIS)-10 were included in the baseline survey. Respondents reporting LBP completed a more comprehensive survey inquiring about pain management strategies and several HRQoL measures. Bivariate and then multivariate logistic regression estimated odds ratios (ORs) with 95% confidence intervals (CIs) for the association between baseline characteristics and pain management utilization at 3-month follow-up. Model fit statistics were evaluated to assess predictive value. Results: The final cohort included 717 respondents with completed surveys. The most prevalent pain management strategy at follow-up was other care (n = 474), followed by no care (n = 94), conservative care only (n = 76), medical care only (n = 51), and medical and conservative care combined (n = 22). The conservative care only group had higher (better) mental and physical health PROMIS-10 scores than the medical care only and combination care groups, which had lower (worse) physical health scores. In multivariate models, estimated ORs (95% CIs) for the association between baseline and follow-up pain management ranged from 4.6 (2.7–7.8) for conservative care only to 16.8 (6.9–40.7) for medical care only. Additional significant baseline predictors included age, income, education, workers' compensation claim, Oswestry Disability Index score, and Global Chronic Pain Scale grade. Conclusions: This study provides important information regarding the association between patient characteristics, HRQoL measures, and LBP-related pain management utilization.
... Bots and farms: Similarly, Chmielewski et al. [30] reported a decrease in data quality, citing bot and farm activity. ...
Conference Paper
Full-text available
For nearly two decades, CAPTCHAs have been widely used as a means of protection against bots. Throughout the years, as their use grew, techniques to defeat or bypass CAPTCHAs have continued to improve. Meanwhile, CAPTCHAs have also evolved in terms of sophistication and diversity, becoming increasingly difficult to solve for both bots (machines) and humans. Given this long-standing and still-ongoing arms race, it is critical to investigate how long it takes legitimate users to solve modern CAPTCHAs, and how they are perceived by those users. In this work, we explore CAPTCHAs in the wild by evaluating users' solving performance and perceptions of unmodified currently-deployed CAPTCHAs. We obtain this data through manual inspection of popular websites and user studies in which 1,400 participants collectively solved 14,000 CAPTCHAs. Results show significant differences between the most popular types of CAPTCHAs: surprisingly, solving time and user perception are not always correlated. We performed a comparative study to investigate the effect of experimental context, specifically the difference between solving CAPTCHAs directly versus solving them as part of a more natural task, such as account creation. Whilst there were several potential confounding factors, our results show that experimental context could have an impact on this task, and must be taken into account in future CAPTCHA studies. Finally, we investigate CAPTCHA-induced user task abandonment by analyzing participants who start and do not complete the task.
Article
Background: Subjective socioeconomic status is robustly associated with many measures of health and well-being. The MacArthur Scale of Subjective Social Status (i.e., the MacArthur ladder) is the most widely used measure of this construct, but it remains unclear what exactly the MacArthur ladder measures. Purpose: The present research sought to explore the social and economic factors that underlie responses to the MacArthur ladder and its relationship to health. Methods: We investigated this issue by examining the relationship between scores on the MacArthur ladder and measures of economic circumstances and noneconomic social status, as well as health and well-being measures, in healthy adults in the USA. Results: In three studies (total N = 1,310), we found evidence that economic circumstances and social status are distinct constructs that have distinct associations with scores on the MacArthur ladder. Both factors exhibited distinct associations with measures of health and well-being and accounted for the association between the MacArthur ladder and each measure of health and well-being. Conclusions: Our findings suggest that the MacArthur ladder’s robust predictive validity may result from the fact that it measures two factors, economic circumstances and social status, that are each independently associated with health outcomes. These findings provide a novel perspective on the large body of literature that uses the MacArthur ladder and suggest that health researchers should do more to disentangle the social and economic aspects of subjective socioeconomic status.
Article
Purpose: The purpose of this paper is to discuss the issues in studying hard-to-reach or dispersed populations, with particular focus on methodologies used to collect data on and investigate dispersed migrant entrepreneurs, illustrating the shortcomings, pitfalls, and potential of accessing and disseminating research to hard-to-reach populations of migrant entrepreneurs. Design/methodology/approach: A mixed methodology is proposed to access hard-to-reach or dispersed populations, and this paper explores these methods using a sample of Brazilian migrants settled in different countries of the world. Findings: This paper explores empirical challenges, illustrating the shortcomings, pitfalls, and potential of accessing and disseminating research to hard-to-reach populations of migrant entrepreneurs. It provides insights by reporting research experiences developed over time by this group of researchers, reflecting a “mixing” of methods for accessing respondents, in contrast to a more rigid, a priori mixed-methods approach. Originality/value: The main contribution of this paper is to showcase experiences from, and the suitability of, remote data collection, especially for projects that cannot accommodate the physical participation of researchers because of time or cost constraints. It reports on researching migrant entrepreneurship overseas. Remote digital tools and online data collection are highly relevant due to their time and cost efficiency, but they also represent solutions for researching dispersed populations. The approaches presented allow researchers to overcome several barriers to data collection and offer instrumental characteristics for migrant research.
Article
Purpose: The rampant toxic gaming environment in most major esports games has become a challenge to maintaining gamers’ loyalty to the game. Guided by the theory of stress and coping, this study aims to investigate how and under what conditions esports gamers’ perceived risk of toxicity may affect game brand loyalty through the moderated mediation effects of game brand identification, self-efficacy, and perceived support from the game brand. Design/methodology/approach: The moderated mediation model was tested using conditional process analysis (N = 311). The moderating effects of game brand identification on the mediated processes were tested in the model. Findings: The authors found that self-efficacy and perceived support from the game brand were critical mediators between the perceived risk of toxicity and game brand loyalty. However, these mediating effects varied depending on the level of game brand identification. Originality/value: This study takes a step forward by theorizing and empirically examining the relationship between perceived risk of toxicity and consumption outcomes, considering both internal and social coping resources and game brand identification among Generation Z and Millennial gamers in the esports context.
Article
Full-text available
The current project outlines the development of the Attention-Deficit/Hyperactivity Disorder (ADHD) Symptom Infrequency Scale (ASIS), a stand-alone measure designed to identify individuals feigning or exaggerating symptoms to receive a diagnosis of ADHD. Over the course of 3 studies, valid data were collected from 402 participants assigned to control, simulator, ADHD-diagnosed, or possible undiagnosed ADHD groups. Group assignment was based on self-reported history of ADHD diagnosis, including information about the credentials of the diagnosing professional and the methods used. The ASIS includes an Infrequency Scale (INF) designed to detect rarely reported symptoms of ADHD and several clinical scales designed to measure genuine symptoms. The final version of the ASIS demonstrated high internal consistency for the INF (α = .96) and the ADHD Total scales (α = .96). Convergent validity for the ADHD Total was established through a strong correlation with the Barkley Adult ADHD Rating Scale-IV (r = .92). Initial validation of the INF yielded high discriminability between groups (d = 2.76; 95% confidence interval [2.17, 3.36]). The final INF scale demonstrated strong sensitivity (.79-.86) and excellent specificity (.89). Using our study's malingering base rate of 29%, positive and negative predictive values were strong (.71-.79 and .92-.93, respectively). Additional information is provided for a range of base rates. Current results suggest that the ASIS has potential as a reliable and valid measure of ADHD that is sensitive to malingering when compared with a sample of individuals self-reporting a history of ADHD diagnosis.
Article
Full-text available
This study compared the quality of survey data collected from Mechanical Turk (MTurk) workers and college students. Three groups of participants completed the same survey. MTurk respondents completed the survey as paid workers using the Mechanical Turk crowdsourcing platform. Student Online respondents also completed the survey online after having been recruited in class. Finally, Student Paper-and-Pencil respondents completed the survey on paper in a classroom setting. Validity checks embedded in the survey were designed to gauge participants’ haste and carelessness in survey completion. MTurk respondents were significantly more likely to fail validity checks by contradicting their own answers or simply completing the survey too quickly. Student groups showed fewer careless mistakes and longer completion times. The MTurk sample tended to be older, more educated, and more ethnically diverse than student samples. Results suggest that researchers should pay special attention to the use of validity checks when recruiting MTurk samples.
Article
Full-text available
Over the past 2 decades, many social scientists have expanded their data-collection capabilities by using various online research tools. In the 2011 article “Amazon’s Mechanical Turk: A new source of inexpensive, yet high-quality, data?” in Perspectives on Psychological Science, Buhrmester, Kwang, and Gosling introduced researchers to what was then considered to be a promising but nascent research platform. Since then, thousands of social scientists from seemingly every field have conducted research using the platform. Here, we reflect on the impact of Mechanical Turk on the social sciences and our article’s role in its rise, provide the newest data-driven recommendations to help researchers effectively use the platform, and highlight other online research platforms worth consideration.
Article
Individual differences have become increasingly important in the study of child development and language. However, despite the important role parents play in children’s language, no work has examined how parent personality impacts language development. The current study examines the impact of parent personality as well as child temperament on language development in 460 16- to 30-month-old children and 328 31- to 42-month-old children. Findings from both groups suggest multiple aspects of children’s language abilities are correlated with their parent’s personality. Specifically, parent conscientiousness, openness, and agreeableness positively correlate with child vocabulary size and other language abilities. Results also replicate and expand research on child temperament and language: child effortful control and surgency were positively correlated with, and negative affect was negatively correlated with, most language abilities, even after controlling for parent personality. Critically, parent and child traits appear to impact a child’s language abilities above and beyond well-known predictors of language, such as age.
Article
In this Tutorial, we introduce a tool that allows researchers to track subjects’ on- and off-task activity on Qualtrics’ online survey platform. Our TaskMaster tool uses JavaScript to create several arrayed variables representing the frequency with which subjects enter and leave an active survey window and how long they remain within a given window. We provide code and instructions that will allow researchers to both implement the TaskMaster and analyze its output. We detail several potential applications, such as in studies of persistence and cheating, and studies that require sustained attention to experimental outcomes. The TaskMaster is designed to be accessible to researchers who are comfortable designing studies in Qualtrics, but who may have limited experience using programming languages such as JavaScript.
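The general approach the TaskMaster tutorial describes, recording when a participant leaves and returns to the survey window and for how long, can be approximated with the browser's standard Page Visibility events. The TypeScript sketch below is an illustrative approximation under that assumption, not the published TaskMaster code; the variable names and the idea of dumping the episode array at unload are inventions for the example.

```typescript
// Illustrative sketch (not the published TaskMaster code) of logging when a
// participant leaves and returns to the survey tab, using standard browser events.

interface FocusEpisode {
  onTask: boolean;      // true while the survey tab was visible
  startedAtMs: number;
  durationMs: number;
}

const episodes: FocusEpisode[] = [];
let episodeStart = Date.now();
let currentlyOnTask = !document.hidden;

function closeEpisode(): void {
  const now = Date.now();
  episodes.push({
    onTask: currentlyOnTask,
    startedAtMs: episodeStart,
    durationMs: now - episodeStart,
  });
  episodeStart = now;
}

document.addEventListener("visibilitychange", () => {
  closeEpisode();                     // record the episode that just ended
  currentlyOnTask = !document.hidden; // flip on/off-task state
});

window.addEventListener("beforeunload", () => {
  closeEpisode();
  // In a survey platform, these arrays would be written into hidden or
  // embedded data fields so they appear alongside the participant's responses.
  console.log(JSON.stringify(episodes));
});
```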
Article
Researchers in a variety of disciplines use Amazon’s crowdsourcing platform called Mechanical Turk as a way to collect data from a respondent pool that is much more diverse than a typical student sample. The platform also provides cost efficiencies over other online panel services and data can be collected very quickly. However, some researchers have been slower to try the platform, perhaps because of a lack of awareness of its functions or concerns with validity. This article provides an overview of Mechanical Turk as an academic research platform and a critical examination of its strengths and weaknesses for research. Guidelines for collecting data that address issues of validity, reliability, and ethics are presented.
Article
We examine the appropriateness of response speed and response consistency as data quality indicators within online samples. Across several inventories, results show that response consistency decreases dramatically at response rates faster than 1 second per item. Our results suggest that careless responding may be fairly common in online samples and often functions to increase the expected correlation between items in a survey, which has implications for the likelihood of false positives and the analysis of factor structure. Given how careless responding can influence estimated associations between variables, we strongly recommend that researchers include response speed and consistency screens in their research and provide empirically informed cut points for data screens that should be useful across a wide range of instruments and settings.
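A minimal version of the recommended speed screen is easy to express: compute each respondent's average seconds per item and flag anyone below roughly the one-second-per-item rate reported above. The sketch below is hypothetical (field names and the exact cut-off would depend on the instrument), and a full implementation would pair it with a consistency screen, which is omitted here.

```typescript
// Sketch of a response-speed screen: flag anyone averaging less than one
// second per item. Field names are hypothetical.

interface TimedResponse {
  participantId: string;
  numItems: number;     // items in the inventory
  totalSeconds: number; // time spent on the inventory
}

function flagSpeeders(rows: TimedResponse[], minSecondsPerItem = 1): string[] {
  return rows
    .filter((r) => r.totalSeconds / r.numItems < minSecondsPerItem)
    .map((r) => r.participantId);
}

// Example: a 100-item inventory completed in 70 seconds gets flagged.
console.log(flagSpeeders([{ participantId: "p01", numItems: 100, totalSeconds: 70 }]));
```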
Article
Data collection using Internet-based samples has become increasingly popular in many social science disciplines, including advertising. This research examines whether one popular Internet data source, Amazon's Mechanical Turk (MTurk), is an appropriate substitute for other popular samples utilized in advertising research. Specifically, a five-sample between-subjects experiment was conducted to help researchers who utilize MTurk in advertising experiments understand the strengths and weaknesses of MTurk relative to student samples and professional panels. In comparisons across five samples, results show that the MTurk data outperformed panel data procured from two separate professional marketing research companies across various measures of data quality. The MTurk data were also compared to two different student samples, and results show the data were at least comparable in quality. While researchers may consider MTurk samples as a viable alternative to student samples when testing theory-driven outcomes, precautions should be taken to ensure the quality of data regardless of the source. Best practices for ensuring data quality are offered for advertising researchers who utilize MTurk for data collection.