Thinking & Reasoning, 2017
http://dx.doi.org/10.1080/13546783.2017.1292954
The cognitive reflection test revisited: exploring the
ways individuals solve the test
B. Szaszi a,b, A. Szollosi b,c, B. Palfi b and B. Aczel b

a Doctoral School of Psychology, Eötvös Loránd University, Budapest, Hungary; b Institute of Psychology, Eötvös Loránd University, Budapest, Hungary; c School of Psychology, The University of New South Wales, Sydney, Australia

CONTACT: B. Szaszi, szaszi.barnabas@gmail.com
ABSTRACT
Individuals’ propensity not to override the first answer that comes to mind is thought to be a crucial cause behind many failures in reasoning. In the present study, we aimed to explore the strategies used and the abilities employed when individuals solve the cognitive reflection test (CRT), the most widely used measure of this tendency. Alongside individual differences measures, protocol analysis was employed to unfold the steps of the reasoning process in solving the CRT. This exploration revealed that there are several ways people solve or fail the test. Importantly, in 77% of the cases in which reasoners gave the correct final answer in our protocol analysis, they started their response with the correct answer or with a line of thought which led to the correct answer. We also found that 39% of the incorrect responders reflected on their first response. The findings indicate that the suppression of the first answer may not be the only crucial feature of reflectivity in the CRT and that the lack of relevant knowledge is a prominent cause of the reasoning errors. Additionally, we confirmed that the CRT is a multi-faceted construct: both numeracy and reflectivity account for performance. The results can help to better apprehend the “whys and whens” of the decision errors in heuristics and biases tasks and to further refine existing explanatory models.
ARTICLE HISTORY Received 6 June 2016; Accepted 27 January 2017
KEYWORDS Cognitive reflection test; process-tracing; reasoning; thinking errors; heuristics and biases
Introduction
In the decades-long aim of psychological research to understand errors in
human thinking, the cognitive reflection test (CRT; Frederick, 2005) has
become a pivotal tool to measure a unique dimension of individual differen-
ces. The three-item test was originally created to assess one type of cognitive
ability or disposition: the capacity to suppress the “incorrect intuitive” answer and substitute it with the correct one.¹
¹ The responses in the CRT are often grouped into three categories: “intuitive incorrect” (10 cents, 100 machines, 24 days); “non-intuitive incorrect” (any other answer); and “non-intuitive correct” (5 cents, 5 machines, 47 days).

The bat and the ball problem is the
most well-known example from the test: A bat and a ball cost $1.10 in total.
The bat costs $1 more than the ball. How much does the ball cost? The task can
trigger a misleading answer (in this case, 10 cents), which the participants
need to overcome before engaging in further reflection to arrive at the cor-
rect solution (5 cents). These supposed steps of the reasoning process make
the CRT a paradigmatic demonstration of the fallibility of human thinking.
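Spelled out (a clarifying addition, not text from the original article), the correct solution follows from treating the problem as a simple pair of equations, writing b for the price of the ball and B for the price of the bat:

\[
B + b = 1.10, \qquad B = b + 1.00
\;\;\Rightarrow\;\; (b + 1.00) + b = 1.10
\;\;\Rightarrow\;\; 2b = 0.10
\;\;\Rightarrow\;\; b = 0.05.
\]

The ball therefore costs 5 cents and the bat $1.05, whereas the tempting “10 cents” answer would make the total $1.20.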
Since its publication, the original paper introducing the CRT (Frederick,
2005) has been cited over 1900 times.²
The cause of its popularity is multifac-
eted: it possesses high face validity, it is easy to administer, it predicts decision
performance in many different situations, and it correlates with a great num-
ber of other measures. Just to highlight a few examples, individuals with
higher CRT scores are more disposed to avoid decision biases (Toplak, West,
& Stanovich, 2011,2014) and perform better on general ability measures
(Liberali, Reyna, Furlan, Stein, & Pardo, 2012; Stupple, Ball, & Ellis, 2013). The
CRT also predicts intertemporal behaviour (Frederick, 2005), risky choice
(Cokely & Kelley, 2009; Frederick, 2005), utilitarian moral judgement (Paxton,
Ungar, & Greene, 2012), conservatism (Pennycook, Cheyne, Seli, Koehler, &
Fugelsang, 2012), and belief in the supernatural (Gervais & Norenzayan,
2012). Extended versions of the CRT have been created (e.g., Baron, Scott,
Fincher, & Metz, 2014; Primi, Morsanyi, Chiesi, Donati, & Hamilton, 2015;
Thomson & Oppenheimer, 2016; Toplak et al., 2014), as the original three
items of the CRT became increasingly well known to the public.
Besides its growing popularity in empirical studies, the theoretical founda-
tions of the test have been repeatedly questioned. Two closely related sets of
issues prevail in the current discussions: first, what does the CRT measure?
And second, what are the steps of the reasoning process when people try to
solve the test?
Regarding the first issue, most researchers argue that the CRT assesses
reflectivity. Two views dominate the literature about the interpretation of
reflectivity. The most popular interpretation was proposed by Frederick
(2005), conceptualising cognitive reflection as “the ability or disposition to
resist reporting the response that first comes to mind” (p. 35). This approach
of reflectivity has been promoted by, among others, Toplak et al. (2011) who
considered the CRT as a measure of miserly processing, referring to people’s
tendency to rely on heuristic processing instead of using more cognitively
expensive analytical processes. The explanation of both of these research
groups builds on the assumption that the key property of the CRT is that first
an “incorrect intuitive” answer comes to the mind, and then late suppression
mechanisms need to intervene and override the heuristic answer to be able
to reach a normative solution by further deliberation.
² Based on Google Scholar, January 2017.
Cokely and Kelley (2009) were the first to extend the dominant theoretical
framework that only emphasised the role of late suppression mechanisms.
They argued that early selection control mechanisms (Jacoby, Kelley, &
McElree, 1999) may play an important role in the reflective behaviour. They
proposed that people scoring higher on the CRT process information more
elaborately and tend to use more thorough search processes. Baron et al.
(2014) provided evidence for this hypothesis. In their study, they created no-
lure versions of the CRT³
and found that these items loaded on the same fac-
tor as the standard CRT items. Additionally, both types of items (lure, no-lure)
correlated to a similar extent with other measures, such as the actively open-
minded thinking (AOT; Baron, 1993) or belief bias syllogisms (BBS; Evans,
Barston, & Pollard, 1983). As the authors did not find evidence to support the
claim that the suppression of an initial response tendency is relevant in the
CRT, but observed that the test assesses the extensiveness of search, they
concluded that the CRT is a measure of reflection-impulsivity (RI; Kagan,
Rosman, Day, Albert, & Phillips, 1964). This, in turn, is an indicator of cognitive
style where there is a relative preference for impulsivity (speed) versus reflec-
tion (accuracy).
There is a parallel discussion concerning the CRT as a measurement tool. It
has been argued that the CRT measures solely numeracy⁴
as its items are
numerical tasks. Moderate-to-strong correlations have been found between
the CRT and other assessments of numeracy (Finucane & Gullion, 2010; Liberali
et al., 2012). Welsh, Burns, and Delfabbro (2013) observed that the CRT has pre-
dictive power only on those heuristics and biases tasks where numeracy plays
a role in arriving at the correct solution. They concluded that the CRT assesses
numerical abilities rather than the inhibition of a prepotent response. Other
studies, employing factor analysis techniques, found that the CRT items loaded
on the same factor as other numerical items (Baron et al., 2014; Låg, Bauger, Lindberg, & Friborg, 2014; Study 1 in Liberali et al., 2012; Weller et al., 2013).
Sinayev and Peters (2015) studied whether numeric abilities or cognitive reflec-
tion are responsible for the predictive power of the CRT. Based on the
observed performance on the CRT, they estimated two variables: the numerical
score was calculated as the proportion of correct responses, while the cognitive
reflection score was computed as the proportion of “non-intuitive” answers.
They observed that only the numerical scores in the CRT accounted for perfor-
mance on other decision-making and heuristics and biases tasks.
³ No-lure CRT tasks are CRT-like arithmetic problems that supposedly do not trigger an “intuitive incorrect” response. For example, “If it takes 1 nurse 5 min to measure the blood pressure of 6 patients, how many minutes would it take 100 nurses to measure the blood pressure of 300 patients?” (Baron, Scott, Fincher, & Metz, 2014).
⁴ Numeracy is one’s ability to store, represent and process mathematical operations (Peters, 2012).

However, other results support the idea that in addition to numeracy, reflective ability is also involved in solving the CRT successfully. In contrast to
Welsh et al.’s (2013) findings, Campitelli and Labollita (2010) observed that the CRT correlates with tasks without a mathematical component. Pennycook and
Ross (2016) reviewed evidence that the CRT was predictive of a diverse range
of variables even after controlling for numeracy. Liberali et al. (2012) found
that the bivariate correlations between the CRT and the numeracy scales
were not high and the CRT items loaded on a numeracy-independent factor
based on the results of the factor analysis. The authors concluded that the
CRT is not just another test of numeracy, but also added that the CRT and
objective numeracy are, in fact, related. Campitelli and Gerrans (2014) applied
a mathematical modelling approach to tackle the conundrum. They esti-
mated an inhibition parameter employing BBS and the AOT. They also
assessed a numerical parameter using a numeracy scale. The results indicated
that the models including both an inhibition parameter and a mathematical
component fitted the data better than a model including only a mathematical
parameter.
Most studies using the CRT employed some explicit or tacit assumptions
about the steps involved in the reasoning process of the CRT. Although a few
studies tried to explore these assumptions, the analyses were based on aggre-
gated data (e.g., Mata, Ferreira, & Sherman, 2013; Travers, Rolison, & Feeney,
2016), giving rise to methodological limitations. More specifically, data aggre-
gation can overshadow the existence of subgroups that may follow different
strategies when solving the test (Fific, 2014).
According to the most common understanding of the CRT, suppression of
a first answer is a necessary step for good performance. This view about the
task relies on two important assumptions. First, it assumes that even those
who give the correct answer start their thinking with an “incorrect intuitive”
response, although they are able to suppress it. Frederick (2005) postulated
that even the correct responders consider first the incorrect answer, based on
the observation that the “10 cents” answer was often crossed out next to the “5 cents” answer in the bat and the ball problem. Mata et al. (2013) found evi-
dence that a majority of the correct responders were aware of the “intuitive
response”. Nevertheless, the authors did not control in their study for the
time-course assumption of the reasoning process which is theoretically cru-
cial, as it is possible that those who indicated awareness of the “intuitive
response” may have had a correct first response and only later, during the
deliberation period, did they take into account the incorrect alternative
response. Travers et al. (2016) used a computer-mouse tracking paradigm,
where participants were asked to choose an answer on each CRT task by click-
ing on one of four response options on the screen. The authors observed that
individuals who solved the tasks correctly tended to move the mouse more
slowly away from the “incorrect intuitive” response than from other “non-intuitive incorrect” response options before clicking on the correct answer. Never-
theless, based on these findings, it is difficult to conclude whether or not
there were responders whose first answer was correct. The results imply only
that, on average, correct responders are more likely to start their thinking
with the “intuitive incorrect” response than with other incorrect answers, and
not that they never start their thinking with the correct response. Further-
more, the results of some recent studies suggest that there are individuals
with correct intuitions. For example, Peters (2012) argues that people with
higher numeracy “rely on their superior number intuitions” (p. 32) and based
on the Fuzzy Trace theory (Reyna, Nelson, Han, & Dieckmann, 2009), she also
claims that they may “derive a richer gist from numbers” (Peters, 2012, p. 32).
Supporting this idea, Thompson and Johnson (2014) reported that some indi-
viduals responded normatively on reasoning tasks when they were asked to
report the initial answer that comes to mind. These tasks – similarly to the CRT – are thought to trigger an incorrect response that needs to be sup-
pressed in order to arrive at the correct answer. The authors argued that cog-
nitive capacity drove the production of the initial correct response.
Svedholm-Häkkinen’s (2015) experiments provided more evidence for the
same idea: when solving BBS, high-ability people did not show the sign of
belief-inhibition; that is, they seemed to start to think using normative logic.
According to the second underlying assumption of the suppression-focused
interpretation of the CRT, those who give the incorrect heuristic answer do not
reflect on it. Otherwise, as Frederick (2005, p. 27) argues, “even a moment” of
reflection would lead to the recognition of the failure. Previous studies have
found that people spend more time (Johnson, Tubau, & De Neys, 2016) and show longer distances travelled by the mouse cursor (Travers et al., 2016) on
correct responses than on the “intuitive incorrect”answers. However, these
results only support the idea that, on average, people deliberate more before
producing the correct responses and one cannot conclude that the incorrect
responders did not reflect. Furthermore, the fact that incorrect responders
were not aware of the correct response (Mata et al., 2013; Travers et al., 2016)
does not imply that these individuals did not reason analytically (Elqayam &
Evans, 2011). In contrast to this assumption, Meyer, Spunt, and Frederick (2015)
observed that many of their participants failed to solve the bat and the ball
problem despite the fact that they had been warned to think carefully about it.
Moreover, previous findings have also brought evidence that deliberation does
not necessarily lead to the change of the initial incorrect intuition: for instance,
it has been repeatedly shown that people use reflective reasoning to rationalise
or justify their first thoughts in the Wason selection task (Evans, 1996; Evans &
Ball, 2010; Wason & Evans, 1975).
The current research
Our study includes both exploratory and confirmatory research. First, we
aimed to explore the skills required to solve the CRT successfully. To identify
THINKING & REASONING 5
the crucial individual differences behind good performance on the CRT, we
used one numeracy and four reflectivity tests. The rationale for using several
measures of reflectivity is that there are competing theoretical concepts of
reflectivity and there is no agreement on a single and valid assessment
approach. Consequently, one of our aims was to find which reflectivity mea-
sure predicts best the performance on the CRT since this analysis can help us
reveal which conceptualisation of reflectivity is captured by the CRT.
Furthermore, we aimed to explore the strategies employed when individu-
als solve the CRT. Here, we focused on two crucial questions concerning the
above-detailed assumptions of the most widely used interpretation of the
CRT. First, we aimed to explore the proportion of correct responses in the CRT
in which the reasoners start their response with the correct answer or with a
line of thought which led to the correct answer. Second, we studied the pro-
portion of the incorrect responses in which the reasoners reflect on the
answer that first comes to their mind. Note that the first and second questions
focus on the correct and incorrect cases, respectively. To investigate the strat-
egies employed, we used protocol analysis (Ericsson & Simon, 1980), which
has been found to be a valid method for studying thought processes without
altering performance (Fox, Ericsson, & Best, 2011; for limitations see: De Neys
& Glumicic, 2008; Reisen, Hoffrage, & Mast, 2008). Besides the fact that this
method has been used in several studies in the decision-making literature to
track thinking processes (e.g., Brandstätter & Gussmack, 2013; Cokely & Kelley,
2009; Tor & Bazerman, 2003), we used protocol analysis as it provided some
unique advantages. For instance, with the use of this method, we could differ-
entiate individuals who deliberated after reporting a first answer from those
who did not deliberate, without interrupting the reasoning process, and while
still being able to keep the CRT tasks open ended and not reducing the num-
ber of alternative answer options.
We formulated a number of additional hypotheses to test the validity of
the findings of the protocol analysis. First, we hypothesised that it takes more
time to solve the problems correctly in cases where the responders start their
response with the incorrect answer or with a line of thought leading to the
incorrect answer (Incorrect start) than when they start their response with the
correct answer or with a line of thought leading to the correct answer (Correct
start). Second, we expected that there would be no significant difference in
terms of reaction time (RT) and social desirability between the “Correct start”
and “Incorrect start” cases. Finding that individuals in the “Correct start” cases have longer RTs or are more socially desirable would indicate the presence of a confound in our data: that is, “Correct start” people may also suppress their
first thought but do not verbalise it in our protocol analysis. Third, we
expected that incorrect responders spend more time on solving the problems
when they reflect on the first answer that comes to their mind (Reflective)
compared to when they do not deliberate on it (Non-reflective).
Finally, based on the assumption that individual differences can predict
the usage of different reasoning strategies (e.g., Peters, 2012; Thompson &
Johnson, 2014), we aimed to test two confirmatory hypotheses. First, we
hypothesised that individuals with higher numeracy scores more often
have “Correct start” than their less numerate counterparts. Second, we
hypothesised that individuals who score higher on the reflectivity scale will
more often deliberate after the first answer that comes to their mind than
people who score lower on the same scale. Prior to data collection, the deci-
sion was made that for the purpose of testing the hypothesis about reflectiv-
ity and deliberation, we would use the reflectivity scale that had been found
to best predict the CRT performance.
Method
Participants
Two hundred and nineteen students (75% female, M = 22.04 years, SD = 2.28)
participated in our study. The participants were recruited through the univer-
sity subject pool and they received course credit in exchange for their partici-
pation. All participants were native speakers of Hungarian and signed an
informed ethical consent form. As nine participants indicated after the proto-
col analysis that they were familiar with the CRT questions, they were
excluded from the online session and the analysis.
Procedure
The study consisted of an offline and an online session. For the offline session,
participants were invited to the lab to participate in a personal interview. First,
they were informed that the session would be recorded and later analysed.
This was followed by the detailed verbal instruction of the protocol and a
warm-up session. After that, participants were asked to solve the three items
of the CRT⁵
in the standard order whilst thinking aloud. Not to have any unde-
sired influence, the experimenter was seated behind the participants and pro-
vided no feedback regarding the participant’s performance on the CRT.
Participants were asked to read aloud the tasks, and then to think aloud while
working on the questions but not to explain their thoughts. They were also
requested to indicate when they felt that they were finished with the problems.
Finally, participants were asked whether they were familiar with the CRT tasks.
⁵ The European version of the bat and ball problem was administered where the cost of the bat and the ball is given in €.
THINKING & REASONING 7
During the online sessions, participants completed the following ques-
tionnaires and ability measures in a fixed order using the Qualtrics survey
software tool in installments: AOT (Baron, 1993), rational-experiential inven-
tory (REI; Pacini & Epstein, 1999), BBS (De Neys, Moyens, & Vansteenwegen,
2010), Berlin numeracy test (BNT; Cokely, Galesic, Schulz, Ghazal, & Garcia-
Retamero, 2012), semantic illusions (SIs; Mata, Schubert, & Ferreira, 2014)
and finally the balanced inventory of desirable responding (BIDR; Paulhus,
1991).
Materials
Numeracy measure
We used the computer adaptive version of the BNT (Cokely et al., 2012) to
measure numeracy. The BNT predicts the comprehension of everyday risk,
and the performance on the CRT and many other decision-making tests more
strongly than other numerical instruments. Additionally, it is able to differenti-
ate between highly educated individuals. The test consists of two or three
questions adaptively selected based on the former answers.
Reflectivity measures
Participants were asked to fill out the AOT (see Appendix A.1) which meas-
ures people’s tendency to consider several possible answers when facing a
question, to search for evidence supporting an answer other than their previ-
ously established answer, and to seek evidence against their favoured
answer (Baron, 1993). We used the eight-item version of the AOT (Haran,
Ritov, & Mellers, 2013) supplemented by three additional items which
increase the overall reliability of the original scale (Baron, personal
communication).
We also administered the 20-item rationality scale from the REI (Pacini &
Epstein, 1999) which measures the degree to which a person engages in and
enjoys effortful cognitive activity. The inventory separates the construct of
Rationality from Faith in Intuition. In this test, participants are asked to indi-
cate on a five-point Likert-scale how much statements such as “I enjoy intel-
lectual challenges” are judged to be true for themselves.
Three valid and three invalid BBS were presented in a random order (see
Appendix A.1). Four of our items were adopted from De Neys et al.’s (2010)
study, and two additional items were developed by our research group. BBS
can be used as a reflectivity measure because the supposed underlying mech-
anism behind performance on BBS items is the same as behind the CRT items.
People tend to decide upon the logical validity of the syllogisms based on the
believability of the conclusion, which is thought to be an intuitive response.
Supposedly, people have to suppress the first intuition and engage in effortful
reasoning to arrive at the correct answer (Evans, 2003).
A set of SIs (Mata et al., 2014) was also administered. SI tests are usually
used to measure the degree to which individuals process verbal or written
information carefully and accurately without containing any mathematical
content (Barton & Sanford, 1993; Erickson & Mattson, 1981). Consequently, we
presumed that SI could potentially assess reflective processing without mea-
suring numeracy. The SI block consisted of six questions containing SIs where
to give the right answer participants needed to realise the semantic inconsis-
tency embedded into the question (e.g., “How many animals of each kind did
Moses take on the Ark?”) and two simple general knowledge questions (see
Appendix A.1). These latter general knowledge questions were used so partic-
ipants would not become suspicious once they detected the illusions. The SIs
were adapted from Mata et al. (2014). Based on a similar thinking, Thomson
and Oppenheimer (2016) also created an alternate form of the CRT using
tasks with non-numerical content.
Social desirability measure
Participants were also asked to fill out the BIDR (Paulhus, 1991). BIDR meas-
ures the responder’s tendency to answer in a way that makes them socially
desirable in order to manage self-presentation. The BIDR consists of two sub-
scales (Self-Deceptive Enhancement, Impression Management), from which
only the second one was administered for the purpose of this study. The sub-
scale consists of 20 items, such as “I sometimes drive faster than the speed
limit”, and the responders had to report their answer on a seven-point rating
scale.
Bayes factor
As no scientific inference can be made to the hypotheses from statistically
non-significant results alone (Dienes, 2014), we calculated Bayes factors (B) to supplement the frequentist analyses and used it to determine whether the null results in this study imply data-insensitivity or provide evidence for the null hypotheses. B is a statistical measure which can be used to assess the degree to which the data support one hypothesis compared to another one. To interpret the B values, we employed Jeffreys’s (1961) sensitivity criterion. Accordingly, B values less than 1/3 indicate substantial evidence for the null while B values more than 3 indicate substantial evidence for the alternative hypothesis. B values between 1/3 and 3 show that the data are insensitive and should not be used as scientific evidence towards any of the hypotheses.
THINKING & REASONING 9
For the B calculations, we applied the B calculator of Dienes (2008) implemented in R.⁶
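To make the procedure concrete, the following is a minimal sketch of a Dienes-style Bayes factor for a normally distributed effect estimate with H1 modelled as a half-normal distribution (the approach described in footnote 6); the function name and the numbers in the example call are illustrative, not taken from the reported analyses.

# Sketch of a Dienes-style Bayes factor with a half-normal model of H1.
bf_half_normal <- function(obs_mean, obs_se, h1_sd) {
  # Likelihood of the observed effect for a given true effect theta.
  likelihood <- function(theta) dnorm(obs_mean, mean = theta, sd = obs_se)
  # Half-normal model of H1: centred on 0, zero probability for negative effects.
  prior <- function(theta) 2 * dnorm(theta, mean = 0, sd = h1_sd)
  # Marginal likelihood under H1 (likelihood averaged over the prior) and under H0 (effect fixed at 0).
  m_h1 <- integrate(function(theta) likelihood(theta) * prior(theta), lower = 0, upper = Inf)$value
  m_h0 <- likelihood(0)
  m_h1 / m_h0  # B > 3 supports H1, B < 1/3 supports H0, values in between are insensitive
}

# Illustrative call with made-up numbers:
bf_half_normal(obs_mean = 0.10, obs_se = 0.20, h1_sd = 0.45)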
Results
Descriptive results of the CRT
As the first step of our analysis, we compared the descriptive results of the
protocol analysis with the most commonly reported descriptive patterns from
previous studies of the CRT. The data showed acceptable reliability as mea-
sured by Cronbach-alpha (0.64), which is comparable with the results of previ-
ously reported studies (Campitelli & Gerrans, 2014; Liberali et al., 2012; Primi
et al., 2015; Weller et al., 2013). While, in total, 28% of the responses were cor-
rect, the participants reported the “intuitive incorrect” answers and other
incorrect answers in 60% and in 8% of the cases, respectively, and gave up on
solving the problems in 4% of the cases. The proportion of different types of
answers showed considerable variance across the tasks of the CRT. Table 1
provides a summary of these findings. Both the solution rates and the propor-
tion of different types of answers were in line with the previous findings in the
literature (e.g., Primi et al., 2015). Our results were also consistent with
previous results regarding gender differences in the CRT performance (e.g.,
Frederick, 2005): the Mann–Whitney test indicated that males scored higher
(Mdn = 1) on the CRT than females (Mdn = 0), W = 5206, p = 0.003.
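For readers who want to reproduce this kind of comparison, a Mann–Whitney (Wilcoxon rank-sum) test of this form can be run in R as follows; the data-frame and column names are illustrative only, not the study’s actual variable names.

# Sketch of the gender comparison on CRT scores (Mann-Whitney / Wilcoxon rank-sum test).
wilcox.test(crt_score ~ gender, data = participants)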
Individual differences measures and the CRT performance
The first part of the follow-up online survey containing the numeracy and
reflectivity measures was returned by 206 out of the 210 participants while
195 individuals (93%) completed the second survey comprising the social
desirability scale. Appendix A.2 provides an overview of the descriptive statis-
tics of the used tests. Each analysis was run with all of the data available for
that test.

Table 1. The number and the proportion of answers per answer type.
Task | Correct answers | Intuitive incorrect answers | Other incorrect answers | Gave up
CRT1 | 44 (21%) | 150 (71%) | 5 (2%) | 11 (5%)
CRT2 | 46 (22%) | 130 (62%) | 28 (13%) | 6 (3%)
CRT3 | 87 (41%) | 98 (47%) | 18 (9%) | 7 (3%)
Total | 177 (28%) | 387 (60%) | 51 (8%) | 24 (4%)

⁶ In order to compute B, one has to model the predictions of the tested hypotheses. Since all of the hypotheses in the current study had directional predictions, following Dienes’s recommendations (2011, 2014), we modelled the alternative hypotheses with half-normal distributions with 0 probability for negative values. We applied two ways to determine the SD of the half-normal distributions. If we had information on the effect size of the alternative model, then we used it as the SD of the half-normal distribution. Otherwise, we estimated the maximum possible effect size of the alternative hypothesis and we applied the half of it as the SD of the half-normal distribution.

BNT showed significant correlation with the CRT performance, r =
0.49, p<0.001, and all the reflectivity measures (REI, AOT, SI, and BBS) also
correlated significantly with the CRT (Table 2). However, after controlling for
BNT, the partial correlation analysis showed that only REI, r(178) = 0.26, p < 0.001, and AOT, r(178) = 0.20, p = 0.007, retained a significant relation with the CRT (SI, r(178) = 0.03, p = 0.71; BBS, r(178) = 0.11, p = 0.13).
As a next step, we aimed to investigate the individual differences behind
good performance on the CRT. To do that, we built standard multiple regres-
sion models to assess the variables’ predictive ability on the CRT performance.
First, all the independent variables were entered into the model, then all the
statistically non-significant predictors were removed. Our final model, comprising BNT, b = 0.39, 95% CI [0.29, 0.48], t = 8.22, p < 0.001, and REI, b = 0.02, 95% CI [0.01, 0.03], t = 4.16, p < 0.001, fitted the data best, F(2,203) = 48.09, p < 0.001, adj. R² = 0.32.⁷
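As an illustration of this final model (a sketch with invented data-frame and column names, not the authors’ analysis script; footnote 7 adds that the estimates were checked with a bootstrap because the regression assumptions were not met):

# Sketch of the final multiple regression: CRT score predicted by numeracy (BNT)
# and REI rationality.
fit <- lm(crt ~ bnt + rei, data = d_survey)
summary(fit)   # unstandardised coefficients, t values, p values, adjusted R-squared
confint(fit)   # 95% confidence intervals for the coefficients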
Protocol analysis: exploring the ways individuals solve the CRT
Two raters, blind to our hypotheses, categorised the verbal reports using the
following coding system (Table 3). First, the answer of every individual on
each CRT task was classified as correct or incorrect. Then, a different cate-
gorisation procedure was applied for the correct and for the incorrect
answers. The coding system is summarised in Table 3 with some prototypical
examples from the bat and the ball problem. The result of the categorisation
procedure showed high inter-rater reliability, kappa = 0.83.
The correct answers were classified into the “Correct start” or the “Incorrect start” categories. All the cases where participants started their response with a line of thought which led to the correct answer (i.e., after reading the task, expressed a coherent sequence of mental steps that led her to the correct answer), or after reading a question immediately gave the correct answer, were categorised as “Correct start”. Otherwise, the cases where the participants started their response with an incorrect answer or with a line of thought which led to an incorrect answer, but later realised their failure, were labelled as “Incorrect start”.

Table 3. Categorisation of the verbal reports and the number of cases and individuals in each category.
Participants’ final answer | Basis of the categorisation | Categories | Definition of the categories | Example | No. of cases (no. of individuals)
Correct | What does the person start to say after reading out loud the task? | Correct start | Starting their response with the correct answer | “It’s 5 cents!” | 124 (86)
 | | | Starting their response with a thinking leading to the correct answer | “I see, this is an equation. Thus, if the ball equals to x, the bat equals to x plus 1…” |
 | | Incorrect start | Starting their response with an incorrect answer | “I would say 10 cents… But this cannot be true as it does not sum up to €1.10…” | 37 (34)
 | | | Starting their response with a thinking leading to an incorrect answer | “Let’s see! €1.10 minus €1 is 10 cents… Wait, that’s wrong! This should be solved as an equation…” |
Incorrect | What does the person say after reporting a first answer? | Reflective | Expressing doubt and re-performing original strategy | “…but I’m not sure… If together they cost €1.10, and the bat costs €1 more than the ball, the solution should be 10 cents. I’m done.” | 142 (106)
 | | Non-reflective | No reflection | “Ok, I’m done.” | 219 (136)

Table 2. Correlations of the main variables.
 | Berlin numeracy test | Rational-experiential inventory | Actively open-minded thinking | Semantic illusions | Belief bias syllogisms
CRT | 0.494 | 0.291 | 0.256 | 0.187 | 0.292
Berlin numeracy test | | 0.143 | 0.24 | 0.286 | 0.384
Rational-experiential inventory | | | 0.339 | 0.095 | 0.165
Actively open-minded thinking | | | | 0.206 | 0.242
Semantic illusions | | | | | 0.224
*p < 0.05; **p < 0.01.

⁷ The assumptions of the multiple regression were not met. A bootstrapping estimation of 10,000 samples confirmed the results of the regression analysis.
The incorrect responses were grouped as “Reflective” or as “Non-reflec-
tive”. Regarding the incorrect cases, the categorisation procedure focused on
whether the participant reflected or not after reporting a first answer. A case
was classified as “Non-reflective”, if the participant accepted the first answer
that came to her mind without any type of consideration, or simply echoed it.
Otherwise (e.g., when the participant tried to reframe the problem, re-per-
formed the original strategy, looked for alternative strategies or answers,
expressed doubt), the protocol was categorised as “Reflective”.
The data of one participant partially and the data of two individuals
completely were omitted, as the audio recordings of their trials were dam-
aged. The exclusion criterion was set before the experiment was conducted.
To minimise the noise in the results of the protocol analysis, all the cases where the raters did not agree about the grouping of the protocol were excluded. As a result, 76 additional cases (12%) were omitted from the subsequent
analyses. The cases where the participants gave correct and incorrect answers
were analysed separately according to the corresponding hypotheses.
Analysis of the correct cases
The protocol analysis of the correct answers suggests that the participants
performed a “Correct start” in 124 cases (77%) and showed an “Incorrect start” pattern only in 37 cases (23%). The “Correct start” pattern emerged as domi-
nant for all of the CRT items (see Appendix B.1.1); however, it was most
robustly expressed for the “lily pads” task. Note that the individual protocols
formed the bases of the analysis.
To test the validity of this result, further analyses were conducted. First, we
tested the hypothesis that the average final response time (FRT) in the “Incor-
rect start” group is longer than in the “Correct start” group. The rationale behind this thinking is that those in the “Incorrect start” group need to per-
form extra mental operations compared to those who started their response
with the correct answer or with a line of thought leading to the correct
answer. In this study, FRT was operationalised as latency between the points
at which the participants finished reading aloud the tasks and when they indi-
cated that their final answer had been given. Log transformation was con-
ducted to correct for the deviations from the normal distribution on FRT data.
These log-transformed data were used in the comparison of several linear
mixed random-effects models.⁸

⁸ We used the glmer and lmer functions from the lme4 package in R for the mixed-effect analyses (Bates, Maechler, Bolker, & Walker, 2015). The corresponding t statistics reported are based on the result of Wald t tests.

The base-model contained only the
participants’ ID as a random intercept regressed on FRT. In the second model,
a random intercept was specified for each of the CRT items. As a result, the model fit increased significantly, χ²(1) = 15.41, p < 0.001. In the third model, group membership (“Correct start” vs. “Incorrect start”) was added as a fixed effect which significantly increased the model fit, χ²(1) = 52.37, p < 0.001. The analysis revealed that the FRT was significantly higher in the “Incorrect start” group than in the “Correct start” group, b = 1.02, 95% CI [0.77, 1.29], t(158.81) = 7.91, p < 0.001.
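As a rough illustration of this model-comparison procedure (a sketch rather than the authors’ actual script; the data frame d and its columns log_frt, participant, crt_item and group are invented names), the nested models could be fitted and compared with lme4 along these lines:

# Sketch of the nested mixed-model comparison on log-transformed final response times.
library(lme4)

m0 <- lmer(log_frt ~ 1 + (1 | participant), data = d, REML = FALSE)
m1 <- lmer(log_frt ~ 1 + (1 | participant) + (1 | crt_item), data = d, REML = FALSE)
m2 <- lmer(log_frt ~ group + (1 | participant) + (1 | crt_item), data = d, REML = FALSE)

# Likelihood-ratio tests between the nested models give chi-squared statistics
# analogous to those reported in the text.
anova(m0, m1, m2)

# Fixed-effect estimate for the group difference and Wald confidence intervals.
summary(m2)
confint(m2, method = "Wald")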
For the purposes of the current study, we defined RT as the time interval
that happened between the end of the task-reading and the onset of the for-
mulation of the individual’s answer. Assuming that any deliberative process
is expressed in terms of thinking times, if people in the “Correct start” group
also started their reasoning process with an incorrect answer or with a line of
thought which led to an incorrect answer and suppressed this first thought
before starting to articulate their answer, their RT should be longer than the
RT of the “Incorrect start” group. This would indicate the presence of a con-
found in our data. To test this hypothesis, we built a linear mixed random-
effect model and conducted model comparisons in the same way for RT as
we did for FRT above. We found that neither the CRT items increased the fit
of the model significantly, nor did the fixed effect of the group membership.
Additionally, we calculated B to determine whether this null result implies data-insensitivity or provides evidence for the null hypothesis. The analysis yielded B_H(0, 1.63) = 0.28, indicating evidence for the null.⁹,¹⁰ Thus, we found
no difference in RT between the “Incorrect start” and the “Correct start”
groups.
People ranking higher on the social desirability scale may be less likely to
verbalise the first answer that comes to mind in case it is incorrect. As this
could result in a possible confound in our findings, we tested the hypothesis
that individuals in the “Correct start” group score higher on the BIDR than people in the “Incorrect start” group. We compared mixed random-effect
logistic regression models where the group membership was the outcome
variable. First, we specified random intercepts for each participant and then
for each CRT item. This latter effect did not significantly increase the fit of the
model. In the last step, BIDR was stepped into the model, but we found no
evidence that the groups differ in Social Desirability. The Bayesian analysis
further supported that BIDR does not predict the group membership of the participants, B_H(0, 0.45) = 0.015.¹¹

⁹ “H” indicates that we applied a half-normal distribution to model the predictions of the alternative hypothesis. The first number in the bracket displays the centre of the distribution, and the second indicates the SD of the distribution.
¹⁰ We assumed that the effect size of H1 cannot be bigger than the average RT of the group with longer RT. Consequently, the average RT in the “Correct start” group was taken as an estimate of the maximum effect size of H1. The half of its value was employed as the SD of the model.
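The logistic counterpart used for these group-membership analyses can be sketched in the same way (again with invented object and column names; the outcome is the “Correct start” vs. “Incorrect start” classification and BIDR is the predictor):

# Sketch of the mixed-effect logistic regressions on group membership.
library(lme4)

g0 <- glmer(group ~ 1 + (1 | participant), data = d_correct, family = binomial)
g1 <- glmer(group ~ 1 + (1 | participant) + (1 | crt_item), data = d_correct, family = binomial)
g2 <- glmer(group ~ bidr + (1 | participant) + (1 | crt_item), data = d_correct, family = binomial)

# Likelihood-ratio tests check whether the CRT-item intercepts and then BIDR
# improve the fit, mirroring the stepwise comparisons described above.
anova(g0, g1, g2)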
Analysis of the incorrect cases
The protocol analysis of the incorrect answers aimed to explore whether there
are people who check the first answer that comes to their mind but still fail to
solve the task. The data suggest that in 142 of the 361 cases (39%), people
engaged in some kind of reflective behaviour after reporting their first
answer, while in 219 cases (61%) people accepted the first answer that they
reported without any further deliberation. We observed a similar pattern for
all the CRT items (see Appendix B.1.2).
Based on the definition of the “Reflective” and “Non-reflective” groups, one would expect that FRT in the “Non-reflective” group is shorter than in the “Reflective” group. To test this assumption, we again compared linear
mixed random-effect models. The model comparison method followed the
procedure introduced above. The base-model contained random intercept
for each participant. Then, random intercept was added for the CRT items,
which significantly increased the fit of the model, χ²(1) = 13.31, p < 0.001. Finally, group membership was added as a fixed effect. We found that the group membership variable significantly increased the fit of the model, χ²(1) = 91.63, p < 0.001. The analysis revealed that people in the “Reflective” group spent significantly more time on solving the problems than people in the “Non-reflective” group, b = 0.73, 95% CI [0.59, 0.87], t(349.6) = 10.24, p < 0.001.
Individual differences as predictors of task solution¹²
We hypothesised that more numerate individuals start their thinking with cor-
rect strategies or have correct intuitions on the CRT more often than their low
numeracy counterparts. We compared mixed random-effect logistic regres-
sion models to test whether group membership (“Correct start” vs. “Incorrect
start”) is predicted by BNT performance. In the first model, we specified a ran-
dom intercept for each participant. The CRT item variable being stepped into
the model as a random factor did not increase the model fit, nor did BNT per-
formance yield a significant effect. We calculated B in order to test whether the data supported the null-hypothesis. The analysis resulted in B_H(0, 0.45) = 0.62, suggesting that the data obtained are not sensitive enough to permit a
conclusion.¹³ It has to be added that our data showed a ceiling effect on BNT among the correct responders, which is not surprising given that CRT tasks are highly difficult. Taken together, these findings do not allow us to draw any inference regarding our hypothesis.

¹¹ As there was no previous study examining the predictive power of BIDR on the CRT performance, we applied the predictive power of the BNT as a rough estimate for the maximum effect size of H1. Thus, the half of this value was employed as the SD of the model.
¹² Although we did not formulate specific hypotheses, Appendix B.2 depicts the means and standard deviations of all the individual differences measures (BNT, AOT, REI, BBS, SI, BIDR) across the different categories created in the protocol analyses.
Our last hypothesis predicted that people in the “Reflective” group score higher on the REI scale than the members of the “Non-reflective” group. To
test this idea, we built mixed random-effect logistic regression models. First, we added a random intercept for each participant, in a model with
group membership as the criterion variable. Adding random intercepts for
the individual CRT items did not increase the model fit significantly. Adding
REI as a fixed-effect predictor failed to increase model fit significantly. The
result of the corresponding Bayes factor analysis indicated that the obtained data is not sensitive enough to permit a conclusion,¹⁴ B_H(0, 0.03) = 0.80.
Discussion
The findings of this study deepen our understanding about how people solve
the CRT and about the abilities needed for its correct solution. The results sug-
gest that there are individuals who start their response with the correct
answer or with a line of thought which led to the correct answer when solving
the CRT tasks. Mata et al. (2013, Study 5) explicitly asked the participants after
solving the modified version of the bat and the ball problem whether the typ-
ically incorrect solution came to their mind while thinking about the task. As
we did, they also found that correct responders had not thought of the “intui-
tive response” in a noteworthy number of cases (28%),¹⁵ which can be inter-
preted as the proportion of the “Correct start” individuals. Cokely and Kelley
(2009), based on the findings of their protocol analysis, also argued that the
significance of early selection control mechanisms is underestimated in the
decision literature. However, these results provide empirical evidence that
the early selection processes may play an important role in solving the CRT.
The finding that the majority of the correct responders started their
response with the correct answer or with a line of thought which led to the
correct answer raises questions regarding the usage of the CRT as a pure measurement of the ability to override the first “intuitive response”.

¹³ The predictive power of the BNT for giving the right answer on the CRT was taken as the maximum of the expected effect size for H1, and so the half of this value was employed as the SD of the model.
¹⁴ We took the maximum expected effect size from a model where REI predicted the accuracy of the answer for H1. The half of its value was employed as the SD of the model.
¹⁵ Compared to our findings, the relatively low proportion of ‘Correct start’ cases could have been caused by several differences between the two experimental designs. First, unlike us, the authors used the modified bat and ball problem. Additionally, the authors did not control for the time-course assumption of the answers, which is crucial regarding our theoretical question, as it is possible that those who indicated awareness of the ‘intuitive’ response may have started to think with a correct strategy, and the incorrect solution came to their mind only later. Finally, their results are based on participants’ self-reports after solving the task and not on verbal protocols.

In addi-
tion, our correlational results further support that the late suppression mecha-
nism may not be the only feature of reflectivity in the CRT. We have found
that the REI and the AOT were the best predictors of the CRT performance
above the BNT, and not the reflectivity measures which theoretically build
upon the preconception of the suppression of a first “intuitive answer” (BBS,
SI). Cokely and Kelley (2009) found that the quantity of the verbalised reason-
ing in risky decision-making tasks was related to CRT performance. Campitelli
and Labollita (2010) have found that individuals who solved more CRT tasks
possessed more general knowledge and used more detailed heuristic cues.
Cokely, Parpart, and Schooler (2009) demonstrated that more reflective indi-
viduals provided more normatively justifiable judgements in environments
where multiple diagnostic cues were available; however, they also relied
more on heuristic processes when there was no diagnostic cue available.
Additionally, Baron et al. (2014) observed that the predictive power of the
CRT does not stem from the disposition to overcome an initial intuition in
moral judgements. In line with previous results, our findings support the view
that the definition of reflectivity – at least when it is operationalised by the CRT – should not be restricted to the description of the ability or disposition
to override gut feelings, but instead a broader RI account of reflectivity should
be used embracing the general preference for speed over accuracy.
Stanovich, Toplak, and West (2008) suggested a general framework to
understand rational thinking errors in heuristics and biases tasks. Their classifi-
cation embraces two different kinds of causes that may be behind the think-
ing failures. The first cause is rooted in the individuals’ tendency to use
heuristic-processing mechanisms (Simon, 1956; Stanovich et al., 2008; Tversky
& Kahneman, 1974). The heuristics and biases tasks are designed to trigger
automatic but incorrect responses, which can lead individuals to report this
incorrect answer as it is of low computational expense. The second cause is
called the mindware problem (Perkins, 1995); it stems from the fact that indi-
viduals lack the declarative knowledge and strategic rules that are needed to
solve some problems. Consequently, even when individuals put considerable
mental effort into the problem-solving process, the lack of this necessary
knowledge can lead to thinking failures (Stanovich et al., 2008).
The CRT is believed to assess “people’s tendency to answer questions with
the first idea that comes to their mind without checking it” (Kahneman, 2011,
p. 65). Toplak et al. argued (2011, 2014) that incorrect responding on the CRT
is not a result of a mindware problem, but rather that of miserly processing.
In a recent review, Pennycook, Fugelsang, and Koehler (2015) considered the
role of cognitive abilities “rather rudimentary" (2015, p. 426). However, we
found that many reasoners are not able to come to the right solution in the
CRT even if they reflect on their first answer. Consequently, the mindware
problem should be considered as one of the reasons people make errors on
THINKING & REASONING 17
the CRT tasks. Meyer et al.’s (2015) work also supports our findings in this
regard. The authors used four different kinds of manipulation to make people
reflect on the bat and the ball problem and found that throughout all condi-
tions a significant number of people still reported an incorrect response. Their
results also suggest that the tendency to fail the task can be caused either by
“hopeless” (low ability) or by “careless” (high ability, low reflectivity) behaviou-
ral patterns. A recent study of Szollosi, Bago, Szaszi, and Aczel (in press) brings
further evidence to this hypothesis: their results showed that many of the par-
ticipants who failed to solve the bat and ball problem reported that they had
verified their answer, which can be interpreted as an indication of deliberative
thinking. Additionally, our finding converges with others in the literature
showing that a period of reflection does not necessarily produce beneficial
results (Stanovich et al., 2008; Thompson, Turner, & Pennycook, 2011; Thomp-
son et al., 2013). This result raises serious concerns about the usage of the
CRT as a measure of cognitive miserliness and warns that whenever the CRT
is used in correlational studies, researchers have to take into consideration
whether the lack of miserliness or the mindware problem could have caused
the effect as the failure on the CRT tasks can be caused by both.
The responses in the CRT are often grouped into “intuitive incorrect”,
“non-intuitive correct”, and “non-intuitive incorrect” categories (e.g.,
Pennycook, Cheyne, Koehler, & Fugelsang, 2015). More importantly, many
studies make central conclusions from the hypotheses built on this classifica-
tion (e.g., Böckenholt, 2012; Brosnan, Hollinworth, Antoniadou, & Lewton, 2014; Piazza & Sousa, 2013; Sinayev & Peters, 2015). Although our study did
not focus on the question of whether a response was intuitive or deliberative
(Evans, 2003, 2009), the results of the protocol analysis suggest that partici-
pants deliberated after articulating a first response in 39% of the trials where
they reported an “incorrect intuitive” final response. Note that we do not
mean to speculate on whether the first response was generated by intuition
or deliberation, but we argue that many of the reasoners engaged in some
form of reflection despite eventually reporting the “intuitive incorrect”
answer. As a consequence, the classification based only on the final answer
to indicate deliberative tendencies yields a contaminated measure that could
lead to biased results. Our conclusion here is in line with previous research
(e.g., Elqayam & Evans, 2011; Thompson & Johnson, 2014; Thompson et al.,
2011): solely based on the normativity of the responses, one cannot infer
whether the answer was the output of Type 1 or Type 2 processes (Evans &
Stanovich, 2013), or whether the decision-maker engaged in deliberation or not. Our
results indicate that before building on the conclusions of the studies using
the original classification schema, more scientific examination would be
needed to investigate the validity and the reliability of the intuitive/delibera-
tive categories.
In accord with previous findings (e.g., Campitelli & Gerrans, 2014; Del Missier, Mäntylä, & Bruin, 2012; Pennycook & Ross, 2016), our results support the
idea that both reflective ability and numeracy account for the performance in
the CRT. Consequently, we suggest that whenever the CRT is used as a stand-
alone individual differences measure, one should draw only careful conclu-
sions about the reasons behind any correlations found (see also, Aczel, Bago,
Szollosi, Foldes, & Lukacs, 2015), as there is no simple way to tell whether
numerical abilities or the reflective disposition are causing the effect.
However, the methodological difficulty in the dissociation of numeracy
and reflectivity is rooted deeper than the reliability of the tests. Those who
have better numerical abilities might have richer and more accurate intuitions
(Pachur & Spaar, 2015; Peters, 2012; Reyna et al., 2009; Thompson & Johnson,
2014), or use early controlled processes (Jacoby, Shimizu, Daniels, & Rhodes,
2005; Peters, 2012), which could lead them to more accurate responding
without being reflective in reflectivity tests that are based on numerical tasks.
At the same time, low numeracy can lead to low scoring even for the highly
reflective individuals (see also the mindware problem). Similarly, in numeracy
tests, high reflectivity can lead people to put more effort into the problem-
solving procedure resulting in more correct responses (Ghazal, Cokely, &
Garcia-Retamero, 2014), but low reflectivity can have a detrimental effect on
performance.¹⁶
As a consequence, whenever researchers aim to assess reflec-
tivity with numerical test-based assessment tools, they have to be careful
about the interpretation of the findings, as it is not possible to determine
only by examining the accuracy measures whether numeracy or reflectivity
lead to a correct/incorrect response. However, this conclusion is not specific
to the numerical domain (Szaszi, 2016), but holds true for any domain-specific
reflectivity test where additional thinking effort increases the probability of
successful responding (for a similar argument, see Baron, Badgio, & Gaskins,
1986).
Fox et al. (2011) outlined that verbal protocols “do not assure a complete
record of the participants’ thoughts” (p. 338). Consequently, one limitation of
our thinking aloud study is that we cannot exclude the possibility that some
of those who apparently started their response with the correct answer or
with a line of thought which led to the correct answer did not perceive any
other response option. Although the RT measure supported the idea that the
“Correct start” group does not need to inhibit a first answer before starting to
verbalise their response, there are alternative explanations that cannot be
ruled out in our experimental design. First, RT is a valid measure to diagnose
¹⁶ Working memory (WM) differences can bring additional complexity in the equation: people with higher working memory span are thought to be more numerate (Peters, Dieckmann, Dixon, Hibbard, & Mertz, 2007; Reyna, Nelson, Han, & Dieckmann, 2009), but they may find the cost of additional thinking lower than their low WM counterparts (Stupple, Gale, & Richmond, 2013).
how much thinking is being done, but it is less reliable in determining how
many mental operations are occurring. Additionally, one can assume that
“Correct start” individuals are more cognitively able than people in the “Incor-
rect start” group. Taken as a whole, it is possible that “Correct start” people
suppress their first answer and generate a new answer or strategy in the
same time-frame as “Incorrect start”responders generate their first answer.
Finally, it is possible that “Correct start” reasoners considered the “intuitive response” during the reading phase, and if so, our RT measure would not be a
sensitive measure of it.
It has been argued that reflectivity is a key individual differences dimen-
sion predicting rational errors in heuristics and biases tasks (e.g., Stanovich
et al., 2008; Toplak et al., 2011) and in diverse everyday situations (Pennycook
et al., 2015). Our study aimed to enhance our knowledge of the CRT, as it is
the most widely used behavioural measure of reflectivity. In sum, we
observed that there are several ways people can solve or fail the test. Importantly, some individuals started their response with the correct answer or with a line of thought which led to the correct answer, while others failed to solve the CRT tasks even when they reflected on them. Additionally, the current results suggest that the CRT measures a general preference for speed over accuracy rather than just individuals' ability to suppress a first "intuitive answer". In our view, the CRT is a useful and important measurement tool
of reflectivity. However, this study raises doubts about the validity of the stud-
ies that build on the CRT as a simple measure of analytical thinking, since the
use of the CRT as a standalone predictor can easily lead to the overestimation
of the role of reflectivity and the underestimation of the role of numerical
ability in decision performance. As the CRT tasks are pivotal examples in sev-
eral dual-process models of reasoning and decision-making, the implications
of our findings go beyond the CRT as a measurement tool. Our conclusions about the processes and abilities involved in the CRT can be used to better apprehend the "whys" and "whens" (De Neys & Bonnefon, 2013) of the decision errors in heuristics and biases tasks and to further refine existing explanatory models.
Acknowledgments
We would like to thank Árpád Völgyesi for running the verbal protocols, Bence Bago and Zoltan Kekecs for their helpful comments on the analysis, Melissa Wood for proofreading the manuscript and Melinda Szászi-Szrenka for her supporting patience throughout the study.
Disclosure statement
No potential conflict of interest was reported by the authors.
Funding
This work was supported by the doctoral scholarship of Eötvös Loránd University, and by the "Pallas Athéné Domus Animae Alapítvány". Aba Szollosi was supported by the "Nemzet Fiatal Tehetségeiért" Scholarship [NTP-NFTÖ-16-1184].
ORCID
B. Szaszi http://orcid.org/0000-0001-7078-2712
A. Szollosi http://orcid.org/0000-0003-3457-542X
B. Palfi http://orcid.org/0000-0002-6739-8792
B. Aczel http://orcid.org/0000-0001-9364-4988
References
Aczel, B., Bago, B., Szollosi, A., Foldes, A., & Lukacs, B. (2015). Measuring individual dif-
ferences in decision biases: Methodological considerations. Frontiers in Psychology,
6, 1770.
Baron, J. (1993). Why teach thinking? An essay. Applied Psychology, 42(3), 191–214.
Baron, J., Badgio, P., & Gaskins, I. W. (1986). Cognitive style and its improvement: A nor-
mative approach. Advances in the Psychology of Human Intelligence, 3, 173–220.
Baron, J., Scott, S., Fincher, K., & Metz, S. E. (2014). Why does the cognitive reflection
test (sometimes) predict utilitarian moral judgment (and other things)? Journal of
Applied Research in Memory and Cognition, 4, 265–284
Barton, S. B., & Sanford, A. J. (1993). A case study of anomaly detection: Shallow seman-
tic processing and cohesion establishment. Memory & Cognition, 21(4), 477–487.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Lme4: Linear mixed-effects mod-
els using Eigen and S4. R Package, version 1.1-8. Retrieved from https://cran.r-proj
ect.org/web/packages/lme4/index.html
Böckenholt, U. (2012). The cognitive-miser response model: Testing for intuitive and deliberate reasoning. Psychometrika, 77(2), 388–399.
Brandstätter, E., & Gussmack, M. (2013). The cognitive processes underlying risky choice. Journal of Behavioral Decision Making, 26(2), 185–197.
Brosnan, M., Hollinworth, M., Antoniadou, K., & Lewton, M. (2014). Is empathizing intuitive and systemizing deliberative? Personality and Individual Differences, 66, 39–43.
Campitelli, G., & Gerrans, P. (2014). Does the cognitive reflection test measure cognitive
reflection? A mathematical modeling approach. Memory & Cognition, 42(3), 434–
447.
Campitelli, G., & Labollita, M. (2010). Correlations of cognitive reflection with judg-
ments and choices. Judgment and Decision Making, 5(3), 182–191.
Cokely, E. T., Galesic, M., Schulz, E., Ghazal, S., & Garcia-Retamero, R. (2012). Measuring
risk literacy: The Berlin numeracy test. Judgment and Decision Making, 7(1), 25–47.
Cokely, E. T., & Kelley, C. M. (2009). Cognitive abilities and superior decision making
under risk: A protocol analysis and process model evaluation. Judgment and Deci-
sion Making, 4(1), 20–33.
Cokely, E. T., Parpart, P., & Schooler, L. J. (2009). On the link between cognitive control
and heuristic processes. In N. A. Taatgnen & H. Van Rijn (Eds.), Proceedings of the
31th annual conference of the cognitive science society (pp. 2926–2931). Austin, TX:
Cognitive Science Society.
Del Missier, F., Mäntylä, T., & Bruin, W. B. (2012). Decision-making competence, executive functioning, and general cognitive abilities. Journal of Behavioral Decision Making, 25(4), 331–351.
De Neys, W., & Bonnefon, J.-F. (2013). The "whys" and "whens" of individual differences in thinking biases. Trends in Cognitive Sciences, 17(4), 172–178.
De Neys, W., & Glumicic, T. (2008). Conflict monitoring in dual process theories of think-
ing. Cognition, 106(3), 1248–1299.
De Neys, W., Moyens, E., & Vansteenwegen, D. (2010). Feeling we’re biased: Autonomic
arousal and reasoning conflict. Cognitive, Affective, & Behavioral Neuroscience, 10(2),
208–216.
Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific
and statistical inference. New York, NY: Palgrave Macmillan.
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspec-
tives on Psychological Science, 6(3), 274–290.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology, 5, 781.
Elqayam, S., & Evans, J. S. B. (2011). Subtracting "ought" from "is": Descriptivism versus normativism in the study of human thinking. Behavioral and Brain Sciences, 34(5), 233–248.
Erickson, T. D., & Mattson, M. E. (1981). From words to meaning: A semantic illusion.
Journal of Verbal Learning and Verbal Behavior, 20(5), 540–551.
Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological Review, 87(3),
215–251.
Evans, J. S. B. (1996). Deciding before you think: Relevance and reasoning in the selec-
tion task. British Journal of Psychology, 87(2), 223–240.
Evans, J. S. B. (2003). In two minds: Dual-process accounts of reasoning. Trends in Cogni-
tive Sciences, 7(10), 454–459.
Evans, J. S. B. (2009). How many dual-process theories do we need? One, two, or many?
In In two minds: Dual processes and beyond (pp. 33–54). New York, NY: Oxford Uni-
versity Press.
Evans, J. S. B., & Ball, L. J. (2010). Do people reason on the Wason selection task: A new
look at the data of Ball et al. (2003). Quarterly Journal of Experimental Psychology, 63
(3), 434–441.
Evans, J. S. B., Barston, J. L., & Pollard, P. (1983). On the conflict between logic and belief
in syllogistic reasoning. Memory & Cognition, 11(3), 295–306.
Evans, J. S. B., & Stanovich, K. E. (2013). Dual-process theories of higher cognition
advancing the debate. Perspectives on Psychological Science, 8(3), 223–241.
Fific, M. (2014). Double jeopardy in inferring cognitive processes. Frontiers in Psychol-
ogy, 5, 1130.
Finucane, M. L., & Gullion, C. M. (2010). Developing a tool for measuring the decision-
making competence of older adults. Psychology and Aging, 25(2), 271–288.
Fox, M. C., Ericsson, K. A., & Best, R. (2011). Do procedures for verbal reporting of think-
ing have to be reactive? A meta-analysis and recommendations for best reporting
methods. Psychological Bulletin, 137(2), 316–344.
Frederick, S. (2005). Cognitive reflection and decision making. The Journal of Economic
Perspectives, 19(4), 25–42.
Gervais, W. M., & Norenzayan, A. (2012). Analytic thinking promotes religious disbelief.
Science, 336(6080), 493–496.
Ghazal, S., Cokely, E. T., & Garcia-Retamero, R. (2014). Predicting biases in very highly
educated samples: Numeracy and metacognition. Judgment and Decision Making, 9
(1), 15–34.
Haran, U., Ritov, I., & Mellers, B. A. (2013). The role of actively open-minded thinking in
information acquisition, accuracy, and calibration. Judgment and Decision Making, 8
(3), 188–201.
Jacoby, L. L., Kelley, C. M., & McElree, B. D. (1999). The role of cognitive control: Early
selection versus late correction. In S. Chaiken & Y. Trope (Eds.), Dual-process theories
in social psychology (pp. 383–400). New York, NY: Guilford.
Jacoby, L. L., Shimizu, Y., Daniels, K. A., & Rhodes, M. G. (2005). Modes of cognitive con-
trol in recognition and source memory: Depth of retrieval. Psychonomic Bulletin &
Review, 12(5), 852–857.
Jeffreys, H. (1961). The theory of probability. Oxford: Oxford University Press.
Johnson, E. D., Tubau, E., & De Neys, W. (2016). The doubting system 1: Evidence for
automatic substitution sensitivity. Acta Psychologica, 164,56–64.
Kagan, J., Rosman, B. L., Day, D., Albert, J., & Phillips, W. (1964). Information processing
in the child: Significance of analytic and reflective attitudes. Psychological Mono-
graphs: General and Applied, 78(1), 1–37.
Kahneman, D. (2011). Thinking, fast and slow. New York, NY: Farrar, Straus, and Giroux.
Låg, T., Bauger, L., Lindberg, M., & Friborg, O. (2014). The role of numeracy and intelligence in health-risk estimation and medical data interpretation. Journal of Behavioral Decision Making, 27(2), 95–108.
Liberali, J. M., Reyna, V. F., Furlan, S., Stein, L. M., & Pardo, S. T. (2012). Individual differ-
ences in numeracy and cognitive reflection, with implications for biases and falla-
cies in probability judgment. Journal of Behavioral Decision Making, 25(4), 361–381.
Mata, A., Ferreira, M. B., & Sherman, S. J. (2013). The metacognitive advantage of delib-
erative thinkers: A dual-process perspective on overconfidence. Journal of Personal-
ity and Social Psychology, 105(3), 353–355.
Mata, A., Schubert, A.-L., & Ferreira, M. B. (2014). The role of language comprehension
in reasoning: How “good-enough”representations induce biases. Cognition, 133(2),
457–463.
Meyer, A., Spunt, R., & Frederick, S. (2015). The bat and ball problem. Unpublished
manuscript.
Pachur, T., & Spaar, M. (2015). Domain-specific preferences for intuition and delibera-
tion in decision making. Journal of Applied Research in Memory and Cognition, 4(3),
303–311.
Pacini, R., & Epstein, S. (1999). The relation of rational and experiential information
processing styles to personality, basic beliefs, and the ratio-bias phenomenon. Jour-
nal of Personality and Social Psychology, 76(6), 972–987.
Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. R.
Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological
attitudes (pp. 17–59). San Diego, CA: Academic Press.
Paxton, J. M., Ungar, L., & Greene, J. D. (2012). Reflection and reasoning in moral judg-
ment. Cognitive Science, 36(1), 163–177.
Pennycook, G., Cheyne, J. A., Koehler, D. J., & Fugelsang, J. A. (2015). Is the cognitive
reflection test a measure of both reflection and intuition? Behavior Research Meth-
ods, 48(1), 341–348.
Pennycook, G., Cheyne, J. A., Seli, P., Koehler, D. J., & Fugelsang, J. A. (2012). Analytic
cognitive style predicts religious and paranormal belief. Cognition, 123(3), 335–346.
Pennycook, G., Fugelsang, J. A., & Koehler, D. J. (2015). Everyday consequences of ana-
lytic thinking. Current Directions in Psychological Science, 24(6), 425–432.
Pennycook, G., & Ross, M. R. (2016). Commentary: Cognitive reflection vs. calculation in decision making. Frontiers in Psychology, 7, 9. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4722428/
Perkins, D. (1995). Outsmarting IQ: The emerging science of learnable intelligence. New
York, NY: Free Press.
Peters, E. (2012). Beyond comprehension the role of numeracy in judgments and deci-
sions. Current Directions in Psychological Science, 21(1), 31–35.
Peters, E., Dieckmann, N., Dixon, A., Hibbard, J. H., & Mertz, C. K. (2007). Less is more in
presenting quality information to consumers. Medical Care Research and Review, 64
(2), 169–190.
Piazza, J., & Sousa, P. (2013). Religiosity, political orientation, and consequentialist
moral thinking. Social Psychological and Personality Science, 5(3), 334–342.
Primi, C., Morsanyi, K., Chiesi, F., Donati, M. A., & Hamilton, J. (2015). The development
and testing of a new version of the cognitive reflection test applying item response
theory (IRT). Journal of Behavioral Decision Making, 29. doi:10.1002/bdm.1883
Reisen, N., Hoffrage, U., & Mast, F. W. (2008). Identifying decision strategies in a con-
sumer choice situation. Judgment and Decision Making, 3(8), 641–658.
Reyna, V. F., Nelson, W. L., Han, P. K., & Dieckmann, N. F. (2009). How numeracy influen-
ces risk comprehension and medical decision making. Psychological Bulletin, 135(6),
943–973.
Simon, H. A. (1956). Rational choice and the structure of the environment. Psychological
Review, 63(2), 129–138.
Sinayev, A., & Peters, E. (2015). Cognitive reflection vs. calculation in decision making.
Frontiers in Psychology, 6, 532. doi:10.3389/fpsyg.2015.00532
Stanovich, K. E., Toplak, M. E., & West, R. F. (2008). The development of rational thought:
A taxonomy of heuristics and biases. Advances in Child Development and Behavior,
36, 251–285.
Stupple, E. J., Ball, L. J., & Ellis, D. (2013). Matching bias in syllogistic reasoning: Evidence
for a dual-process account from response times and confidence ratings. Thinking &
Reasoning, 19(1), 54–77.
Stupple, E. J., Gale, M., & Richmond, C. R. (2013). Working memory, cognitive miserli-
ness and logic as predictors of performance on the cognitive reflection test. In M.
Knauff, M. Pauen, N. Sebanz, & I. Wachsmuth (Eds.), Proceedings of the 35th annual
conference of the cognitive science society (pp. 1396–1401). Austin, TX: Cognitive Sci-
ence Society.
Svedholm-Häkkinen, A. M. (2015). Highly reflective reasoners show no signs of belief inhibition. Acta Psychologica, 154, 69–76.
Szaszi, B. (2016). The role of expertise and preference behind individuals' tendency to use intuitive decision style. Journal of Applied Research in Memory and Cognition, 5(3), 329–330.
Szollosi, A., Bago, B., Szaszi, B., & Aczel, B. (in press). Exploring the determinants of con-
fidence in the bat-and-ball problem.
Thomson, K. S., & Oppenheimer, D. M. (2016). Investigating an alternate form of the
cognitive reflection test. Judgment and Decision Making, 11(1), 99–113.
Thompson, V. A., & Johnson, S. C. (2014). Conflict, metacognition, and analytic thinking.
Thinking & Reasoning, 20(2), 215–244.
Thompson, V. A., Turner, J. P., & Pennycook, G. (2011). Intuition, reason and metacogni-
tion. Cognitive Psychology, 63(3), 107–140.
Thompson, V. A., Turner, J. P., Pennycook, G., Ball, L. J., Brack, H., Ophir, Y., & Ackerman,
R. (2013). The role of answer fluency and perceptual fluency as metacognitive cues
for initiating analytic thinking. Cognition, 128(2), 237–251.
Toplak, M. E., West, R. F., & Stanovich, K. E. (2011). The Cognitive Reflection Test as a
predictor of performance on heuristics-and-biases tasks. Memory & Cognition, 39(7),
1275–1289.
Toplak, M. E., West, R. F., & Stanovich, K. E. (2014). Assessing miserly information proc-
essing: An expansion of the Cognitive Reflection Test. Thinking & Reasoning, 20(2),
147–168.
Tor, A., & Bazerman, M. H. (2003). Focusing failures in competitive environments: Explaining decision errors in the Monty Hall game, the acquiring a company problem, and multiparty ultimatums. Journal of Behavioral Decision Making, 16(5), 353–374.
Travers, E., Rolison, J. J., & Feeney, A. (2016). The time course of conflict on the cogni-
tive reflection test. Cognition, 150, 109–118.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases.
Science, 185(4157), 1124–1131.
Wason, P. C., & Evans, J. S. B. (1975). Dual processes in reasoning? Cognition, 3(2), 141–
154.
Weller, J. A., Dieckmann, N. F., Tusler, M., Mertz, C. K., Burns, W. J., & Peters, E. (2013).
Development and testing of an abbreviated numeracy scale: A Rasch analysis
approach. Journal of Behavioral Decision Making, 26(2), 198–212.
Welsh, M., Burns, N., & Delfabbro, P. (2013). The Cognitive Reflection Test: How much
more than numerical ability. In M. Knauff, M. Pauen, N. Sebanz, & I. Wachsmuth
(Eds.), Proceedings of the 35th annual conference of the cognitive science society
(pp. 1396–1401). Austin, TX: Cognitive Science Society.
Appendices
Appendix 1
A.1. Materials used
A.1.1. Actively open-minded thinking scale
1. Allowing oneself to be convinced by an opposing argument is a sign of good character
2. People should take into consideration evidence that goes against their beliefs
3. People should revise their beliefs in response to new information or evidence
4. Changing your mind is a sign of weakness
5. Intuition is the best guide in making decisions
6. It is important to persevere in your beliefs even when evidence is brought to bear against them
7. One should disregard evidence that conflicts with one’s established beliefs
8. People should search actively for reasons why their beliefs might be wrong
9. When we are faced with a new question, the first answer that occurs to us is usually best
10. When faced with a new question, we should consider more than one possible answer before
reaching a conclusion
11. When faced with a new question, we should look for reasons why our first answer might be
wrong, before deciding on an answer
Note. Items 1–8 were published by Haran et al. (2013). Items 9–11 were provided through personal
communication by Jonathan Baron. Reverse scored items: 4, 5, 6, 7, 9.
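As a reading aid, the minimal R sketch below illustrates how the scale could be scored with the reverse-keyed items recoded before summing. It assumes an 11-item, 1–7 response format (consistent with the theoretical range of 11–77 reported in Appendix A.2); the function name and the example ratings are illustrative only and are not taken from the authors' materials.

# Minimal sketch of AOT scoring, not the authors' scoring script.
# Assumes each of the 11 items is rated on a 1-7 scale; items 4, 5, 6, 7 and 9 are reverse scored.
reverse_items <- c(4, 5, 6, 7, 9)

score_aot <- function(ratings) {
  stopifnot(length(ratings) == 11, all(ratings >= 1 & ratings <= 7))
  ratings[reverse_items] <- 8 - ratings[reverse_items]  # flip reverse-keyed items
  sum(ratings)                                          # total score, theoretical range 11-77
}

# Illustrative respondent who endorses the open-minded items and rejects the reverse-keyed ones
score_aot(c(6, 7, 6, 2, 3, 1, 2, 6, 2, 7, 6))  # returns 68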
A.1.2. Semantic illusions
1. There is a running race among A, B, C, D, E, F. If B passes the person in second place, what place is B in now?
2. Larry’s father has five sons, viz. Ten, Twenty, Thirty, Forty…Guess what would be the name of the
fifth?
3. How many animals of each kind did Moses take on the ark?
4. In which decade did the Beatles become the most popular American band ever?
5. On which day of September did the Twin Towers in Washington, DC get attacked by Islamist
terrorists?
6. A plane was flying from Germany to Barcelona. On the last leg of the journey, it developed engine
trouble. Over the Pyrenees, the pilot started to lose control. The plane eventually crashed right on
the border. Wreckage was equally strewn in France and Spain. Where should the survivors be
buried?
Note. Items 1 and 2 were collected from the Internet, while items 3–6 were adopted from the Mata et al. (2014) study.
A.1.3. Belief bias syllogisms
Invalid/believable
1. All flowers need light.
   Roses need light.
   Roses are flowers.
3. All dogs have snouts.
   Labradors have snouts.
   Labradors are dogs.
5. All fruits have corns.
   Apples have corns.
   Apples are fruits.

Valid/unbelievable
2. All mammals can walk.
   Whales are mammals.
   Whales can walk.
4. All vehicles have wheels.
   Boats are vehicles.
   Boats have wheels.
6. All birds have wings.
   Cats are birds.
   Cats have wings.
Note. Items 1–4 were adopted from De Neys et al. (2010). Items 5 and 6 were developed by our
research group.
A.2. Descriptive statistics of the tests used in the study
                     CRT    AOT     REI      BBS    SI     BNT    BIDR
Number of people     210    206     206      206    206    206    195
Theoretical range    0–3    11–77   20–100   0–6    0–6    1–4    20–140
Range of data        0–3    39–71   27–98    0–6    0–6    1–4    47–118
Median               1      57      75       5      2      2      86
Mean                 0.8    56.7    72.1     4.5    2.6    2.4    84.9
SD                   1.0    6.5     13.4     1.8    1.4    1.3    15.0
Note. AOT, actively open-minded thinking; REI, rational-experiential inventory; BBS, belief bias syllo-
gisms; SI, semantic illusions; BNT, Berlin numeracy test; BIDR, balanced inventory of desirable
responding.
Appendix 2
B.1. Protocol analysis results per CRT item
B.1.1. Distribution of final correct responses per CRT item: number of
trials in the "correct start" and the "incorrect start" groups.
Item Correct start (n) Incorrect start (n) Total (n)
CRT1 24 14 38
CRT2 28 11 39
CRT3 72 12 84
CRT 124 37 161
B.1.2. Distribution of final incorrect responses per CRT item: number of
trials in the "reflective" and the "non-reflective" groups.
Item Non-reflective (n) Reflective (n) Total (n)
CRT1 78 56 134
CRT2 83 43 126
CRT3 58 43 101
CRT 219 142 361
B.2. Means and standard deviations of the individual differences
measures used for each protocol category (mean (SD)).
Correct start Incorrect start Non-reflective Reflective Gave up
BNT 3.14 (1.10) 3 (1.20) 2.11 (1.18) 2.06 (1.17) 1.95 (1.21)
AOT 58.45 (6.03) 58.22 (5.55) 55.76 (6.90) 56.45 (6.24) 54.36 (6.75)
REI 76.60 (11.42) 77.32 (8.57) 69.30 (13.90) 71.63 (12.70) 65.45 (19.31)
BBS 5.05 (1.59) 5.03 (1.61) 4.31 (1.89) 4.27 (1.82) 4.68 (2.06)
SI 2.86 (1.23) 2.65 (1.27) 2.53 (1.46) 2.42 (1.59) 2.64 (1.33)
BIDR 83.72 (14.97) 87.29 (14.48) 85.38 (14.54) 83.6 (14.78) 89.48 (15.72)
Note. BNT, Berlin numeracy test; AOT, actively open-minded thinking; REI, rational-experiential inven-
tory; BBS, belief bias syllogisms; SI, semantic illusions; BIDR, balanced inventory of desirable
responding.
B.4. The number of 'Correct start' and 'Incorrect start' cases within
the correct and incorrect final responses.
Correct start Incorrect start
Correct final response 124 37
Incorrect final response 1 349
Note. We ran an additional protocol analysis to separate the "Correct start" and "Incorrect start" cases within the incorrect responses. Similar to Appendix Sections B.1.1 and B.1.2, this table only shows those cases where the raters were in agreement on the categorisation of the cases. Twelve cases were excluded from the 362 incorrect responses due to disagreement among the raters.
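As a reading aid (not part of the original analysis code), the short R sketch below shows how the pooled category proportions follow from the counts reported in Sections B.1.1, B.1.2 and B.4 above.

# Counts copied from the protocol-analysis tables above (CRT items pooled)
correct_start   <- 124  # correct final responses that began with the correct answer or line of thought
incorrect_start <- 37   # correct final responses that began with an incorrect answer
reflective      <- 142  # incorrect final responses showing signs of reflection
non_reflective  <- 219  # incorrect final responses showing no signs of reflection

correct_start / (correct_start + incorrect_start)  # ~0.77 of correct responses started correctly
reflective / (reflective + non_reflective)         # ~0.39 of incorrect responses showed reflection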
Figure B.1. Histograms of final response times and reaction times broken down by CRT
tasks.