Tell Us What You Really Think: A think aloud protocol analysis of the verbal Cognitive
Reflection Test
Nick Byrd*†, Brianna Joseph**, Gabriela Gongora**, Miroslav Sirota***
*Stevens Institute of Technology, **Florida State University, ***University of Essex
Author Note
This project was funded, in part, by Florida State University’s Graduate School and Center for
Undergraduate Research and Academic Engagement, neither of which had any awareness of or
influence on the research process or the writing of this manuscript.
† Corresponding Author
Pearce 308
1 Castle Point Terrace
Hoboken, NJ 07030
byrdnick.com/contact
Conflict of Interest: For Study 2, Phonic agreed to let us use their premier tier of service at
no cost in exchange for our beta testing the pre-release version of their survey platform.
The published version of this paper (i) is open access—i.e., free—and (ii) includes more
information than this *pre-review* submitted manuscript. Please use the published version:
Byrd, N., Joseph, B., Gongora, G., & Sirota, M. (2023). Tell Us What You Really Think: A
Think Aloud Protocol Analysis of the Verbal Cognitive Reflection Test. Journal of
Intelligence, 11(4), 76. https://doi.org/10.3390/jintelligence11040076
See byrdnick.com/cv#publications for free access to the first author's other papers.
Tell Us What You Really Think: A think aloud protocol analysis of the verbal Cognitive
Reflection Test
Abstract
The standard interpretation of cognitive reflection tests assumes that correct responses are
reflective and lured responses are unreflective. However, prior process-tracing of mathematical
reflection tests has cast doubt on this interpretation. In two studies (N = 201), we deployed a
validated think-aloud protocol in-person and online to test how this assumption is satisfied by the
new, validated, less familiar, and less mathematical verbal Cognitive Reflection Test (vCRT).
Importantly, thinking aloud did not disrupt test performance compared to a control group.
Moreover, verbalized thoughts in both studies revealed that most (but not all) correct responses
involved reflection and that most (but not all) lured responses lacked reflection. These data
suggest that the vCRT usually satisfies the standard interpretation of the reflection tests (albeit
not without exceptions) and that the vCRT can be a good measure of the construct theorized by
the two-factor explication of ‘reflection’ (as deliberate and conscious).
Keywords: cognitive reflection test, think aloud protocol analysis, psychometrics, judgment- and
decision-making, heuristics and biases
Running Head: TELL US WHAT YOU REALLY THINK
Tell Us What You Really Think: A think aloud protocol analysis of the verbal Cognitive
Reflection Test
Introduction
If you were running a race and you passed the person in 2nd place, what place would you
be in now? The standard interpretation of a problem like this assumes that the answer that comes
quickly and effortlessly to many people’s mind is “1st place”. However, upon reflection, many
people realize that the correct answer is “2nd place”. This problem is considered a test of
reflection because it is designed to lure us toward a particular response that, upon reflection, we
can realize is incorrect (Byrd, 2022a, 2022b). Thus, the standard interpretation of reflection
tests labels lured responses ‘unreflective’ and correct responses ‘reflective’ (Pennycook,
Cheyne, et al., 2015).
Since the introduction of the Cognitive Reflection Test (Kahneman & Frederick, 2002),
theories of reflection have advanced (J. Evans & Stanovich, 2013). In the midst of this progress,
some theorists distilled dozens of reflective-unreflective distinctions (see Frankish, 2010, Table
1) down to just two somewhat orthogonal distinctions: automatic versus deliberate processing
and conscious versus unconscious representations (Shea & Frith, 2016). According to this two-
factor explication of ‘reflection’, reflective thinking is more consciously represented and
deliberate while unreflective thinking is less consciously represented and more automatic (Byrd,
2019, 2022b). These and other theorists also posited that the need for reflection may depend on
context: in familiar reasoning domains, unreflective reasoning may be able to achieve desirable
results, but in unfamiliar domains overcoming mistakes or biases might require some reflection
(Pennycook, Cheyne, et al., 2015).
As theories of reflection progressed, so did our understanding of reflection tests. Some
have proposed inconsistencies between the standard interpretation of mathematical reflection
tests (mCRTs for short) and the two-factor explication of ‘reflection’ (Stanovich, 2018). And there
is some evidence for this: some have found that most (67%) of those who answered a reflection
test correctly after deliberation had already answered it correctly under time pressure or
cognitive load before deliberation (Bago & Neys, 2019). While that two-response paradigm has
helped test the default interventionist account of reflection—which posits that correct
responses involve intervening on a default (a.k.a., lured) response (J. St. B. T. Evans, 2007)—the
paradigm overlooks plenty of useful information about the process of solving reflection test
problems. So others have listened to every word test-takers utter while thinking aloud during the
reflection test to find that most (77%) correct responses on the mCRT were the first unreflective
response and that many (39%) lured responses on the mCRT followed sustained reflection
(Szaszi et al., 2017). Some psychometric investigations of reflection tests suggest that this
inconsistency might be explained by domain familiarity (Purcell et al., 2020), intelligence
(Thompson & Johnson, 2014), or strategy (Markovits et al., 2020). Although the predictive value
of the mCRT remains after retaking the test (Bialek & Pennycook, 2018; Stagnaro et al., 2018),
the best predictor of mCRT performance is often general math test performance (Attali & Bar-
Hillel, 2020; Erceg et al., 2020). So mCRTs may track not only reflection, but also mathematical
competence (a.k.a., numeracy).
The Verbal Cognitive Reflection Test
Sirota and colleagues developed and validated a new, 10-item, non-mathematical variant
of Shane Frederick and colleagues’ (Frederick, 2005; Kahneman & Frederick, 2002) well-known
(mathematical) cognitive reflection test (mCRT) to address the familiarity and numeracy
problems (Byrd, 2022c). One example of these items is the opening example: “If you were
running a race, and you passed the person in 2nd place, what place would you be in now?”
Multiple studies found that the verbal cognitive reflection test (or vCRT for short) enjoys high
internal consistency, high test-retest reliability, and less association with general mathematical
ability than the mCRT (Sirota et al., 2020). This suggests that the vCRT is a promising supplement or
replacement for the mCRT in many research contexts.
Think Aloud Protocol Analysis
Researchers have long called for investigation into the content and process of reflection
rather than just the outcome (Stromer-Galley, 2007). Fortunately, Ericsson and colleagues have
developed and validated concurrent think aloud protocols (Ericsson, 2003; Ericsson & Simon,
1993) that have been shown to overcome some well-known problems of early verbal report
protocols such as post-hoc confabulation (e.g., Wilson & Nisbett, 1978). For example, asking
participants to verbalize or recall their thinking does not change task performance or produce
verbal reports inconsistent with their observed performance (Fox et al., 2011; Petitmengin et al.,
2013). One use of think aloud protocols is the kind of psychometric investigation that Szaszi and
colleagues (2017) conducted on the original CRT. So think aloud protocols may also be useful in
investigating some of the psychometric properties of the new vCRT.
The Current Research
Our primary goals were (a) to test whether thinking aloud changed reflection test
performance, (b) to beta test an online think aloud platform, (c) to quantify the deviation
between the standard interpretation of reflection tests and the two-factor explication of
‘reflection’, (d) to assess how vCRT performance depends on vCRT familiarity, and (e) to
assess the default interventionist account of reflection test responses. We pre-registered two hypotheses. First,
thinking aloud during the vCRT will provide evidence of correct-but-unreflective responses and
lured-yet-reflective responses. Second, thinking aloud will not significantly hinder vCRT
performance—i.e., it will either leave reflection test performance unchanged or improve it. An
experiment and a follow-up study produced the hypothesized outcomes. They also found that
the two-factor explication of ‘reflection’ correlated strongly, albeit imperfectly, with the
standard interpretation of reflection test performance. All manipulations, measures, and exclusions are
reported. All APA and IRB ethical guidelines were followed. Pre-registered hypotheses, methods,
analytic strategy, data, and R scripts are on the Open Science Framework: https://osf.io/rk3jq/.
Study 1
The first study primarily aimed to test the effect of thinking aloud on final responses to
vCRT questions. The secondary aims were to test the difference in vCRT performance between
familiar and naïve participants, the correlations between the standard interpretation of reflection
tests and more recent explications of ‘reflection’, as well as the rate of correct-but-unreflective
and lured-yet-reflective responses.
Method
Participants. People were recruited from public spaces on a university campus in the
Southeastern United States. We pre-registered a target sample size of 100 participants—50
participants per condition (Simmons et al., 2013). After months of recruitment, reaching the pre-
registered sample size with the in-person protocol during a pandemic became ethically and
practically untenable: the World Health Organization announced a global pandemic
(Ghebreyesus, 2020), the university campus closed, and the university IRB announced that all in-
person data collection must cease until further notice (Office for Human Subjects Protection,
2020). Since the protocol could not be replicated online, we had to halt data collection after
recruiting only 99 participants (mean age = 30.38 years; 57 identified as female, 38 as male, and
7 did not select a gender; 85 identified as White, 3 as Black, 3 as Hispanic or Latino, and 11 as
other ethnicity).
Procedure and Materials
Manipulation. After consenting to participate, participants navigated to a Qualtrics
survey using a QR code where they were randomly assigned to either a think aloud condition or
a control condition. To ensure that participants in both conditions completed the survey in front
of a researcher, they were asked to remain at the research table until the end of the survey to
receive their compensation: entry to win the smart speaker, water bottle, or books that were on
the table.
Think aloud protocol. Participants randomly assigned to the think aloud condition were
prompted to request instructions from a researcher. After the researcher explained the think aloud
protocol to participants, participants had a chance to ask for clarification and consent by pressing
a button labeled “I received and understand the instructions from the researcher”. Then a
researcher began an audio recording on a smartphone and the participant practiced thinking
aloud on a pre-survey task, “To practice thinking aloud, please say this sentence aloud, followed
by the following number ….” The number each participant read aloud was generated randomly
and used to anonymously pair survey responses with each corresponding think aloud recording.
Participants were reminded to think aloud as needed throughout the survey.
Verbal Cognitive Reflection Test. Participants completed the 10-item verbal Cognitive
Reflection Test or vCRT (Sirota et al., 2020). Responses were typed into text boxes. Following
the standard interpretation of reflection tests, reflective scores were computed by summing
correct responses (e.g., 2nd place) and unreflective scores were computed by summing lured
responses (e.g., 1st place) on these verbal reasoning items.
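To make this scoring concrete, here is a minimal sketch in R (chosen because the analysis scripts on the OSF page are in R); the data frame and response codes below are illustrative, not the original scripts’:

# Illustrative data: one row per participant, one column per vCRT item,
# with each response coded "correct", "lured", or "other".
responses <- data.frame(
  item1 = c("correct", "lured", "correct"),
  item2 = c("correct", "correct", "lured"),
  item3 = c("lured", "lured", "other")
)

# Standard interpretation: the reflective score sums correct responses and
# the unreflective score sums lured responses (0-10 on the full test).
reflective   <- rowSums(responses == "correct")
unreflective <- rowSums(responses == "lured")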
Questions about lures. To test whether correct responses followed lured responses—à la
the default-interventionist account of reflection testing (Howarth & Handley, 2016)—participants
reported whether the lured response occurred to them after they submitted each reflection test
answer. For instance, after answering the aforementioned question about passing the racer in 2nd
place, participants were asked, “Have you thought at any point that '1st place' could be the
answer?”
Deliberateness and consciousness in think aloud recordings. The two-factor
explication of ‘reflection’ holds that reasoning is reflective when it is more deliberate and more
consciously represented (Shea & Frith, 2016). Reasoning is said to be deliberate when it does not
merely accept the initial, automatic response and is said to be conscious when participants can
articulate parts of their reasoning (Byrd, 2019, 2022b). So each think aloud recording was used
to determine responses’ deliberateness—i.e., whether the participant verbally reconsidered their
initial response—and conscious representation—i.e., whether the participant verbalized a reason
for or against a response. Determinations were labeled “yes”, “no”, and “indeterminate”.
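As a sketch of this coding logic in R: the aggregation rule below, which counts a response as reflective when either factor is coded “yes”, is our assumption based on the “deliberate or conscious” analyses reported with Figure B, not a rule stated in the coding manual.

# Hypothetical labeling rule for one response, given the two codes:
# deliberate = participant verbally reconsidered the initial response,
# conscious  = participant verbalized a reason for or against a response.
# Each code is "yes", "no", or "indeterminate".
two_factor_label <- function(deliberate, conscious) {
  if (deliberate == "yes" || conscious == "yes") {
    "reflective"
  } else if (deliberate == "no" && conscious == "no") {
    "unreflective"
  } else {
    "indeterminate"
  }
}

two_factor_label("no", "yes")  # "reflective"
two_factor_label("no", "no")   # "unreflective"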
Familiarity in think aloud recordings. Prior work found that many participants are
already familiar with reflection test questions and that such familiarity may be the best predictor
of reflection test performance (Byrd, 2022c; Stieger & Reips, 2016). So think aloud recordings
were also used to determine whether each participant mentioned being familiar with any of the
vCRT items. Determinations were labeled “yes”, “no”, or “indeterminate” and averaged across
three raters to compute a familiarity parameter.
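A one-line sketch of that averaging step in R, assuming “yes” is coded 1 and “no” is coded 0 (the treatment of “indeterminate” codes as missing is also our assumption):

# Three raters' familiarity codes for one participant ("yes" = 1, "no" = 0,
# "indeterminate" = NA, an assumption).
rater_codes <- c(1, 1, NA)
familiarity <- mean(rater_codes, na.rm = TRUE)  # 1 for this participant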
Results
We tested the effect of thinking aloud on reflection test performance, the correlation
between standard reflection test scoring and recent explications of ‘reflection’, the correlation
between test familiarity and test performance, and the rate of correct-but-unreflective and
lured-yet-reflective responses on the vCRT.
Correct-but-unreflective and lured-yet-reflective responses. Think aloud verbal
reports sometimes deviate from the standard interpretation of reflection tests. We expected some
people to arrive at correct answers prior to reflection and without first thinking of the lured
answer, and others to be lured into particular incorrect answers despite sustained reflection.
Table 1 confirms this pre-registered expectation: the standard interpretation of correct and
lured responses usually but imperfectly agrees with the two-factor explication of ‘reflection’.
Table 1. Standard and two-factor categorizations of reflection test responses based on think aloud
protocol analysis of responses to the verbal reflection test in Study 1. Example verbalizations based on
the following reflection test question: If you were running a race, and you passed the person in 2nd
place, what place would you be in now?

Category                 | Example verbalization     | Answer  | Standard     | Two-factor   | Rate
Correct-and-reflective   | "1st. No... wait... 2nd." | Correct | Reflective   | Reflective   | 80.2%
Correct-but-unreflective | "2nd."                    | Correct | Reflective   | Unreflective | 19.8%
Lured-and-unreflective   | "1st."                    | Lured   | Unreflective | Unreflective | 71.5%
Lured-yet-reflective     | "1st. Wait... Yeah: 1st." | Lured   | Unreflective | Reflective   | 28.5%
Thinking aloud did not impact performance. Figure A affirms our pre-registered
hypothesis and prior meta-analytic work (Fox et al., 2011): we did not detect an effect of
thinking aloud on the number of lured or correct responses on the vCRT.
Figure A. The effect of thinking aloud on verbal cognitive reflection test (vCRT) performance in Study 1.
Error bars represent a standard error. (N = 99)
Two-factor interpretation predicts the standard interpretation. Regression analysis
was employed to understand how well the standard interpretation of reflection tests aligns with
dual process theorists’ two-factor explication of ‘reflection’. Figure B shows that they align well:
increases in the number of participants’ responses that involved deliberate or conscious
thinking—as determined by think aloud recordings—correlated with significant decreases in the
number of lured responses and significant increases in the number of correct responses.
Figure B. Correlations between deliberate and/or conscious responses and the standard interpretation of
verbal reflection test (vCRT) performance in the think aloud condition of Study 1 (N = 47) with gray
standard error bands.
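In outline, the analyses behind Figure B amount to simple linear regressions like the sketch below (in R, with simulated stand-in data; the variable names are ours, not those of the posted scripts):

# Simulated stand-in data: one row per think aloud participant.
set.seed(1)
n <- 47
vcrt <- data.frame(reflective_count = sample(0:10, n, replace = TRUE))
vcrt$correct <- pmin(10, vcrt$reflective_count + sample(0:2, n, replace = TRUE))
vcrt$lured   <- 10 - vcrt$correct  # simplification: treats every non-correct response as lured

# Correct responses should rise, and lured responses fall, with the number of
# responses coded as deliberate or conscious.
summary(lm(correct ~ reflective_count, data = vcrt))
summary(lm(lured ~ reflective_count, data = vcrt))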
Consideration of lured responses. Frederick (2005) observed that correct
mCRT responses often involved consideration of the lured response. Some hypothesize that the
appeal of incorrect lures is that they are more likely to feel correct (Thompson et al., 2013). If that is
right, then people should not only be likely to consider lured responses, but those who consider
lures should be very likely to accept lures as their final answer. Figure C confirms this:
consideration of lured responses while thinking aloud was relatively high (mean = 6.13, range =
0-10, S.D. = 2.59), and it almost perfectly predicted lured responding.
Figure C. Correlations between consideration of lures and verbal reflection test (vCRT) performance in
Study 1 (N = 99). Gray bands represent a standard error.
Test familiarity predicted test performance. In about 27% of think aloud recordings,
participants mentioned prior familiarity with at least one item on the vCRT—e.g., “I’ve seen
these questions on TikTok”. Figure D shows a large difference in vCRT performance between familiar and
naïve participants on both lured responses (d = -0.87) and correct responses (d = 1.13).
Figure D. Performance on the verbal reflection test (vCRT) among participants in the think aloud
condition of Study 1 (N = 47) depending on their unsolicited self-report of familiarity with the vCRT with
standard error bars.
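For readers reproducing effect sizes like these, a minimal Cohen’s d helper in R (the two vectors are placeholders, not our data):

# Cohen's d with a pooled standard deviation.
cohens_d <- function(x, y) {
  pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                      (length(x) + length(y) - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Placeholder correct-response counts for familiar vs. naive participants.
familiar <- c(9, 10, 8, 9, 10)
naive <- c(5, 7, 6, 4, 8)
cohens_d(familiar, naive)  # positive d: familiar participants answered more correctly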
Discussion
These data suggest significant alignment between the dual process theorists’ two-factor
explication of ‘reflection’ and the standard interpretation of vCRT performance. They also
suggest that the university participants were largely naïve to the vCRT even though self-reported
familiarity remained a strong predictor of the standard interpretation of vCRT performance.
One might wonder whether these results will replicate in a larger think aloud validation
of the vCRT. Fortunately, our initial results suggest that the think aloud protocol will not
significantly influence vCRT performance. So a larger replication is methodologically possible.
Unfortunately, large-scale, in-person think aloud protocols are prohibitively time-consuming,
tedious, or—during a pandemic—unethical. To overcome these challenges, we partnered with a
startup to develop a platform for large-scale, online, think aloud surveys.
Study 2
Study 2 aimed to replicate the findings of Study 1 in a new sample of participants and test
the feasibility of online think aloud survey methodology. To do this, we reproduced all of the
instructions and measures in the thinking-aloud condition of Study 1 in an online audio survey
platform, Phonic (Perrodin & Todd, 2021; Phonic Inc., 2020).
Method
Participants. English-speaking monolingual participants were recruited from Prolific
(Palan & Schitter, 2018; Peer et al., 2017) for an expected $9.85/hour based on the average
completion time of the think aloud participants in Study 1. To ensure data quality, Prolific alerted
candidate participants that compensation would depend on their consent and ability to provide
usable recordings of their thoughts throughout the survey. We aimed to double the pre-registered
sample size of the think aloud condition in Study 1 (N = 47), recruiting 102 participants (mean
age = 30.38; 57 identified as female, 38 as male, and 7 did not select a gender; 85 identified as
White, 3 as Black, 3 as Hispanic or Latino, and 11 as other ethnicity).
Procedure and Materials
Phonic audio survey platform. We used an online audio survey platform, Phonic, to
record verbal protocols online; we received their premier tier of service in exchange for beta
testing their new survey platform.
Study 1 materials. All procedures and materials from Study 1, thinking-aloud condition,
were included in Study 2. Participants practiced thinking aloud before the survey, thought aloud
while completing the 10-item vCRT—with reminders to verbalize all of their thoughts
throughout the test—and answered follow-up questions about whether they considered the lured
response to each vCRT question. The deliberateness and consciousness of each response were
determined from each participant’s think aloud recording for each question. Participants’ familiarity
with any portion of the vCRT was determined from all of each participant’s think aloud recordings.
Results
We tested the correlation between test familiarity and test performance, the correlation
between standard reflection test scoring and recent explications of ‘reflection’, as well as the rate
of correct-but-unreflective responses and lured-yet-reflective responses on the vCRT.
Correct-but-unreflective and lured-yet-reflective responses. To test the agreement
between the standard interpretation of reflection tests and more recent two-factor explications of
‘reflection’, the rates of correct-but-unreflective and lured-yet-reflective responses were
determined from Prolific participants’ think aloud recordings. Table 2 shows a replication of the
preponderant yet imperfect agreement between the standard interpretation and the two-factor
explication.
Table 2. Standard and two-factor categorizations of reflection test responses based on think aloud
protocol analysis of responses to the verbal reflection test in Study 2. Example verbalizations based on
the following reflection test question: If you were running a race, and you passed the person in 2nd
place, what place would you be in now?

Categorization           | Example verbalization     | Standard     | Two-factor   | Rate
Correct-and-reflective   | "1st. No... wait... 2nd." | Reflective   | Reflective   | 68.5%
Correct-but-unreflective | "2nd."                    | Reflective   | Unreflective | 31.5%
Lured-and-unreflective   | "1st."                    | Unreflective | Unreflective | 75.8%
Lured-yet-reflective     | "1st. Wait... Yeah: 1st." | Unreflective | Reflective   | 24.2%
Two-factor interpretation predicts the standard interpretation. Another regression
analysis was employed to test how the standard interpretation of reflection tests aligns with dual
process theorists’ two-factor explication of ‘reflection’. Figure E shows a replication of their
correlation: increases in the number of participants’ responses that involved deliberate or
conscious thinking—as determined by think aloud recordings—correlated with significant
decreases in the number of lured responses and significant increases in the number of correct
responses.
Figure E. Correlations between deliberate and/or conscious responses and the standard interpretation of
verbal reflection test (vCRT) performance in Study 2 (N = 102). Gray bands represent a standard error.
Consideration of lured responses. Figure F shows a replication of the strong feeling of
rightness associated with lured responses. Indeed, lure consideration was not only relatively high
(mean = 5.37, range = 0-10, S.D. = 2.38); it remained the best predictor of both lured and correct
responses on the vCRT.
Figure F. Correlations between consideration of lures and verbal reflection test (vCRT) performance in
Study 2 (N = 102). Gray bands represent a standard error.
Test familiarity predicted test performance. In about 17% of think aloud recordings,
Prolific participants mentioned prior familiarity with at least one item on the vCRT—
significantly less familiarity than the 27% familiarity among our university participants, t = -2.9,
95% CI [0.09, 0.24], p = 0.005. Figure G shows a replication of the large difference in vCRT performance
between familiar and naïve participants for both lured responses (d = -0.91) and correct
responses (d = 0.90).
Figure G. Performance on the verbal reflection test (vCRT) among participants in Study 2 (N = 102)
depending on their unsolicited self-report of familiarity with the vCRT. Error bars represent a standard
error.
General Discussion
After confirming that thinking aloud does not interfere with performance on the verbal
reflection test, observational analysis of think aloud verbal reports in two studies found
significant alignment between the standard interpretation of vCRT responses (Pennycook,
Cheyne, et al., 2015) and more recent two-factor explications of reflective reasoning (Byrd,
2019, 2022, 2022b; Shea & Frith, 2016).
These studies also found that most university participants and Prolific participants were
naïve to the vCRT. Nonetheless, unsolicited think aloud self-reports of familiarity with the vCRT
were a strong predictor of the standard interpretation of vCRT performance in both studies. This
evidence replicates and extends some of the promising features of the vCRT (Sirota et al., 2020).
Methodological Implications
The present studies also suggest that think aloud protocols can reveal valuable and
otherwise undetected nuance in cognitive reflection test performance. For instance, think aloud
recordings revealed that the standard interpretation of reflection test responses mislabeled 19-
31% of responses as either reflective or unreflective.
This insight seems to increase the justificatory burden of employing the standard
interpretation of reflection tests or of not employing the think aloud protocols (Byrd, 2022a) that
more accurately detected the deliberate and conscious features of reflective thinking (Shea &
Frith, 2016). Even if researchers do not rethink their interpretation of or reliance on reflection
tests, they may nonetheless need to offer new reasons for the status quo.
Theoretical Implications
Another result of the present studies was reliable support for the “feeling of rightness”
explanation of reflection test performance (Pennycook, Fugelsang, et al., 2015; Thompson et al.,
2011). Most responses involved consideration of the lure and consideration of lures was the best
predictor of both lured and correct responses on the vCRT: considering a lure in one’s initial
response strongly correlated with accepting the lure as one’s final response. If lures were not
significantly more appealing than other possible responses, then it would be difficult to explain
this preponderance of lure considerations, lured responding, and their strong correlation.
This may have implications for the debate between default interventionist accounts of
reflection and their alternatives (Howarth & Handley, 2016). Those who never considered the
lured response were the most likely to arrive at the correct response. In other words, the so-called
reflective (i.e., correct) response on reflection tests may not usually involve intervening on a
default (lured) response after all. Of course, these data suggest that there are some cases of
reflective default intervention. So the current evidence may not falsify the default interventionist
account so much as show that it is not an exhaustive explanation of reflection.
Limitations
The current studies were limited by the human resources available to listen to and
code think aloud verbal reports. This resulted in minimal sample sizes for the research questions
addressed in this paper (Simmons et al., 2013). Although the expected effects were detected—
some in multiple populations, both in-person and online—there remains an opportunity for
researchers with more human resources to conduct larger-scale replication and extensions of the
existing work. Otherwise, we will have to wait for online think aloud survey platforms to
improve their speech transcription, sentiment analysis, and other features enough to automate
and therefore scale up think aloud protocol research.
Conclusion
The present studies partially replicate and clarify existing validations of the verbal
cognitive reflection test, thanks in part to novel online audio survey technology. Most
participants are naïve to the test and the standard interpretation of reflection testing largely aligns
with more advanced explications of reflective reasoning. Taken together with existing work
showing that verbal reflection tests can have high internal consistency, high test-retest reliability,
and less association with mathematical ability or gender, the present evidence suggests that the
vCRT could be a promising supplement or replacement for widely used reflection tests.
Nonetheless, there may still be opportunities to improve our understanding of reflection by
redeploying online think aloud protocols for larger-scale research. Thus, both verbal reflection
tests and online think aloud protocols are promising tools for advancing our understanding of
reflective reasoning and its alternatives.
References
Attali, Y., & Bar-Hillel, M. (2020). The false allure of fast lures. Judgment and Decision Making,
15(1), 93–111. http://journal.sjdm.org/19/191217/jdm191217.html
Bago, B., & Neys, W. D. (2019). The Smart System 1: Evidence for the intuitive nature of
correct responding on the bat-and-ball problem. Thinking & Reasoning, 25(3), 257–299.
https://doi.org/10.1080/13546783.2018.1507949
Bialek, M., & Pennycook, G. (2018). The cognitive reflection test is robust to multiple
exposures. Behavior Research Methods, 50(5), 1953–1959.
https://doi.org/10.3758/s13428-017-0963-x
Byrd, N. (2019). What we can (and can’t) infer about implicit bias from debiasing experiments.
Synthese, 198(2), 1427–1455. https://doi.org/10.1007/s11229-019-02128-6
Byrd, N. (2022). A Two-Factor Explication Of ‘Reflection’: Unifying, Making Sense Of, And
Guiding The Philosophy And Science Of Reflective Reasoning.
Byrd, N. (2022a). All Measures Are Not Created Equal: Reflection test, think aloud, and process
dissociation protocols. https://researchgate.net/publication/344207716
Byrd, N. (2022b). Bounded Reflectivism & Epistemic Identity. Metaphilosophy, 53, 53–69.
https://doi.org/10.1111/meta.12534
Byrd, N. (2022c). Great Minds do not Think Alike: Philosophers’ Views Predicted by Reflection,
Education, Personality, and Other Demographic Differences. Review of Philosophy and
Psychology. https://doi.org/10.1007/s13164-022-00628-y
Erceg, N., Galic, Z., & Ružojčić, M. (2020). A reflection on cognitive reflection – testing
convergent validity of two versions of the Cognitive Reflection Test. Judgment and
Decision Making, 15(5), 741–755. https://doi.org/10.31234/osf.io/ewrtq
Ericsson, A. (2003). Valid and Non-Reactive Verbalization of Thoughts During Performance of
Tasks: Towards a Solution to the Central Problems of Introspection as a Source of
Scientific Data. Journal of Consciousness Studies, 10(9–10), 1–18.
Ericsson, K. A., & Simon, H. A. (1993). Protocol Analysis: Verbal Reports as Data (revised
edition). Bradford Books/MIT Press.
Evans, J. St. B. T. (2007). On the resolution of conflict in dual process theories of reasoning.
Thinking & Reasoning, 13(4), 321–339. https://doi.org/10.1080/13546780601008825
Evans, J., & Stanovich, K. E. (2013). Dual-Process Theories of Higher Cognition: Advancing the
Debate. Perspectives on Psychological Science, 8(3), 223–241.
https://doi.org/10.1177/1745691612460685
Fox, M. C., Ericsson, K. A., & Best, R. (2011). Do procedures for verbal reporting of thinking
have to be reactive? A meta-analysis and recommendations for best reporting methods.
Psychological Bulletin, 137(2), 316–344. https://doi.org/10.1037/a0021663
Frederick, S. (2005). Cognitive Reflection and Decision Making. Journal of Economic
Perspectives, 19(4), 25–42. https://doi.org/10.1257/089533005775196732
Ghebreyesus, T. A. (2020). WHO Director-General’s opening remarks at the media briefing on
COVID-19 on March 11. who.int/dg/speeches/detail/who-director-general-s-opening-
remarks-at-the-media-briefing-on-covid-19---11-march-2020
Howarth, S., & Handley, S. (2016). Belief bias, base rates and moral judgment: Re-evaluating
the default interventionist dual process account. In N. Galbraith, E. Lucas, & D. Over
(Eds.), The Thinking Mind (pp. 97–111). Taylor & Francis.
https://doi.org/10.4324/9781315676074-14
Kahneman, D., & Frederick, S. (2002). Representativeness revisited: Attribute substitution in
intuitive judgment. In T. Gilovich, D. Griffin, & D. Kahneman (Eds.), Heuristics and
biases: The psychology of intuitive judgment (pp. 49–81). Cambridge University Press.
Markovits, H., de Chantal, P.-L., Brisson, J., Dubé, É., Thompson, V., & Newman, I. (2020).
Reasoning strategies predict use of very fast logical reasoning. Memory & Cognition.
https://doi.org/10.3758/s13421-020-01108-3
Office for Human Subjects Protection. (2020, March 23). Temporary cessation to some FSU
human subjects research. Florida State University News.
https://news.fsu.edu/announcements/covid-19/2020/03/23/temporary-cessation-to-some-
fsu-human-subjects-research/
Palan, S., & Schitter, C. (2018). Prolific.ac—A subject pool for online experiments. Journal of
Behavioral and Experimental Finance, 17, 22–27.
https://doi.org/10.1016/j.jbef.2017.12.004
Peer, E., Brandimarte, L., Samat, S., & Acquisti, A. (2017). Beyond the Turk: Alternative
platforms for crowdsourcing behavioral research. Journal of Experimental Social
Psychology, 70, 153–163. https://doi.org/10.1016/j.jesp.2017.01.006
Pennycook, G., Cheyne, J. A., Koehler, D. J., & Fugelsang, J. A. (2015). Is the cognitive
reflection test a measure of both reflection and intuition? Behavior Research Methods, 1–
8. https://doi.org/10.3758/s13428-015-0576-1
Pennycook, G., Fugelsang, J. A., & Koehler, D. J. (2015). What makes us think? A three-stage
dual-process model of analytic engagement. Cognitive Psychology, 80, 34–72.
https://doi.org/10.1016/j.cogpsych.2015.05.001
Perrodin, D. D., & Todd, R. W. (2021). Choices in asynchronously collecting qualitative data:
Moving from written responses to spoken responses for open-ended queries. DRAL4
2021, 11. https://sola.pr.kmutt.ac.th/dral2021/wp-content/uploads/2022/06/3.pdf
Petitmengin, C., Remillieux, A., Cahour, B., & Carter-Thomas, S. (2013). A gap in Nisbett and
Wilson’s findings? A first-person access to our cognitive processes. Consciousness and
Cognition, 22(2), 654–669. https://doi.org/10.1016/j.concog.2013.02.004
Phonic Inc. (2020). Surveys you can answer with your voice. phonic.ai
Purcell, Z. A., Wastell, C. A., & Sweller, N. (2020). Domain-specific experience and dual-
process thinking. Thinking & Reasoning, 0(0), 1–29.
https://doi.org/10.1080/13546783.2020.1793813
Shea, N., & Frith, C. D. (2016). Dual-process theories and consciousness: The case for ‘Type
Zero’ cognition. Neuroscience of Consciousness, 2016(1).
https://doi.org/10.1093/nc/niw005
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2013). Life after P-Hacking. Meeting of the
Society for Personality and Social Psychology, 38.
https://papers.ssrn.com/abstract=2205186
Sirota, M., Kostovičová, L., Juanchich, M., Dewberry, C., & Marshall, A. C. (2020). Measuring
Cognitive Reflection without Maths: Developing and Validating the Verbal Cognitive
Reflection Test. Journal of Behavioral Decision Making.
https://doi.org/10.1002/bdm.2213
Stagnaro, M. N., Pennycook, G., & Rand, D. G. (2018). Performance on the Cognitive
Reflection Test is stable across time. Judgment and Decision Making, 13(3), 260–267.
https://ideas.repec.org/a/jdm/journl/v13y2018i3p260-267.html
Stanovich, K. E. (2018). Miserliness in human cognition: The interaction of detection, override
and mindware. Thinking & Reasoning, 24(4), 423–444.
https://doi.org/10.1080/13546783.2018.1459314
Stieger, S., & Reips, U.-D. (2016). A limitation of the Cognitive Reflection Test: Familiarity.
PeerJ, 4, e2395. https://doi.org/10.7717/peerj.2395
Stromer-Galley, J. (2007). Measuring Deliberation’s Content: A Coding Scheme. Journal of
Public Deliberation, 3(1). proquest.com/docview/1418196052/
Szaszi, B., Szollosi, A., Palfi, B., & Aczel, B. (2017). The cognitive reflection test revisited:
Exploring the ways individuals solve the test. Thinking & Reasoning, 23(3), 207–234.
https://doi.org/10.1080/13546783.2017.1292954
Thompson, V. A., Evans, J., & Campbell, J. I. D. (2013). Matching bias on the selection task: It’s
fast and feels good. Thinking & Reasoning, 19(3–4), 431–452.
https://doi.org/10.1080/13546783.2013.820220
Thompson, V. A., & Johnson, S. C. (2014). Conflict, metacognition, and analytic thinking.
Thinking & Reasoning, 20(2), 215–244. https://doi.org/10.1080/13546783.2013.869763
Thompson, V. A., Prowse Turner, J. A., & Pennycook, G. (2011). Intuition, reason, and
metacognition. Cognitive Psychology, 63(3), 107–140.
https://doi.org/10.1016/j.cogpsych.2011.06.001
Wilson, T., & Nisbett, R. E. (1978). The Accuracy of Verbal Reports About the Effects of Stimuli
on Evaluations and Behavior. Social Psychology, 41(2), 118–131.
https://doi.org/10.2307/3033572
Appendix
Figure A1. Practice effect tests in Study 1 (N = 99). Performance did not increase; rather, it decreased
during the survey. The magnitude of this decrease (in Z-scores) was not dramatically different between
conditions. Error bands indicate standard error.
Figure A2. Practice effect test in Study 2 (N = 102). Performance did not increase or decrease during the
survey. Error bands indicate standard error.
Think Aloud Instructions
In this experiment we are interested in what you think about when you find answers to
some questions that I am going to ask you to answer. In order to do this we ask you to THINK
ALOUD as you work on the problem given.
What I mean by think aloud is that I want you to tell me EVERYTHING you are thinking
from the time you first see the question until you give an answer. I would like you to talk aloud
CONSTANTLY from the time you start the survey to the time that you finish.
Please don't plan out what you say or try to explain to me what you are saying. Just act as
if you are alone in the room speaking to yourself.
It is most important that you keep talking. If you are silent for any long period of time I
will ask you to talk.
Do you understand what we need you to do?
Good, now you can practice thinking aloud on this sample question.
[Ensure they read aloud (including the code) and continue to think aloud while
practicing]
Ok. Now you will begin the survey.
[Use phrase like ‘remember to say your thoughts aloud’ if they are silent for a few
seconds.]
[Thank them when they are done.]
Modified Verbal Cognitive Reflection Test (Sirota et al., 2020)
(1) Mary’s father has 5 daughters but no sons: Nana, Nene, Nini, Nono. What is the fifth
daughter’s name probably?
correct answer: Mary, lured answer: Nunu
“Do you remember thinking at any point that 'Nunu' could be the answer?”
Yes No
(2) If you were running a race, and you passed the person in 2nd place, what place would you be
in now?
correct answer: 2nd, lured answer: 1st
“Do you remember thinking at any point that ‘1st’ could be the answer?”
Yes No
(3) It’s a stormy night and a plane takes off from JFK airport in New York. The storm worsens,
and the plane crashes: half lands in the United States, the other half lands in Canada. In which
country do you bury the survivors?
correct answer: we don’t bury survivors, lured answers: answers about burial location
“Do you remember thinking at any point that survivor burial was an option?”
Yes No
(4) A monkey, a squirrel, and a bird are racing to the top of a coconut tree. Who will get the
banana first, the monkey, the squirrel, or the bird?
correct answer: no banana on coconut tree, lured answer: any of the animals
“Do you remember thinking at any point that ‘bird’, ‘squirrel’, or ‘monkey’ could be the
answer?”
Yes No
(5) In a one-storey pink house, there was a pink person, a pink cat, a pink fish, a pink computer,
a pink chair, a pink table, a pink telephone, a pink shower – everything was pink! What
colour were the stairs probably?
correct answer: a one-storey house probably doesn’t have stairs, lured answer: pink
“Do you remember thinking at any point that ‘pink’ could be the answer?”
Yes No
(6) How many of each animal did Moses put on the ark?
correct answer: none; lured answer: two
“Do you remember thinking at any point that ‘two’ could be the answer?”
Yes No
(7) The wind blows west. An electric train runs east. In which cardinal direction does the smoke
from the locomotive blow?
correct answer: no smoke from an electric train, lured answer: west
“Do you remember thinking at any point the locomotive will produce smoke?”
Yes No
(8) If you have only one match and you walk into a dark room where there is an oil lamp, a
newspaper and wood – which thing would you light first?
correct answer: match, lured answer: oil lamp
“Do you remember thinking at any point that ‘oil lamp’, ‘newspaper’, or ‘wood’ could be
the answer?”
Yes No
(9) Would it be ethical for a man to marry the sister of his widow?
correct answer: not possible, lured answers: yes or no
“Do you remember thinking at any point that it is possible for a man to marry the sister of
his widow?”
Yes No
(10) Which sentence is correct: a) “the yolk of the egg are white” or b) “the yolk of the egg is
white”?
correct answer: the yolk is yellow, lured answer: b
“Do you remember thinking at any point that ‘a’ or ‘b’ could be the answer?”
Yes No
Scoring and coding:
Conditions: Control condition = 0. Think Aloud condition = 1.
Sum correct and lured answers for reflective and unreflective parameters, respectively.
Produce transcripts from audio of the think aloud condition. Create variables for (a) whether
each participant reconsidered each of their initial responses, (b) whether each participant
verbalized any reason(s) for or against any response(s), and (c) whether each participant
mentioned being familiar with any verbal reflection test question.
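A compact sketch of those derived variables in R (the transcript codes below are placeholders; the actual codes came from human raters):

# Placeholder per-response codes: reconsidered = (a), verbalized_reason = (b).
codes <- data.frame(
  participant = c(1, 1, 2, 2),
  item = c(1, 2, 1, 2),
  reconsidered = c(1, 0, 0, 1),
  verbalized_reason = c(1, 1, 0, 1)
)

# Participant-level counts of deliberate and conscious responses.
aggregate(cbind(reconsidered, verbalized_reason) ~ participant,
          data = codes, FUN = sum)

# Variable (c) is a single per-participant indicator of mentioned familiarity.
familiar <- c("1" = 0, "2" = 1)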