A comprehensive assessment of current methods for measuring metacognition

Dobromir Rahnev
School of Psychology, Georgia Institute of Technology, Atlanta, GA, USA; Computational Cognition Center of Excellence, Georgia Institute of Technology, Atlanta, GA, USA. e-mail: rahnev@psych.gatech.edu

Nature Communications (2025) 16:701. https://doi.org/10.1038/s41467-025-56117-0
Received: 17 July 2023; Accepted: 9 January 2025
One of the most important aspects of research on metacognition is the measurement of metacognitive ability. However, the properties of existing measures of metacognition have been mostly assumed rather than empirically established. Here I perform a comprehensive empirical assessment of 17 measures of metacognition. First, I develop a method of determining the validity and precision of a measure of metacognition and find that all 17 measures are valid and most show similar levels of precision. Second, I examine how measures of metacognition depend on task performance, response bias, and metacognitive bias, finding only weak dependences on response and metacognitive bias but many strong dependencies on task performance. Third, I find that all measures have very high split-half reliabilities, but most have poor test-retest reliabilities. This comprehensive assessment paints a complex picture: no measure of metacognition is perfect and different measures may be preferable in different experimental contexts.
Metacognition is classically defined as "knowing about knowing"1. Within this broad construct, the term "metacognitive ability" refers more narrowly to the capacity to evaluate one's decisions by distinguishing between correct and incorrect answers2,3. High metacognitive ability allows us to have high confidence when we are correct but low confidence when we are wrong. Conversely, low metacognitive ability impairs the capacity of confidence ratings to distinguish between instances when we are correct or wrong. Metacognitive ability is thus a critical capacity in human beings linked to our ability to learn4, make good decisions5, interact with others6, and know ourselves7. As such, it is critical that we have the tools to precisely measure metacognitive ability in human participants.
Metacognitive ability is typically assumed to be a somewhat stable trait with meaningful variability across people2,8,9. Consequently, metacognitive ability has been correlated with other stable individual differences, such as brain structure10–13. While metacognitive ability is often assumed to be domain-general and rely on shared neural substrates, this question remains hotly debated14–17. The construct of metacognitive ability is also thought to be different from other constructs such as task skill or bias, so it is often desirable to find metrics of metacognitive ability unrelated to these other constructs18.
Below, I first examine the properties that one may desire in a measure of metacognition and then review the known properties of existing measures of metacognitive ability. This brief overview demonstrates that there is little we firmly know about the properties of existing measures of metacognition. The rest of the paper aims to fill this gap by providing a comprehensive test of the critical properties of many common measures of metacognition.
Before one can evaluate a given measure of metacognition, it is first necessary to determine what properties are important or desirable. Since there is no existing list of desirable properties, I start by creating one here (Supplementary Table 1) and discuss each property below.
The most important property of any measure is that it is valid: namely, it should measure what it purports to measure19. Existing measures of metacognitive ability assess the degree to which confidence is associated with objective reality, thus making them face valid. Still, we lack a formal way of verifying the validity of existing measures. A related property is precision. I use the term precision following its definitions in the literature as the ability to repeatedly measure a variable with a constant true score and obtain similar results20, the "margin of error" in a measurement21, or the spread of values that would be expected across multiple measurement attempts22. Note that precision here does not refer to whether a
measure is only affected by the construct of interest. Precision has been largely ignored in the context of measures of metacognition and we currently lack methods to measure it. Here I develop a simple and intuitive method for assessing both validity and precision of metacognition measures. The method demonstrates that all existing measures of metacognition are valid but show some variations in precision.
Another critical property of measures of metacognition, and one that is perhaps the most widely appreciated, is that such measures should be independent of various nuisance variables. Here a "nuisance variable" is any property of people's behavior that is not directly related to their metacognitive ability.
The nuisance variable that has received the most attention is task performance. It is often desirable that a measure of metacognition should not be affected by whether people happened to be performing an easy or a difficult task3,18. For example, in visual perception tasks with confidence, there is little reason to believe that the underlying metacognitive ability should be affected by stimulus contrast. Thus, one may want to measure the same metacognitive ability regardless of contrast level. Note that there are subtleties here. If difficulty is manipulated by introducing cognitive load or other task demands that may tax the metacognitive system, then one would not necessarily expect metacognitive ability to remain the same anymore (though whether metacognitive ability is affected by working memory load remains a topic of debate23–25). Therefore, the logic here applies more readily to stimulus than task manipulations. That said, even if one does not agree that metacognitive ability should be independent from task performance, examining how each measure depends on task performance is still informative, especially if there are meaningful differences between measures. Task performance can be computed as d', which is a measure of sensitivity derived from signal detection theory (SDT).
A second nuisance variable is response bias, that is, the tendency to select one response category more than another18. For two-choice tasks, this variable can be quantified as the decision criterion, c, derived from SDT. Response bias is under strategic control in that participants can freely choose to select one stimulus category more often than others. In fact, they consistently do so in response to experimental manipulations such as expectation or reward26. As such, measures of metacognitive ability should ideally remain independent of response bias.
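For concreteness, the following is a minimal sketch (not code from the paper) of how d' and the decision criterion c can be computed from trial-level data in a two-choice task using the standard SDT formulas; the 0/1 coding of the input arrays and the small correction applied to extreme hit and false-alarm rates are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def sdt_d_prime_and_criterion(stimulus, response):
    """Compute SDT sensitivity (d') and criterion (c) for a 2-choice task.

    stimulus, response: arrays coded 0/1 for the two categories.
    Uses the standard formulas d' = z(HR) - z(FAR) and
    c = -0.5 * (z(HR) + z(FAR)), with a simple log-linear-style
    correction to avoid hit/false-alarm rates of exactly 0 or 1.
    """
    stimulus = np.asarray(stimulus)
    response = np.asarray(response)
    n_signal = np.sum(stimulus == 1)
    n_noise = np.sum(stimulus == 0)

    # Hit and false-alarm rates, kept strictly inside (0, 1)
    hr = (np.sum((stimulus == 1) & (response == 1)) + 0.5) / (n_signal + 1)
    far = (np.sum((stimulus == 0) & (response == 1)) + 0.5) / (n_noise + 1)

    d_prime = norm.ppf(hr) - norm.ppf(far)
    criterion = -0.5 * (norm.ppf(hr) + norm.ppf(far))
    return d_prime, criterion
```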
The final nuisance variable is metacognitive bias, that is, the tendency of people to be biased towards the lower or upper ranges of the confidence scale27,28. This variable can be quantified simply as the average confidence across all trials. As with response bias, metacognitive bias is under strategic control in that participants can freely choose to use lower or higher confidence. As such, measures of metacognitive ability should ideally remain independent of metacognitive bias because we do not want to measure different ability if people purposefully choose to use predominantly low or high confidence ratings3. The logic here is similar to the logic in SDT, where the measure of performance (d') is designed to be mathematically independent from the measure of response bias (c)29. In the case of SDT, we interpret high d' values as showing high ability to perform the task even if the participant exhibits an extreme bias and, consequently, a low percent of correct responses. Similarly, this paper, following the standard in the field18, adopts the perspective that measures of metacognitive ability should be independent of metacognitive bias.
Task performance, response bias, and metacognitive bias are arguably the primary nuisance variables that a measure of metacognitive ability should be independent of (Supplementary Table 2). They are also variables that can be measured in any design that also allows the measurement of metacognitive ability. It is possible to add more variables to this list (e.g., reaction time30) but the current paper only examines these three variables.
The final critical property of measures of metacognition is that they should be reliable. This property is critical for studies of individual differences. This paper examines both split-half and test-retest reliability.
Having reviewed the desirable properties of measures of metacognition, let us now turn our attention to the existing measures of metacognitive ability. One popular measure is the area under the Type 2 ROC function31, also known as AUC2. Other popular measures are the Goodman–Kruskal Gamma coefficient (or just Gamma), which is essentially a rank correlation between trial-by-trial confidence and accuracy32, and the Pearson correlation between trial-by-trial confidence and accuracy (known as Phi33). Another simple but less frequently used measure is the difference between average confidence on correct trials and the average confidence on error trials (which I call ΔConf).
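To make these definitions concrete, here is a brief illustrative sketch (not the paper's code) computing ΔConf, Phi, and Gamma from trial-level confidence and accuracy vectors; the pairwise implementation of Gamma is a straightforward but unoptimized reading of its definition.

```python
import numpy as np
from scipy.stats import pearsonr

def delta_conf(confidence, accuracy):
    """ΔConf: mean confidence on correct trials minus mean confidence on error trials."""
    conf = np.asarray(confidence, dtype=float)
    acc = np.asarray(accuracy, dtype=bool)
    return conf[acc].mean() - conf[~acc].mean()

def phi(confidence, accuracy):
    """Phi: Pearson correlation between trial-by-trial confidence and accuracy."""
    return pearsonr(np.asarray(confidence, float), np.asarray(accuracy, float))[0]

def gamma(confidence, accuracy):
    """Goodman-Kruskal Gamma: (concordant - discordant) / (concordant + discordant)
    over all pairs of trials, ignoring ties. O(n^2); fine for an illustration."""
    conf = np.asarray(confidence, dtype=float)
    acc = np.asarray(accuracy, dtype=float)
    # A pair of trials is concordant if confidence and accuracy order the two
    # trials the same way, and discordant if they order them oppositely.
    dc = np.sign(conf[:, None] - conf[None, :])
    da = np.sign(acc[:, None] - acc[None, :])
    concordant = np.sum((dc * da) > 0)
    discordant = np.sum((dc * da) < 0)
    return (concordant - discordant) / (concordant + discordant)
```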
While all four of these traditional measures are intuitively appealing, they are all thought to be influenced by the primary task performance18. To address this issue, Maniscalco and Lau34 developed a new approach to measuring metacognitive ability where one can estimate the sensitivity, meta-d', exhibited by the confidence ratings. Because meta-d' is expressed in the units of d', Maniscalco and Lau then reasoned that meta-d' can be normalized by the observed d' to obtain either a ratio measure (M-Ratio, equal to meta-d'/d') or a difference measure (M-Diff, equal to meta-d' − d'). These measures are often assumed to be independent of task performance18.
The normalization introduced by Maniscalco and Lau34 has only been applied to the measure meta-d' (resulting in the measures M-Ratio and M-Diff), but there is no theoretical reason why a conceptually similar correction cannot be applied to the traditional measures above. Consequently, here I develop eight new measures where one of the traditional measures of metacognitive ability is turned into either a ratio (AUC2-Ratio, Gamma-Ratio, Phi-Ratio, and ΔConf-Ratio) or a difference (AUC2-Diff, Gamma-Diff, Phi-Diff, and ΔConf-Diff) measure. The logic is that a given measure (e.g., AUC2) is computed once using the observed data (obtaining, e.g., AUC2_observed) and a second time using the predictions of SDT given the observed sensitivity and decision criterion (obtaining, e.g., AUC2_expected). One can then take either the ratio or the difference between the observed and the SDT-predicted quantities.
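The sketch below illustrates the general structure of such a correction; it is not the paper's implementation. The SDT-expected value of a measure is approximated by simulating an ideal SDT observer with the participant's d' and criterion, and the observed value is then divided by (ratio) or reduced by (difference) this expected value. Approximating the expectation by simulation and placing evenly spaced, symmetric confidence criteria around the decision criterion are assumptions made only for illustration.

```python
import numpy as np

def sdt_expected_measure(measure_fn, d_prime, criterion, n_conf_levels=4,
                         n_sim=10_000, seed=None):
    """Approximate the value of a metacognitive measure expected under plain SDT
    (no metacognitive noise) for a given d' and criterion, via simulation.

    measure_fn(confidence, accuracy) -> scalar, e.g., phi or delta_conf above.
    Keep n_sim modest if measure_fn is quadratic in trial count (e.g., gamma).
    """
    rng = np.random.default_rng(seed)
    stim = rng.integers(0, 2, n_sim)                                   # 0/1 categories
    evidence = rng.normal(np.where(stim == 1, d_prime / 2, -d_prime / 2), 1.0)
    resp = (evidence > criterion).astype(int)
    accuracy = (resp == stim).astype(int)
    # Confidence: distance of the evidence from the decision criterion,
    # binned by evenly spaced confidence criteria (illustrative choice).
    spacings = np.linspace(0.5, 0.5 * (n_conf_levels - 1), n_conf_levels - 1)
    confidence = 1 + np.searchsorted(spacings, np.abs(evidence - criterion))
    return measure_fn(confidence, accuracy)

def ratio_corrected(observed_value, expected_value):
    return observed_value / expected_value

def diff_corrected(observed_value, expected_value):
    return observed_value - expected_value
```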
Finally, one important limitation of all measures above is that they are not derived from a process model of metacognition. In other words, none of these measures are based on an explicit model of how confidence judgments may be corrupted. Recently, Shekhar and Rahnev27 developed a process model of metacognition, the lognormal meta noise model, that is based on SDT assumptions but with the addition of lognormally distributed metacognitive noise. This metacognitive noise corrupts the confidence ratings but not the initial decision and, in the model, takes the form of confidence criteria that are sampled from a lognormal distribution rather than having constant values. The metacognitive noise parameter (σmeta, referred to here as meta-noise) can then be used as a measure of metacognitive ability. A similar approach was taken by Boundy-Singer et al.35, who developed another process model of metacognition, CASANDRE, based on the notion that people are uncertain about the uncertainty in their internal representations. The second-order uncertainty parameter (meta-uncertainty) thus represents another possible measure of metacognitive ability.
This paper examines the properties of all 17 measures of metacognition introduced above (for a summary, see Table 1). Before then, however, I briefly review the previous literature on the properties of these measures.
Given the importance of using measures with good psychometric properties, it is perhaps surprising that the published literature contains very little empirical investigation into the properties of the different measures of metacognition. For example, no paper to date has examined the precision of any existing measure. Several papers have relied exclusively on simulations to investigate some of the properties of measures of metacognition36,37. Such investigations are important but cannot substitute empirical studies because it is a priori unknown how well the process models used to simulate data reflect empirical reality. Evans and Azzopardi38 empirically showed that a specific measure of metacognition, Kunimoto's a39, exhibits a strong dependence on response bias. Because Kunimoto's a is built on wrong distributional assumptions40, it is not investigated here. Finally, several older papers investigated the theoretical properties of several measures independent of any simulations or empirical data32, but this approach cannot be used to establish the empirical properties of the measures under consideration.
Only recently, Shekhar and Rahnev27 examined the dependence on both task performance and metacognitive bias for five measures: meta-d', M-Ratio, AUC2, Phi, and meta-noise. They found that meta-d', AUC2, and Phi strongly depend on task performance, but M-Ratio and meta-noise do not. On the other hand, meta-d', M-Ratio, AUC2, and Phi have a complex dependence on metacognitive bias, while only meta-noise appeared independent of it. Guggenmos41 examined both the split-half reliability and the across-participant correlation between d' and several measures of metacognition (meta-d', M-Ratio, M-Diff, and AUC2), finding surprisingly low reliability and significant correlations with d' for all measures. Relatedly, Kopcanova et al.14 examined the test-retest reliability of M-Ratio and also found low reliability values. Another paper developed a new technique to examine dependence on metacognitive bias and found that meta-d' and M-Ratio are not independent of metacognitive bias28. Finally, Boundy-Singer et al.35 showed that meta-uncertainty appears to have high test-retest reliability, and only a weak dependence on task performance and metacognitive bias.
As this brief overview demonstrates, most previous investigations only focused on a few measures of metacognition, only examined a few of the critical properties of interest, and often did not make use of empirical data. Here, I empirically examine each of the critical properties for all 17 measures of metacognition introduced above. To do so, I make use of six large datasets27,42–46 (Table 2), all made available on the Confidence Database47. All datasets involve 2-choice tasks because most measures of metacognition only apply to 2-choice tasks.
Overall, I find that no current measure of metacognitive ability is "perfect" in the sense of possessing all desirable properties. Nevertheless, they are not equivalent either, with many important differences between measures emerging. Based on these results, I make recommendations for the use of different measures of metacognition based on the specific analysis goals.
Results
Here I assess the properties of 17 measures of metacognition. Specifically, I focus on each measure's (1) validity and precision, (2) dependence on nuisance variables, and (3) reliability. To examine each of these properties, I use six existing datasets (Table 2) from the Confidence Database. For each property, I analyze the data from between one and three of the six datasets. In addition, I compute precision and reliabilities using 50, 100, 200, or 400 trials at a time to clarify how these measures behave for different amounts of underlying data.
Validity and precision
Perhaps the most important requirement for any measure is that it is both valid and precise19–22,48. In other words, a measure should reflect the quantity it purports to measure, and it should do so with a high level of quantitative accuracy. However, despite the importance of both criteria, there has been no formal method to assess the validity or precision of measures of metacognition.
Here I develop a simple method for assessing both properties. The method selects a small proportion of trials and decreases confidence by 1 point for each correct trial and increases confidence by 1 point for each incorrect trial. This manipulation artificially decreases the informativeness of confidence ratings. A valid measure of metacognition should therefore show a drop when applied to these altered data. The size of the drop relative to the normal fluctuations of the measure quantifies the precision of the measure (i.e., if the drop is large relative to background fluctuations, this indicates that the measure has a high level of precision).
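A minimal sketch of this corruption procedure could look as follows; it is illustrative only, and details such as how trials are selected and how confidence is clipped at the scale boundaries are assumptions rather than the paper's exact implementation.

```python
import numpy as np

def corrupt_confidence(confidence, accuracy, proportion, conf_min, conf_max, seed=None):
    """Artificially reduce the informativeness of confidence ratings.

    For a randomly selected proportion of trials, confidence is decreased
    by 1 on correct trials and increased by 1 on incorrect trials,
    clipped to the limits of the confidence scale (clipping is an assumption).
    """
    rng = np.random.default_rng(seed)
    conf = np.asarray(confidence, dtype=float).copy()
    acc = np.asarray(accuracy, dtype=bool)
    n_alter = int(round(proportion * conf.size))
    idx = rng.choice(conf.size, size=n_alter, replace=False)
    conf[idx] += np.where(acc[idx], -1, +1)
    return np.clip(conf, conf_min, conf_max)
```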
To quantify the precision of existing measures of metacognition, one would ideally use a dataset with a very large number of trials coming from a single experimental condition because mixing conditions can strongly impact metacognitive scores49. Consequently, I selected the two datasets from the Confidence Database with the largest number of trials per participant that also had a single experimental condition: Haddara (3000 trials per participant) and Maniscalco (1000 trials per participant). In each case, I examined the results of altering 2, 4, and 6% of all trials and computed metacognitive scores using bins of 50, 100, 200, and 400 trials.
Table 1 | Measures of metacognition examined in the current paper
Measure | Calculation | Based on a process model
meta-d' | d' value that provides best fit to Type 2 ROC | No
AUC2 | Area under the Type 2 ROC curve | No
Gamma | Rank correlation between confidence and accuracy | No
Phi | Pearson correlation between confidence and accuracy | No
ΔConf | Difference between average confidence for correct and error trials | No
M-Ratio | meta-d' divided by d' | No
AUC2-Ratio | AUC2 divided by expected AUC2 under SDT assumptions | No
Gamma-Ratio | Gamma divided by expected Gamma under SDT assumptions | No
Phi-Ratio | Phi divided by expected Phi under SDT assumptions | No
ΔConf-Ratio | ΔConf divided by expected ΔConf under SDT assumptions | No
M-Diff | meta-d' minus d' | No
AUC2-Diff | AUC2 minus expected AUC2 under SDT assumptions | No
Gamma-Diff | Gamma minus expected Gamma under SDT assumptions | No
Phi-Diff | Phi minus expected Phi under SDT assumptions | No
ΔConf-Diff | ΔConf minus expected ΔConf under SDT assumptions | No
meta-noise | Metacognitive noise computed using the lognormal meta noise model | Yes
meta-uncertainty | Metacognitive uncertainty computed using the CASANDRE model | Yes

The results showed that all 17 measures are valid in that metacognitive scores decreased when confidence ratings were artificially corrupted (Fig. 1). The decrease in each measure was roughly a linear function of the percent of trials corrupted. For example, in the Haddara dataset, the values of meta-d' decreased from an average of 1.14 without any corruption to averages of 0.98, 0.84, and 0.72 when 2%, 4%, and 6% of trials were corrupted, respectively (for an average drop of about 0.14 for every 2% of trials corrupted). However, this drop is difficult to compare between measures because different measures are on different scales (e.g., meta-d' normally takes values between 0 and 1, whereas AUC2 normally takes values between 0.5 and 1). Therefore, to obtain values that are easy to interpret and compare, one can normalize the average drop after corruption by the standard deviation (SD) of the observed values across different subsets of trials in the absence of any corruption. Because the SD value is larger for smaller bin sizes (reflecting the larger noisiness of each measure when few trials are used), the results show that larger bin sizes lead to greater precision of the measures (Fig. 1a). Indeed, across the 17 measures, corrupting 2% of the trials led to an average decrease of 0.35, 0.50, 0.70, and 1.04 SDs in the measured metacognitive ability value for bins of 50, 100, 200, and 400 trials, respectively.
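As a sketch of this normalization step (illustrative, with the bin bookkeeping left to the caller), the drop caused by corruption can be expressed in SD units of the uncorrupted values:

```python
import numpy as np

def precision_in_sd_units(values_clean, values_corrupted):
    """Normalized precision: the average drop caused by corruption,
    expressed in units of the SD of the uncorrupted values across bins.

    values_clean, values_corrupted: arrays containing a metacognitive measure
    computed on matched bins of trials without / with corruption.
    """
    values_clean = np.asarray(values_clean, dtype=float)
    values_corrupted = np.asarray(values_corrupted, dtype=float)
    drop = np.mean(values_clean - values_corrupted)
    background_sd = np.std(values_clean, ddof=1)
    return drop / background_sd
```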
This technique allows us to compare the precision of different measures. To simplify the comparison, I averaged the decreases across the four different bin sizes and the three levels of corruption (2, 4, and 6%; Fig. 1b, c). These analyses revealed that the precision scores were overall higher in the Haddara compared to the Maniscalco datasets. This difference is likely due to differences in variables such as sensitivity and metacognitive bias that are likely to vary across datasets. Therefore, the technique introduced here is useful for comparing between different measures but is unlikely to be useful if one wants to compare values across different datasets.
More importantly, most measures of metacognition showed comparable levels of precision (Fig. 1b, c). The one exception was the measure meta-uncertainty, which had a substantially lower average precision score in both the Haddara (meta-uncertainty: 0.37; average of other measures: 0.67; ratio = 0.56) and the Maniscalco datasets (meta-uncertainty: 0.30; average of other measures: 0.53; ratio = 0.58). Indeed, pairwise comparisons showed that, without multiple comparison correction, the precision for meta-uncertainty was significantly lower than every one of the other 16 measures in both datasets (p < 0.05 for all 32 comparisons). In the Haddara dataset, 15 of the 16 comparisons remained significant even after applying a very conservative Bonferroni correction for the existence of 17 × 16 / 2 = 136 pairwise comparisons; in the smaller Maniscalco dataset, no comparison remained significant after this conservative correction. This difference between meta-uncertainty and the remaining measures may stem from the noisiness of the process of estimating meta-uncertainty in the presence of relatively few trials. In fact, the original authors who introduced meta-uncertainty already warned about the dangers of trying to compute this variable using low trial numbers35.
The differences between the remaining measures were much smaller and, in some cases, inconsistent across the two datasets. The differences between all other pairs of measures were never significant (at p < 0.05 uncorrected) in both the Haddara and Maniscalco datasets. Nevertheless, there appear to be some small but consistent differences between measures, such that meta-d', Gamma, Phi, Gamma-Diff, Phi-Diff, and meta-noise show above-average precision, whereas AUC2, ΔConf, and ΔConf-Diff show below-average precision (Fig. 1d). Overall, these analyses suggest that all measures of metacognition investigated here are valid, and that most have a comparable level of precision except for meta-uncertainty, which appears to be noisier than the remaining measures. Whether the differences between the remaining measures are meaningful remains to be demonstrated.
Dependence on nuisance variables
Beyond validity and precision, another important feature for good measures of metacognition is that they should not be influenced by nuisance variables. Here I examine three nuisance variables (task performance, metacognitive bias, and response bias) and test how much each of these variables affects each of the 17 measures of metacognition.
Dependence on task performance
The most widely recognized nuisance variable for measures of metacognition is task performance18. The reason that task performance is a nuisance variable is that an ideal measure of metacognition should not be affected by whether a participant happens to be given an easier or a more difficult task. That is, the participant's estimated ability to provide informative confidence ratings should not change based on the difficulty of the object-level task that they are asked to perform. As mentioned earlier, this logic does not apply well to task manipulations, which is why I only examine stimulus manipulations here.
Table 2 | Datasets used in the current paper
Dataset | Haddara | Locke | Maniscalco | Rouault1 | Rouault2 | Shekhar
# participants analyzed | 70 | 10 | 22 | 466 | 484 | 20
# excluded participants | 5 | 0 | 8 | 32 | 13 | 0
% excluded participants | 7% | 0% | 27% | 6% | 3% | 0%
# trials/participant | 3000 | 4900 | 1000 | 210 | 210 | 2800
# total trials in experiment | 210,000 | 49,000 | 22,000 | 97,860 | 101,640 | 56,000
# difficulty levels | 1 | 1 | 1 | 70 | staircase | 3
Criterion manipulated | — | ✓ | — | — | — | —
Original confidence scale | 4-point | 2-point | 4-point | 11-point | 6-point | Continuous
Analyses of each dataset:
Precision | ✓ | — | ✓ | — | — | —
Dependence on task performance | — | — | — | ✓ | ✓ | ✓
Dependence on metacognitive bias | ✓ | — | ✓ | — | — | ✓
Dependence on response bias | — | ✓ | — | — | — | —
Split-half reliability | ✓ | — | ✓ | — | — | ✓
Test-retest reliability | ✓ | — | — | — | — | —
Inter-measure correlations | ✓ | — | ✓ | — | — | ✓
The table lists details of each dataset and indicates which analyses in the present paper each dataset was used for.

To quantify how task performance affects measures of metacognition, one needs datasets with multiple difficulty conditions and a large number of trials (either because of including many participants or many trials per participant). I selected three datasets from the Confidence Database that meet these characteristics: Shekhar (3 difficulty levels, 20 participants, 2800 trials/sub, 56,000 total trials), Rouault1 (70 difficulty levels, 466 participants, 210 trials/sub, 97,860 total trials), and Rouault2 (many difficulty levels, 484 participants, 210 trials/sub, 101,640 total trials). Both Rouault datasets have a large range of difficulty levels, which I split into low/high by taking a median split. I then computed each measure separately for each difficulty level and compared them using t-tests.
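The sketch below illustrates one way such a comparison could be run for a single measure (paired two-sided t-test between the easiest and hardest condition, plus an effect size); the Cohen's d convention shown here (computed on the paired differences) is an assumption and may differ from the exact convention used in the paper.

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_difficulty_levels(scores_easy, scores_hard):
    """Compare a metacognitive measure between the easiest and hardest condition.

    scores_easy, scores_hard: one score per participant in each condition.
    Returns the paired t statistic, two-sided p value, and a Cohen's d
    computed on the paired differences (an illustrative convention).
    """
    easy = np.asarray(scores_easy, dtype=float)
    hard = np.asarray(scores_hard, dtype=float)
    t_stat, p_value = ttest_rel(easy, hard)
    diff = easy - hard
    cohens_d = diff.mean() / diff.std(ddof=1)
    return t_stat, p_value, cohens_d
```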
The results showed that all traditional measures that are not normalized in any way (i.e., meta-d', AUC2, Gamma, Phi, and ΔConf) are strongly dependent on task performance: they all substantially increase as the task becomes easier (p < 0.001 for all five measures and three datasets; Fig. 2a; Supplementary Tables 3–5; see Supplementary Fig. 2 for the same plots as a function of difficulty level instead of d' level). Critically, the increase across the five measures from the most difficult to the easiest had a very large effect size (Cohen's d = 2.47, 2.29, 2.95, 1.34, and 1.81 for each of the five measures after averaging across the datasets; Fig. 2b).

Fig. 1 | Validity and precision of each measure. Results of an artificial corruption of the confidence ratings where confidence for correct trials was decreased by 1, and confidence for incorrect trials was increased by 1. a Detailed results for the Haddara dataset (for detailed results on the Maniscalco dataset, see Supplementary Fig. 1). Each one of the 17 measures of metacognition showed a decrease with this manipulation. The plot shows the decrease in units of the standard deviation (SD) of the measure's fluctuations across different bins. The decrease was computed for bin sizes of 50, 100, 200, and 400 trials, as well as for 2, 4, and 6% of trials being corrupted. b Normalized precision for all 17 measures in each of the two datasets (Haddara and Maniscalco). The precision values are normalized such that the average precision level of the first 16 measures equals 1 in each of the two datasets. As can be seen, meta-uncertainty has a substantially lower level of precision than the rest of the measures. The differences between the remaining measures are not always trivial but tend to be smaller.
Having established that these five measures strongly depend on task performance, I then examined whether normalizing them removes this dependence. The more popular method of normalization, the ratio method, indeed performed well. The average effect size (Cohen's d) for M-Ratio, AUC2-Ratio, Gamma-Ratio, Phi-Ratio, and ΔConf-Ratio was 0.18, 0.39, 0.11, 0.17, and 0.23, respectively. These are small effect sizes, except for AUC2-Ratio which has a medium effect size. Nevertheless, it should be noted that the negative direction of the effect of task performance on metacognitive scores was consistent across all five measures and three datasets (with 9/15 tests being significant at p < 0.05; Supplementary Tables 3–5). Thus, while all ratio measures perform much better than the original metrics they are derived from, they tend to slightly overcorrect.

Fig. 2 | Dependence of estimated metacognitive scores on task performance. a Estimated metacognitive ability for all 17 measures, as well as d', criterion, and confidence for different difficulty levels in the Shekhar (n = 20), Rouault1 (n = 466), and Rouault2 (n = 484) datasets. Traditional measures of metacognition (top row) all showed a strong positive relationship with task performance, whereas all difference measures (third row) show a strong negative relationship. Ratio measures (second row) and the two model-based measures (meta-noise and meta-uncertainty) performed much better but still showed weak relationships with task performance. Error bars showing SEM are displayed on both the x and y axes. Statistical results are based on uncorrected two-sided t-tests comparing the highest to lowest difficulty level within each dataset for each measure (see Supplementary Tables 3–5 for complete results). ***, p < 0.001; **, p < 0.01; *, p < 0.05; ns, not significant. b Effect sizes for dependence on task performance. Effect size (Cohen's d) is plotted for each metric and dataset. As can be seen in the figure, non-normalized traditional measures (i.e., meta-d', AUC2, Gamma, Phi, and ΔConf) show a strong positive relationship with task performance. Corrections with the ratio and difference methods reverse this relationship, with the ratio correction being clearly superior. The model-based metrics meta-noise and meta-uncertainty perform well too, with meta-uncertainty showing particularly low effect sizes.
The five difference measures (M-Diff, AUC2-Diff, Gamma-Diff, Phi-Diff, and ΔConf-Diff) were much less effective in removing the dependence on task performance compared to their ratio counterparts. Indeed, they all exhibited an over-correction where easier conditions led to lower scores with medium average Cohen's d effect sizes (M-Diff: 0.58; AUC2-Diff: 0.49; Gamma-Diff: 0.39; Phi-Diff: 0.30; ΔConf-Diff: 0.55). Further, the relationship between task performance and the metacognitive score was significantly negative for all five measures and three datasets (p < 0.05 for all 15 tests; Supplementary Tables 3–5). These results demonstrate that the difference measures uniformly fail at their main purpose, which is to remove the dependence of metacognitive measures on task performance.
Finally, the two model-based measures (meta-noise and meta-uncertainty) showed relatively weak but still systematic relationships with task difficulty. Specifically, meta-noise decreased for easier conditions in all three datasets (average Cohen's d = 0.29), whereas meta-uncertainty increased for easier conditions in all three datasets (average Cohen's d = 0.06). Both effects were associated with relatively small Cohen's d effect sizes that were comparable to what was observed for the ratio measures. As such, both model-based measures perform as well as the ratio measures in controlling for task performance. Given that meta-uncertainty corrected in the opposite direction of the other viable measures (the ratio measures and meta-noise) and had the lowest absolute Cohen's d, studies that feature task performance confounds may benefit from performing analyses using both meta-uncertainty and at least one more measure.

Fig. 3 | Dependence of estimated metacognitive scores on metacognitive bias. a Estimated metacognitive ability for all 17 measures, as well as d', criterion, and confidence for data recoded to have lower or higher confidence in the Haddara (n = 70), Maniscalco (n = 22), and Shekhar (n = 20) datasets. Traditional measures of metacognition (top row) showed a medium-to-large positive relationship with metacognitive bias (except for Gamma, which showed a negative relationship). Ratio measures (second row) and the two model-based measures (meta-noise and meta-uncertainty) performed the best. Error bars show SEM. Statistical results are based on uncorrected two-sided t-tests comparing the high vs. low confidence recode within each dataset for each measure (see Supplementary Tables 6–8 for complete results). ***, p < 0.001; **, p < 0.01; *, p < 0.05; ns, not significant. b Effect sizes for dependence on metacognitive bias. Effect size (Cohen's d) is plotted for each metric and dataset. As can be seen in the figure, all metrics except for Gamma and meta-noise have a mostly positive relationship with metacognitive bias (i.e., higher confidence leads to higher estimates of metacognition). The smallest absolute effect sizes (under 0.15) occurred for AUC2-Ratio, Gamma-Ratio, AUC2-Diff, and Phi-Diff, but many other measures exhibited effect sizes in the small-to-medium range.
Dependence on metacognitive bias
A less appreciated nuisance variable is metacognitive bias: the tendency to give low or high confidence ratings for a given level of performance. Metacognitive bias can be measured simply as the average confidence in a condition. Recently, Shekhar and Rahnev27 developed a method that involves recoding the original confidence ratings to examine how measures of metacognition depend on metacognitive bias. The method was further improved by Xue et al.28. The Xue et al. method consists of recoding confidence ratings so as to artificially induce a metacognitive bias toward lower or higher confidence ratings. Comparing the obtained values for a given measure of metacognition applied to the recoded confidence ratings allows us to evaluate whether the measure is independent of metacognitive bias.
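The exact recoding scheme of Xue et al. is not described here, so the following toy sketch should not be read as their method; it only conveys the general idea of pushing ratings toward one end of a discrete confidence scale to induce a bias.

```python
import numpy as np

def recode_confidence(confidence, direction, conf_min=1, conf_max=4):
    """Illustrative recoding that pushes confidence ratings toward the lower
    or upper end of the scale to induce a metacognitive bias.

    direction: 'low' shifts ratings down by one step, 'high' shifts them up,
    clipping at the scale boundaries. This is NOT the Xue et al. scheme,
    only a toy version of the idea.
    """
    conf = np.asarray(confidence, dtype=float)
    shift = -1 if direction == 'low' else +1
    return np.clip(conf + shift, conf_min, conf_max)
```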
Similar to quantifying precision, quantifying how metacognitive bias affects measures of metacognition requires datasets with a very large number of trials coming from a single experimental condition. Consequently, I selected the same two datasets used to quantify precision since they have the largest number of trials per participant while also featuring a single experimental condition: Haddara (3000 trials per participant) and Maniscalco (1000 trials per participant). In addition, I also used the Shekhar dataset (3 difficulty levels, 2800 trials per participant) but analyzed each difficulty level in isolation and then averaged the results across the three difficulty levels. For that dataset, the continuous confidence scale was first binned into six levels as in the original publication27.

Fig. 4 | Dependence of estimated metacognitive scores on response bias. a Estimated metacognitive ability for all 17 measures, as well as d', criterion, and confidence for the seven conditions in the Locke (n = 10) dataset. As expected, the condition strongly affected the response criterion, c. Despite that, condition did not significantly modulate any of the 17 measures of metacognition. The seven conditions in the graph are arranged based on their average criterion values. Error bars show SEM. Statistical results are based on repeated measures ANOVAs testing for the effect of condition on each measure (see Supplementary Table 9 for complete results). ***, p < 0.001; ns, not significant. b Correlation with absolute response bias. Average correlation between estimated metacognitive ability and absolute response bias (i.e., |c|) for all 17 measures (n = 10). As can be seen from the figure, all relationships are relatively small, but there is still a fair amount of uncertainty around each value. Error bars show SEM.
The results demonstrated that meta-d', AUC2, Phi, and ΔConf tend to increase with higher average confidence, whereas Gamma tends to decrease (Fig. 3a). The average (across the three datasets) Cohen's d effect size was in the medium-to-large range for all five measures (meta-d': 0.44; AUC2: 0.51; Gamma: 0.61; Phi: 0.81; ΔConf: 0.54; Fig. 3b). In other words, all five non-normalized measures of metacognition depend on metacognitive bias. All five ratio measures had a positive relationship with metacognitive bias but with smaller Cohen's d effect sizes (M-Ratio: 0.27; AUC2-Ratio: 0.09; Gamma-Ratio: 0.001; Phi-Ratio: 0.23; ΔConf-Ratio: 0.42). Difference measures performed similarly to ratio measures (M-Diff: 0.43; AUC2-Diff: 0.10; Gamma-Diff: 0.24; Phi-Diff: 0.11; ΔConf-Diff: 0.34). Finally, the two model-based measures performed similarly to the ratio and difference measures and exhibited low-to-medium effect sizes that again went in opposite directions of each other (meta-noise: 0.21; meta-uncertainty: 0.27). Note that the scores after recoding were similar but slightly larger than the original metacognitive scores before recoding (Supplementary Fig. 3). Overall, researchers who want to control for metacognitive bias would appear to do best if they used AUC2-Ratio, Gamma-Ratio, AUC2-Diff, or Phi-Diff as these all featured absolute effect sizes under 0.15. Nevertheless, given that meta-noise corrected in the opposite direction of the ratio and difference measures, it may be advisable for results obtained using one of those metrics to be reproduced with meta-noise.
Dependence on response bias
The final nuisance variable examined here is response bias. Response bias can be measured simply as the decision criterion c in signal detection theory. To understand how response bias affects measures of metacognition, one needs datasets where the response criterion is experimentally manipulated and confidence ratings are simultaneously collected. Very few such datasets exist and only a single such dataset is featured in the Confidence Database. The dataset, named here Locke44, features seven conditions with manipulations of both prior and reward. Rewards were manipulated by changing the payoff for correctly choosing category 1 vs. category 2 (e.g., R = 4:2 means that 4 vs. 2 points were given for correctly identifying categories 1 and 2, respectively), whereas priors were manipulated by informing participants about the probability of category 2 (e.g., P = 0.75 means that there was 75% probability of presenting category 2 and 25% probability of presenting category 1). The seven conditions were as follows: (1) P = 0.5, R = 3:3, (2) P = 0.75, R = 3:3, (3) P = 0.25, R = 3:3, (4) P = 0.5, R = 4:2, (5) P = 0.5, R = 2:4, (6) P = 0.75, R = 2:4, and (7) P = 0.25, R = 4:2. The Locke dataset included many trials per condition (700) but relatively few participants (N = 10) and collected confidence on a 2-point scale.
The results suggested that none of the 17 measures of metacognition are strongly influenced by response bias (Fig. 4a). Indeed, while a repeated measures ANOVA revealed a very strong effect of condition on response criterion (F(6,54) = 12.18, p < 0.001, η²p = 0.58), it showed no significant effect of condition on any of the measures of metacognition (all ps > 0.13 for 17 tests; Supplementary Table 9). Critically, I computed the correlation between the estimated metacognitive ability for each of the 17 measures and the absolute value of the response criterion (i.e., |c|). The idea behind this analysis is to investigate whether more extreme response bias (either positive or negative) is associated with increases or decreases in estimated metacognitive ability. The results demonstrated that all correlation coefficients were very small (all r-values were between 0.04 and 0.21; Fig. 4b). There was a fair amount of uncertainty about these values, as seen by the wide error bars in Fig. 4b, so it is possible some of these relationships may be stronger than the current data suggest. Overall, these results should be interpreted with caution given the small sample size and the fact that a 2-point confidence scale may be noisier for estimating metacognitive scores. Nonetheless, these initial findings suggest that response bias may not have a large biasing effect on measures of metacognition.
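As an illustration of these two analyses (the repeated-measures ANOVA and the correlation with |c|), a minimal sketch might look as follows; the long-format data layout, the column names, and the within-subject averaging of the correlations are assumptions, not the paper's actual code.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

def response_bias_analyses(df):
    """df: long-format data with columns 'subject', 'condition',
    'measure' (a metacognitive score), and 'criterion' (SDT c),
    one row per subject x condition.
    """
    # Repeated-measures ANOVA: effect of condition on the metacognitive measure
    anova = AnovaRM(df, depvar='measure', subject='subject',
                    within=['condition']).fit()

    # Correlation between the measure and absolute response bias (|c|),
    # computed within each subject across conditions and then averaged
    rs = [np.corrcoef(g['measure'], g['criterion'].abs())[0, 1]
          for _, g in df.groupby('subject')]
    return anova, float(np.mean(rs))
```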
Reliability
Measures of metacognition are often used in studies of individual differences to examine across-participant correlations between metacognitive ability and many different factors such as brain activity and structure10,11,50, metacognitive ability in other domains51,52, psychiatric symptom dimensions46, cognitive processes such as confidence leak12, etc. These types of studies require measures of metacognition to have high reliability. (Note that the reliability of a measure is enhanced by both high precision and a large spread of scores across participants, so both of these two factors are important for between-subject analyses. In contrast, within-subject analyses only require high precision. Therefore, low reliability scores are not necessarily problematic for within-subject designs.)
Perhaps surprisingly, relatively little has been done to quantify the reliability of measures of metacognition (but see refs. 14,41). Here I examine split-half reliability (correlation between estimates obtained from odd vs. even trials) and test-retest reliability (correlation between estimates obtained on different days).
Split-half reliability
To examine split-half reliability for different sample sizes, one needs datasets with many trials per participant and a single condition (or a large number of trials per condition if multiple conditions are present). Consequently, I selected the same three datasets used to examine the dependence of measures of metacognition on metacognitive bias: Haddara (3000 trials per participant), Maniscalco (1000 trials per participant), and Shekhar (3 difficulty levels, 2800 trials per participant). As before, I analyzed each difficulty level in the Shekhar dataset in isolation and then averaged the results across the three difficulty levels. For each dataset, I computed each measure of metacognition based on odd and even trials separately and correlated the two. To examine how split-half reliability depends on sample size, I performed the procedure above for bins of 50, 100, 200, and 400 trials separately. Because the datasets contained multiple bins of each size, I averaged the results across all bins of a given size.
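A sketch of this split-half procedure for a single measure could look as follows; the exact binning and averaging details in the paper may differ, and the input layout is an assumption.

```python
import numpy as np

def split_half_reliability(trials_per_subject, measure_fn, bin_size):
    """Split-half reliability for one metacognitive measure.

    trials_per_subject: list of (confidence, accuracy) arrays, one pair per participant.
    measure_fn(confidence, accuracy) -> scalar metacognitive score.
    For each participant and each bin of `bin_size` consecutive trials, the measure
    is computed separately on odd and even trials; the two sets of scores are then
    correlated across participants and averaged over bins.
    """
    n_bins = min(len(conf) for conf, _ in trials_per_subject) // bin_size
    rs = []
    for b in range(n_bins):
        odd_scores, even_scores = [], []
        for conf, acc in trials_per_subject:
            conf_bin = np.asarray(conf)[b * bin_size:(b + 1) * bin_size]
            acc_bin = np.asarray(acc)[b * bin_size:(b + 1) * bin_size]
            odd_scores.append(measure_fn(conf_bin[1::2], acc_bin[1::2]))
            even_scores.append(measure_fn(conf_bin[0::2], acc_bin[0::2]))
        rs.append(np.corrcoef(odd_scores, even_scores)[0, 1])
    return float(np.mean(rs))
```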
The results showed that measures of metacognition have good split-half reliability as long as the measures are computed using at least 100 trials (Fig. 5). Indeed, bin sizes of 100 trials produced split-half correlations of r > 0.837 for all 17 measures when averaged across the three datasets, with an average split-half correlation of r = 0.861. These numbers increased further for bin sizes of 200 (all rs > 0.938, average r = 0.946) and 400 trials (all rs > 0.961, average r = 0.965). Further, these numbers were only a little lower than the split-half correlations for d' (100 trials: r = 0.913; 200 trials: r = 0.958; 400 trials: r = 0.970). However, the split-half correlations strongly diminished when the measures of metacognition were computed based on 50 trials, with an average r = 0.424 and no measure exceeding r = 0.6. It should be noted that, while performing better, d' also had a relatively low split-half reliability of r = 0.685 when computed based on 50 trials. These results suggest that individual difference studies should employ 100 trials per participant at a minimum and that there is little benefit in terms of split-half reliability for using more than 200 trials.

Fig. 5 | Split-half reliability of metacognitive scores. Correlations between each measure were computed based on odd vs. even trials for sample sizes of 50, 100, 200, and 400 trials. The figure shows that split-half correlations are high when at least 100 trials are used for computations but become unacceptably low when only 50 trials are used. The x-axis shows the results for three different datasets: Hadda (Haddara), Shekh (Shekhar), and Manis (Maniscalco).
Test-retest reliability
Split-half reliability is a useful measure of the intrinsic noise present in the across-subject correlations that can be expected in studies of individual differences. However, it does not account for fluctuations that could occur from day to day. These fluctuations can be examined by computing measures of metacognition obtained from different days, thus estimating what is known as test-retest reliability. Such estimation requires datasets with multiple days of testing and a large number of trials per participant per day. Only one dataset in the Confidence Database meets these criteria: Haddara (6 days; 3000 total trials per participant; 70 participants). I examined test-retest reliability by computing both the intraclass correlation (ICC) and the Pearson correlation between all pairs of days and then averaged across the different pairs.
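The paper does not specify which ICC variant was used; purely as an illustration, the sketch below computes a Pearson correlation alongside an ICC for one pair of days using the third-party pingouin package (an assumed dependency), with a two-way random-effects ICC chosen arbitrarily.

```python
import numpy as np
import pandas as pd
import pingouin as pg  # assumed third-party dependency for ICC

def test_retest_reliability(day1_scores, day2_scores):
    """Test-retest reliability between two days of metacognitive scores.

    Returns an ICC estimate and the Pearson correlation. The ICC variant
    reported here (ICC2) is an illustrative choice; the paper does not
    state which form was used.
    """
    n = len(day1_scores)
    scores = pd.DataFrame({
        'subject': np.tile(np.arange(n), 2),
        'day': np.repeat(['day1', 'day2'], n),
        'score': np.concatenate([day1_scores, day2_scores]),
    })
    icc_table = pg.intraclass_corr(data=scores, targets='subject',
                                   raters='day', ratings='score')
    icc2 = icc_table.loc[icc_table['Type'] == 'ICC2', 'ICC'].iloc[0]
    pearson_r = np.corrcoef(day1_scores, day2_scores)[0, 1]
    return float(icc2), float(pearson_r)
```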
The results showed very low test-retest reliability values (Fig. 6). Even with 400 trials used for estimation, no measure of metacognition exceeded an average ICC reliability of 0.75 and none of the measures outside of the five non-normalized and non-model-based measures (i.e., meta-d', AUC2, Gamma, Phi, and ΔConf) reached an ICC reliability of 0.5, which is often considered the threshold for poor reliability. For example, the widely used measure M-Ratio had an average ICC reliability of r = 0.16 (for 50 trials), 0.23 (for 100 trials), 0.29 (for 200 trials), and 0.42 (for 400 trials). The measure with the highest test-retest correlation was ΔConf with ICC reliability of 0.39 (for 50 trials), 0.53 (for 100 trials), 0.65 (for 200 trials), and 0.75 (for 400 trials). Notably, test-retest reliability was not much higher for d' or criterion c compared to ΔConf (average difference of about 0.1) and was only robustly high for confidence (above 0.86 regardless of sample size). Similar test-retest correlation coefficients were obtained when the Pearson correlation was computed instead of ICC (Fig. 6). These results are in line with the findings of Kopcanova et al.14 and suggest that correlations between measures of metacognition and measures that do not substantially fluctuate on a day-by-day basis (e.g., structural brain measures) are likely to be particularly noisy, such that very large sample sizes may be needed to find reliable results.
Across-subject correlations between different measures
Lastly, I examined how different measures are related to each other by performing across-subject correlations. Note that these analyses should be interpreted with extreme caution because the correlation between two measures could be driven by a third factor. For these analyses, I again used the Haddara (3000 trials per participant), Maniscalco (1000 trials per participant), and Shekhar (3 difficulty levels, 2800 trials per participant) datasets. As in previous analyses, I examined each difficulty level in the Shekhar dataset in isolation and then averaged the results across the three difficulty levels. For each dataset, I computed each measure of metacognition based on all trials in the experiment and examined the across-subject correlations between different measures.
Overall, the 17 measures of metacognition showed medium-sized
across-subject correlations with each other (average r= 0.49, 0.55, and
0.56 for the Haddara, Maniscalco, and Shekhar datasets, respectively;
Supplementary Fig. 4). These analyses seemed to reveal three groups
of measures. The rst group consists of the ve non-normalized
measures (meta-d,AUC2,Gamma,Phi,andΔConf), which exhibited
average inter-measures correlation of 0.60 (r= 0.60, 0.63, and 0.58 in
each dataset). The second group consists of the ve ratio and ve
difference measures, which exhibited average inter-measures corre-
lation of 0.63 (r= 0.62, 0.62, and 0.63 in each dataset). The average
correlation between the rst two groups of measures was slightly
weaker than the within-group correlations (r=0.51 on average;
Article https://doi.org/10.1038/s41467-025-56117-0
Nature Communications | (2025) 16:701 10
Content courtesy of Springer Nature, terms of use apply. Rights reserved
r= 0.42, 0.55, and 0.55 in each dataset). Note that these results could
be driven by the fact that all ve non-normalized measures are strongly
driven by d, thus increasing the correlations between them. It may also
be that the SDT-based normalization makes all ratios and difference
measures similar to each other.
Finally, the third group of measures consists of the two model-based measures, which showed the strongest divergence from the rest of the measures. Specifically, meta-noise had an average correlation of 0.35 with the remaining measures (r = 0.35, 0.34, and 0.37 in each dataset) and meta-uncertainty had an average correlation of 0.44 with the remaining measures (r = 0.33, 0.45, and 0.53 in each dataset). The measures meta-noise and meta-uncertainty had a very weak correlation with each other (r = 0.15, 0.03, and 0.06 in each dataset). These results suggest that the two model-based measures may capture unique variance related to metacognitive ability.
Discussion
Despite substantial interest in developing good measures of metacognition, there has been surprisingly little empirical work on the psychometric properties of current measures. Here I investigate the properties of 17 measures of metacognition, including eight new variants. I develop a method of determining the validity and precision of a measure of metacognition and examine each measure's dependence on nuisance variables and its split-half and test-retest reliability. The results paint a complex picture. No measure of metacognition is "perfect" in the sense of having the best psychometric properties across all criteria. Researchers therefore need to make informed decisions about which measures to use based on the empirical properties of the different measures. The results are summarized in Fig. 7.
Validity and precision
I found that all 17 measures of metacognition examined here are valid. With the exception of meta-uncertainty, all measures appear to have comparable levels of precision. This result is rather surprising and suggests that precision may be limited by measurement error, such that it is unlikely that any new measure of metacognition can substantially exceed the precision level found here for the first 16 measures. Nevertheless, new measures can be noisier, and it is therefore critical to demonstrate their level of precision. Note that less precise measures can also appear to depend less on nuisance factors, not because of better psychometric properties but because of their noisiness.
Dependence on task performance
Task performance is arguably the most important and best-appreciated nuisance variable for measures of metacognition. As has been previously suspected18, the results here show that all traditional measures of metacognition are strongly dependent on task performance. However, the ratio method does a very good job of correcting for this dependence, with M-Ratio, Gamma-Ratio, Phi-Ratio, and ΔConf-Ratio showing only weak dependence on task performance. On the other hand, the difference method performed poorly in removing the dependence on task performance. The model-based measures meta-noise and meta-uncertainty also performed well.
Dependence on metacognitive bias
Previous research has shown that meta-d' and M-Ratio are positively correlated with metacognitive bias, such that a bias toward higher confidence also leads to higher values for these measures27,28. The current investigation replicated these previous results and showed that similar effects are observed for many other measures. Nevertheless, the dependence was of low-to-medium effect size for M-Ratio and comparable to that of newer measures such as meta-noise and meta-uncertainty.
Dependence on response bias
The results for response bias should be considered preliminary because they are based on a single dataset consisting of only 10 participants. As such, the results should not be taken as strong evidence for an absence of dependence on response bias (hence, all measures are colored in yellow rather than green in Fig. 7). Yet, it does appear that any dependencies are unlikely to be particularly strong, at least for the range of response bias likely to occur in most experiments.

Fig. 6 | Test-retest reliability of metacognitive scores. Test-retest correlations in the Haddara dataset (6 days, 500 trials per day, 70 participants) show generally low test-retest reliability. The upper panel shows ICC values and the lower panel shows Pearson correlations, each for bin sizes of 50, 100, 200, and 400 trials. Test-retest reliability was low-to-moderate for the measures meta-d', AUC2, Gamma, Phi, and ΔConf and very low for the remaining measures.
Alternative ways of quantifying dependence on nuisance variables
I quantified the dependence on nuisance factors by examining effect sizes (Cohen's d and r-values). Alternative ways of examining the dependence on nuisance variables make it difficult to compare the measures. For example, the difference or ratio of the raw values across easy vs. difficult conditions is not readily comparable across metrics that take different ranges. The main limitation of the approach I adopted (examining effect sizes) is that noisier measures will have an advantage. In practice, the precision analysis found that 16 of the 17 measures examined here have a similar level of precision and thus do not substantially differ in their noisiness. Nevertheless, it is possible that the relatively low dependence of meta-uncertainty on nuisance variables is in part due to its lower precision (higher noisiness).
Split-half reliability
Guggenmos41 recently examined many datasets in the Confidence Database and concluded that the split-half reliability of M-Ratio is relatively poor (r ~ 0.7 for bin sizes between 400 and 600). (Note that the paper computes split-half reliability but calls it test-retest reliability.) One issue with the approach by Guggenmos is that many of the analyzed datasets in the Confidence Database feature a variety of conditions, manipulations, and sample sizes. These factors may reduce the observed split-half reliability. Indeed, focusing on a select number of large datasets with a single condition at a time, the current paper finds much higher split-half reliabilities (between 0.84 and 0.9 for a bin size of 100). These results suggest that for sample sizes of 100 trials or more, one can expect reliable estimates of metacognition for every measure when using a single experimental condition. It is likely that studies that mix different conditions and estimate metacognitive scores across all of them would produce lower split-half reliability, in line with the results of Guggenmos. Note that sample sizes of 50 trials produced unacceptably low reliabilities, so 100 should be considered a rough lower bound for the necessary number of trials when estimating metacognition in studies of individual differences.
Test-retest reliability
One of the most striking results here is the very low test-retest reliabilities observed. Besides the five non-normalized measures (meta-d', AUC2, Gamma, Phi, and ΔConf), no other measure showed test-retest reliability exceeding 0.5 even for sample sizes of 400 trials. However, the five non-normalized measures are strongly dependent on task performance, and thus their higher reliability may be partly (or wholly) due to the higher reliability of task performance itself (the test-retest reliability of d' was 0.84 for a sample size of 400 trials). Therefore, studies that match d' across all participants may result in test-retest reliability values for these five measures of metacognition that are as low as those of the remaining measures. Nevertheless, these results are based on a single dataset and should therefore be replicated before strong recommendations can be made. That said, the results are consistent with a recent paper that examined the test-retest reliability of M-Ratio in a sample of 25 participants14. Therefore, researchers who study individual differences in metacognition should be aware of the potentially low test-retest reliability of measures of metacognition, which may explain previous failures to find significant correlations between metacognitive abilities across domains.
Fig. 7 | Summary of results.

Measure | Precision | Dependence on task performance | Dependence on metacognitive bias | Dependence on response bias | Split-half reliability | Test-retest reliability | Unique limitations/advantages
meta-d' | Pr = .65 | d = 2.47 | d = 0.44 | r = -.04 | r = .89 | ICC = .71 |
AUC2 | Pr = .54 | d = 2.29 | d = 0.51 | r = .18 | r = .89 | ICC = .73 | Continuous
Gamma | Pr = .65 | d = 2.95 | d = -0.61 | r = .12 | r = .88 | ICC = .71 | Continuous
Phi | Pr = .61 | d = 1.34 | d = 0.81 | r = .11 | r = .87 | ICC = .63 | Continuous
ΔConf | Pr = .50 | d = 1.81 | d = 0.54 | r = .18 | r = .90 | ICC = .75 | Continuous
M-Ratio | Pr = .61 | d = -0.18 | d = 0.27 | r = .07 | r = .85 | ICC = .42 | Unstable for low d'
AUC2-Ratio | Pr = .60 | d = -0.39 | d = 0.09 | r = .13 | r = .85 | ICC = .36 |
Gamma-Ratio | Pr = .61 | d = -0.11 | d = 0.001 | r = .08 | r = .84 | ICC = .30 | Unstable for low d'
Phi-Ratio | Pr = .62 | d = -0.17 | d = 0.23 | r = .01 | r = .84 | ICC = .28 | Unstable for low d'
ΔConf-Ratio | Pr = .58 | d = -0.23 | d = 0.42 | r = .11 | r = .84 | ICC = .27 | Unstable for low d'
M-Diff | Pr = .56 | d = -0.58 | d = 0.43 | r = -.002 | r = .87 | ICC = .47 |
AUC2-Diff | Pr = .59 | d = -0.49 | d = 0.10 | r = .12 | r = .85 | ICC = .29 |
Gamma-Diff | Pr = .65 | d = -0.39 | d = 0.24 | r = .06 | r = .85 | ICC = .43 |
Phi-Diff | Pr = .62 | d = -0.30 | d = 0.11 | r = .001 | r = .85 | ICC = .35 |
ΔConf-Diff | Pr = .53 | d = -0.55 | d = 0.34 | r = .12 | r = .85 | ICC = .31 |
meta-noise | Pr = .63 | d = -0.29 | d = -0.21 | r = .03 | r = .84 | ICC = .29 | Cannot be negative
meta-uncertainty | Pr = .34 | d = 0.06 | d = 0.27 | r = .13 | r = .86 | ICC = .21 | Cannot be negative

The figure lists the values obtained for each measure of metacognition for various criteria. Precision is the measure developed in this paper, and the values listed are the average of the values in Fig. 1b, c. Higher precision values are better. For dependence on task performance and metacognitive bias, the figure lists the average Cohen's d values reported in the paper. For dependence on response bias, the figure lists the average correlation between each measure of metacognition and the absolute value of response bias (|c|). Lower absolute values of these dependencies are better. The reported split-half reliability is the average value across datasets obtained for a bin size of 100, whereas the reported test-retest reliability (ICC) is the average value obtained for a bin size of 400. Higher reliability values are better. Color coding is meant as a general indicator but should be interpreted with caution. Green indicates very good properties, yellow indicates good properties, orange indicates problematic properties, and red indicates bad properties. Colors were assigned based on the following thresholds: 0.5 for precision, 0.3 and 1 for Cohen's d, and 0.5 for test-retest reliability. Green was not used in any of the columns regarding dependence on nuisance variables so as not to give the impression that any measure is certainly independent of any of the nuisance variables. The figure also lists several unique advantages and disadvantages of each measure discussed in the main text.
Unique advantages and disadvantages of different measures
Several measures feature unique advantages and disadvantages (Fig. 7). For example, four of the ratio measures (M-Ratio, Gamma-Ratio, Phi-Ratio, and ΔConf-Ratio) become unstable in difficult conditions because they involve division by quantities (d', expected Gamma, expected Phi, and expected ΔConf, respectively) that are very close to 0 in such conditions. These measures should therefore be used preferentially when performance levels are relatively high (e.g., one should aim for d' values above 1, which roughly corresponds to accuracy above 69%).

An advantage of AUC2, Gamma, Phi, and ΔConf is that they all work well with continuous confidence scales. All other measures rely on SDT-based computations that require continuous scales to be binned before analysis. Such binning may lead to a loss of information, but it is currently unclear how much signal may be lost by different binning methods.
The two model-based measures, meta-noise and meta-uncertainty, have unique advantages and disadvantages. Their main advantage is that all their underlying assumptions are explicitly known. Conversely, other measures must necessarily include hidden assumptions that are difficult to reveal without linking them to a process model of metacognition3. Another unique advantage of these measures is that they can in principle be applied much more flexibly. For example, when an experiment contains several conditions, other measures do not allow the estimation of a single measure of metacognition, and simply ignoring the different conditions can lead to inflated scores49. Conversely, both meta-noise and meta-uncertainty allow different conditions to be modeled as part of their underlying process models, and thus a single metacognitive score can be computed in a principled way across many conditions. A possible disadvantage of both measures is that they can only take positive values and therefore cannot be used in situations where metacognition may contain more information than the decision itself, such as in the presence of additional information that arrives after the decision53,54.
Several measures showed dependence on nuisance variables that went in the opposite direction from most other measures (meta-uncertainty for task performance, as well as meta-noise and Gamma for metacognitive bias). As such, these measures may be especially useful when there is a concern that results may be driven by a specific nuisance variable. Unfortunately, it is currently difficult to determine why these measures show the opposite effects (or, for that matter, why most measures show the dependencies they show). Understanding the nature of these relationships will likely require further progress in developing well-fitting process models of metacognition55,56.
Is M-Ratio still the gold standard for measuring metacognition?
In the last decade, M-Ratio has become the dominant measure of metacognition due to its assumed better psychometric properties18,34,57. This status has naturally attracted greater scrutiny, and many recent papers have criticized some of the properties of M-Ratio27,28,37,41,58. However, while these criticisms are valid, papers have rarely tested how alternative measures perform on the same tests. The results here demonstrate that, across all examined dimensions, no measure clearly outperforms M-Ratio. Three measures (meta-noise, Gamma-Ratio, and Phi-Ratio) showed very similar performance to M-Ratio, while all other measures appear inferior to M-Ratio in at least one critical dimension: they strongly depend on task performance (all five non-normalized measures, all five difference measures, and AUC2-Ratio), have low precision (meta-uncertainty), or depend strongly on metacognitive bias (ΔConf-Ratio). I see no strong argument in the present data to choose either Gamma-Ratio or Phi-Ratio over M-Ratio, especially given how established M-Ratio is compared to Gamma-Ratio and Phi-Ratio. There are good arguments for using meta-noise in addition to M-Ratio as a way of controlling for metacognitive bias, given that the two measures depend on metacognitive bias in opposite directions. Similarly, meta-uncertainty can also be used in addition to M-Ratio or meta-noise to control for task performance, given that it depends on task performance in the opposite direction to the other two measures.

There are strong reasons for the field to transition to model-based measures of metacognition3, since model-based measures are uniquely positioned to properly capture the influence of metacognitive inefficiencies59. The measure meta-noise is especially promising given its good performance on the current tests and the fact that its associated model is a successful model of metacognition55. That said, meta-noise is currently only implemented in Matlab (see the code associated with the current paper) and is more computationally intensive. Thus, although meta-noise or other model-based measures of metacognition should eventually supplant M-Ratio, for the time being it is hard to justify abandoning M-Ratio as the gold standard for the field.
Limitations
The present work has several limitations. First, despite the attempt to be comprehensive, several measures of metacognition have been omitted, including recent model-based measures30,60, different variants of M-Ratio41, and legacy measures such as Kunimoto's a'38. Nevertheless, the current work should make it much easier for researchers to establish the properties of other measures of metacognition and compare them to the ones examined here. Second, while I have attempted to use multiple large datasets for each analysis, two of the analyses included only a single dataset (dependence on response bias and test-retest reliability) and should be interpreted with caution. Even in cases where multiple datasets were used, it is clear that adding more datasets would alter the values in Fig. 7. As such, the values there should be understood as rough estimates that are bound to be improved upon by future work analyzing additional large datasets. Third, all ratio and difference measures were computed using SDT with equal variance; computations assuming unequal variance may lead to different results. Fourth, the current analyses were conducted exclusively in the context of perception. Metacognition has been widely studied in the context of learning, memory, problem solving, etc1. While the results here are expected to generalize to these other domains, additional research is needed to confirm that. Fifth, most measures examined here only apply to 2-choice tasks and thus cannot be used for designs with estimation tasks, n-choice tasks, etc.
Recommendations
Based on the current set of results and findings from the greater literature, Table 3 lists recommendations for researchers interested in measuring metacognitive ability. The recommendations pertain to experimental design, analysis, and interpretation.

Researchers interested in measuring metacognition precisely need to pay special attention to experimental design. They should use relatively easy tasks (while still avoiding ceiling effects) because ratio measures become unstable for low d'. They should also ideally use a single difficulty level to avoid the inflation that arises when multiple difficulty levels are combined49. Finally, researchers need to ensure adequate sample sizes. I recommend at least 400 trials per participant for individual differences research and at least 100 trials per participant for within-subject studies.
At the level of analysis, I recommend using more than one measure whenever possible, especially if the results could plausibly depend on task performance or metacognitive bias. Difference measures should not be assumed to properly correct for task performance. In cases where performance is very low and ratio measures are unstable, the results should be confirmed by examining both difference and non-normalized measures (since these two categories have opposite dependencies on task performance). When multiple conditions are present, researchers should ideally use the model-based measures meta-noise or meta-uncertainty via custom modeling.
Finally, researchers should interpret findings of M-Ratio and other ratio measures being larger (or smaller) than 1 with caution. Traditionally, such findings have been interpreted as the metacognitive system having more (or less) signal than the decision-making system. However, many other factors can drive such results, such as the mixing of several difficulty levels49 or criterion (as opposed to signal) noise59, and some researchers have even questioned the separation of decision-making and metacognitive systems61.
Methods
Ethical regulations
The current study complies with all relevant ethical regulations. All analyses were performed on deidentified data from publicly available datasets and thus were exempt from Institutional Review Board review.
Datasets
To investigate the empirical properties of measures of metacognition, I used the datasets from the Confidence Database47 that are most appropriate for each individual analysis. This process resulted in the selection of six different datasets, briefly discussed below in alphabetical order. In each case, participants completed a 2-choice perceptual task and provided confidence ratings. For each dataset, I only considered trials from the main experiment and removed any staircase or practice trials that may have been included. In addition, I excluded participants who had lower than 60% or higher than 95% accuracy, or who gave the same object-level or confidence response on more than 85% of trials. These exclusions were made because such participants can have unstable metacognitive scores. Overall, these criteria led to the exclusion of 58 out of 1091 participants (5.32% exclusion rate). Data were collected in a lab setting unless otherwise indicated.
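For concreteness, these exclusion criteria can be expressed as in the following MATLAB sketch, which assumes hypothetical per-participant cell arrays acc, resp, and conf (the actual preprocessing is part of the released analysis code):

    % acc, resp, conf: cell arrays with one vector per participant;
    % responses and confidence ratings are assumed to be coded as positive integers
    for s = 1:numel(acc)
        accRate     = mean(acc{s});                                    % overall accuracy
        maxRespProp = max(accumarray(resp{s}(:), 1)) / numel(resp{s}); % most common response
        maxConfProp = max(accumarray(conf{s}(:), 1)) / numel(conf{s}); % most common rating
        exclude(s)  = accRate < 0.60 || accRate > 0.95 || ...
                      maxRespProp > 0.85 || maxConfProp > 0.85;
    end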
Haddara dataset
The first dataset is named "Haddara_2022_Expt2" in the Confidence Database (simplified to "Haddara" here) and consists of 75 participants each completing 3350 trials over seven days. Because Day 1 consisted of a smaller number of trials (350) compared to Days 2-7 (500 trials each), I only analyzed the data from Days 2-7 (3000 trials total). All experimental details can be found in the original publication43. Briefly, the task was to determine the more frequent letter in a 7 × 7 display of Xes and Os. Confidence was provided on a 4-point scale using a separate button press. The data collection was conducted online and half the participants received trial-by-trial feedback (all participants are considered jointly here). Five participants were excluded from this dataset (6.67% exclusion rate).
Locke dataset
The second dataset is named "Locke_2020" in the Confidence Database (simplified to "Locke" here) and consists of 10 participants each completing 4900 trials. All experimental details can be found in the original publication44. Briefly, the task was to determine whether a Gabor patch was tilted to the left or right of vertical. Confidence was provided on a 2-point scale using a separate button press. There were seven conditions with manipulations of both prior and reward. Rewards were manipulated by changing the payoff for correctly choosing category 1 vs. category 2 (e.g., R = 4:2 means that 4 vs. 2 points were given for correctly identifying categories 1 and 2, respectively), whereas priors were manipulated by informing participants about the probability of category 2 (e.g., P = 0.75 means that there was a 75% probability of presenting category 2 and a 25% probability of presenting category 1). The seven conditions were as follows: (1) P = 0.5, R = 3:3; (2) P = 0.75, R = 3:3; (3) P = 0.25, R = 3:3; (4) P = 0.5, R = 4:2; (5) P = 0.5, R = 2:4; (6) P = 0.75, R = 2:4; and (7) P = 0.25, R = 4:2. There was an equal number of trials (700) per condition. No participants were excluded from this dataset.
Maniscalco dataset
The third dataset is named "Maniscalco_2017_expt1" in the Confidence Database (simplified to "Maniscalco" here) and consists of 30 participants each completing 1000 trials. All experimental details can be found in the original publication45. Briefly, the task was to determine which of two patches presented to the left and right of fixation contained a grating. A single difficulty condition was used throughout. Confidence was provided on a 4-point scale using a separate button press. Eight participants were excluded from this dataset (26.67% exclusion rate).
Rouault1 and Rouault2 datasets
The fourth and fifth datasets are named "Rouault_2018_Expt1" and "Rouault_2018_Expt2" in the Confidence Database (simplified to "Rouault1" and "Rouault2" here). They consist of 498 and 497 participants, respectively, each completing 210 trials. All experimental details can be found in the original publication that describes both datasets46. Briefly, the task was to determine which of two squares presented to the left and right of fixation contained more dots and then to rate confidence using a separate button press. The Rouault1 dataset had 70 difficulty conditions (where the difference in dot number between the two squares varied from 1 to 70) with 3 trials each. It collected confidence on an 11-point scale going from 1 (certainly wrong) to 11 (certainly correct). However, because the first six confidence ratings were used very infrequently, I combined them into a single rating, thus transforming the 11-point scale into a 6-point scale. On the other hand, Rouault2 used a continuously running staircase that adaptively modulated the difference in dots. It collected confidence on a 6-point scale going from 1 (guessing) to 6 (certainly correct), which is equivalent to the modified scale from Rouault1 and thus did not require additional modification. Data collection for both studies was conducted online. Thirty-two participants were excluded from Rouault1 and 13 participants were excluded from Rouault2 (6.43% and 2.62% exclusion rates, respectively).
Table 3 | Recommendations for metacognition researchers

Experimental design
1. Use relatively easy tasks to avoid instability related to low d' values
2. Whenever possible, use designs with a single difficulty level
3. Collect at least 100 trials per participant
4. For individual differences research, ideally collect at least 400 trials per participant

Analysis
1. Use several measures of metacognition
2. M-Ratio continues to be a good default measure of metacognitive ability
3. If results could plausibly depend on task performance or metacognitive bias, then confirm that results remain the same when using meta-noise or meta-uncertainty
4. Do not use difference measures to correct for differences in task performance
5. If multiple conditions are present, use meta-noise or meta-uncertainty (custom modeling necessary)

Interpretation
1. Do not automatically assume that M-Ratio < 1 indicates signal loss from the decision to the metacognitive system
2. Do not automatically assume that M-Ratio > 1 indicates signal gain from the decision to the metacognitive system
Shekhar dataset
The final dataset is named "Shekhar_2021" in the Confidence Database (simplified to "Shekhar" here) and consists of 20 participants each completing 2800 trials. All experimental details can be found in the original publication27. Briefly, the task was to determine the orientation (left vs. right) of a Gabor patch presented at fixation. Participants indicated their confidence simultaneously with the perceptual decision using a single mouse click. Confidence was provided on a continuous scale (from 50 to 100) but was binned into six levels as in the original publication. The dataset featured three different difficulty levels (manipulated by changing the contrast of the Gabor patch), which were analyzed separately. No participants were excluded from this dataset.
Computation of each measure of metacognition
Previously proposed measures of metacognition. I computed a total of 17 measures of metacognition and provided Matlab code for their estimation (available at https://osf.io/y5w2d/). I first discuss nine of these measures that have been previously proposed: AUC2, Gamma, Phi, ΔConf, meta-d', M-Ratio, M-Diff, meta-noise, and meta-uncertainty.

The first four of these measures have the longest history. AUC2 was first proposed in the 1950s31 and measures the area under the Type 2 ROC function, which plots the Type 2 hit rate against the Type 2 false alarm rate. Gamma is perhaps the most popular measure in the memory literature and is the Goodman-Kruskal Gamma coefficient, which is essentially a rank correlation between trial-by-trial confidence and accuracy32. Phi is conceptually similar to Gamma but is the Pearson correlation between trial-by-trial confidence and accuracy33. Finally, ΔConf (my terminology) is the difference between the average confidence on correct trials and the average confidence on error trials. ΔConf is perhaps the simplest and most intuitive measure of metacognition but is used very infrequently in the literature.
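To make these definitions concrete, a minimal MATLAB sketch of Gamma, Phi, and ΔConf for trial-level data might look as follows (the variable names conf and acc are illustrative; this is not the released implementation, which is available at the OSF link above):

    % conf: n-by-1 vector of confidence ratings (e.g., 1 to 4)
    % acc:  n-by-1 vector of accuracy (1 = correct, 0 = error), stored as double
    n = numel(conf);
    concordant = 0; discordant = 0;
    for i = 1:n-1
        for j = i+1:n
            s = (conf(i) - conf(j)) * (acc(i) - acc(j));
            concordant = concordant + (s > 0);
            discordant = discordant + (s < 0);
        end
    end
    Gamma = (concordant - discordant) / (concordant + discordant); % Goodman-Kruskal Gamma
    Phi   = corr(conf, acc);                                       % Pearson correlation
    dConf = mean(conf(acc == 1)) - mean(conf(acc == 0));           % Delta-Conf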
The next three measures were developed by Maniscalco and Lau34. They devised a new approach to measuring metacognitive ability in which one estimates the sensitivity, meta-d', exhibited by the confidence ratings. Because meta-d' is expressed in the units of d', Maniscalco and Lau reasoned that meta-d' can be normalized by the observed d' to obtain either a ratio measure (M-Ratio, equal to meta-d'/d') or a difference measure (M-Diff, equal to meta-d' - d'). These measures are often assumed to be independent of task performance18, but empirical work on this issue is scarce (though see ref. 41).
Finally, recent years have seen a concerted effort to build measures of metacognition derived from explicit process models of metacognition. Two such measures examined here were developed by Shekhar and Rahnev27 and Boundy-Singer et al.35. Shekhar and Rahnev proposed the lognormal meta-noise model, which is an SDT model with the additional assumption of lognormally distributed metacognitive noise that affects the confidence criteria. The lognormal distribution was used because it avoids nonsensical situations where a confidence criterion moves to the other side of the decision criterion. The metacognitive noise parameter (σmeta, referred to here as meta-noise) can be used as a measure of metacognitive ability. Fitting the model to data is computationally expensive because it requires the computation of many double integrals that do not have closed-form solutions. Consequently, the fitting method from Shekhar and Rahnev27 takes substantially longer than the other measures examined here, making the measure less practical. To address this issue, I made substantial modifications to the original code, including many improvements in the efficiency of the algorithm and the creation of a lookup table so that values of the double integral do not need to be computed anew but can simply be loaded. These improvements reduce the computation of meta-noise from minutes to a few seconds, thus making the measure easy to use in practical applications. The measure developed by Boundy-Singer et al.35, meta-uncertainty, is based on a different process model of metacognition, CASANDRE, which implements the notion that people are uncertain about the uncertainty in their internal representations. Specifically, meta-uncertainty denotes the noise present in the estimation of the sensory noise. This second-order uncertainty parameter represents another possible measure of metacognition. The code for estimating meta-uncertainty was provided by Zoe Boundy-Singer.
New measures of metacognition. In addition to the already established measures mentioned above, I developed several new measures that conceptually follow the normalization procedure introduced by Maniscalco and Lau34. That normalization procedure has previously only been applied to the measure meta-d' (to create M-Ratio and M-Diff), but there is no theoretical reason why a conceptually similar correction cannot be applied to other traditional measures of metacognition. Consequently, here I develop eight new measures where one of the traditional measures of metacognitive ability is turned into either a ratio (AUC2-Ratio, Gamma-Ratio, Phi-Ratio, and ΔConf-Ratio) or a difference (AUC2-Diff, Gamma-Diff, Phi-Diff, and ΔConf-Diff) measure. The logic is to compute an observed and an expected value for any given measure (e.g., AUC2), and then use the expected value to normalize the observed value. First, a measure is computed using the observed data, thus producing what may be called, e.g., AUC2_observed. Critically, the measure is then computed again using the predictions of SDT given the observed sensitivity (d') and criteria, thus obtaining what may be called, e.g., AUC2_expected. One can then take either the ratio (e.g., AUC2_observed / AUC2_expected) or the difference (e.g., AUC2_observed - AUC2_expected) between the observed and the SDT-predicted quantities to create the new measures of metacognition.
I computed the SDT expectations in the following way. First, I estimated d' using the formula:

    d' = z(HR) - z(FAR)    (1)

where HR is the observed hit rate and FAR is the observed false alarm rate. Then, I estimated the location of all confidence and decision criteria using the formula:

    c_i = -[z(HR_i) + z(FAR_i)] / 2    (2)

In the formula above, i goes from -(k-1) to k-1 for confidence ratings collected on a k-point scale. Intuitively, one can think of the confidence ratings 1, 2, ..., k for category 1 being recoded to -1, -2, ..., -k, such that confidence goes from -k to k and simultaneously indicates the decision (negative confidence values indicating a decision for category 1; positive confidence values indicating a decision for category 2). HR_i and FAR_i are then simply the proportions of times this rescaled confidence is higher than or equal to i when category 2 and category 1 are presented, respectively.

Once the values of d' and c_i are computed, they can be used to generate predicted HR_i and FAR_i values (which will generally differ slightly from the empirically observed ones). The measures AUC2, Gamma, Phi, and ΔConf can then be straightforwardly computed based on the predicted HR_i and FAR_i values, thus enabling the computation of the new ratio and difference measures.
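The following MATLAB sketch illustrates this computation under equal-variance SDT. It assumes trial-level vectors stim and resp (coded 1 or 2) and conf (1 to k), uses one plausible set of cutpoints for the 2k-1 criteria, and ignores edge cases such as hit or false alarm rates of exactly 0 or 1 (which require padding in practice); the released code should be treated as authoritative.

    % Recode confidence to run from -k to k, with the sign indicating the decision
    k = max(conf);
    r = conf;
    r(resp == 1) = -conf(resp == 1);

    % Type 1 hit and false alarm rates (category 2 treated as the "signal")
    HR  = mean(resp(stim == 2) == 2);
    FAR = mean(resp(stim == 1) == 2);
    d_prime = norminv(HR) - norminv(FAR);                       % Equation (1)

    % Cumulative rates at the boundaries between adjacent rescaled ratings
    cuts  = [-(k-1):-1, 1:k];                                   % 2k-1 boundaries
    HR_i  = arrayfun(@(t) mean(r(stim == 2) >= t), cuts);
    FAR_i = arrayfun(@(t) mean(r(stim == 1) >= t), cuts);
    c_i   = -(norminv(HR_i) + norminv(FAR_i)) / 2;              % Equation (2)

    % SDT-predicted rates given d' and the criteria (equal-variance model)
    HR_i_pred  = normcdf(d_prime/2 - c_i);
    FAR_i_pred = normcdf(-d_prime/2 - c_i);
    % Recomputing AUC2, Gamma, Phi, or Delta-Conf from these predicted rates
    % gives the "expected" value used in the ratio and difference normalizations.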
Assessing validity and precision
Any measure of metacognition should be valid and precise19,22,48. However, there is no established method to assess either the validity or the precision of measures of metacognition. Here I developed a method to jointly assess validity and precision. The underlying idea is to artificially alter confidence to be less in line with accuracy and then assess how measures of metacognition change.
Specifically, the method corrupts confidence by decreasing confidence ratings for correct trials and increasing them for incorrect trials. For a given set of trials, the method loops over the trials starting from the first and (1) if the trial has a correct response and confidence higher than 1, it decreases the confidence on that trial by 1 point, and (2) if the trial has an incorrect response and confidence lower than the maximum (that is, k on a k-point scale), it increases the confidence on that trial by 1 point. If neither of these conditions applies, the trial is skipped. The method then continues to corrupt subsequent trials in the same manner until a pre-set proportion of corrupted trials is achieved. Then, all measures of metacognition are computed based on the corrupted confidence ratings. A given dataset is first split into n bins of a given trial number, and the procedure above is performed separately for each bin. Finally, to compute a measure of precision that can be compared across different measures of metacognition, I use the following formula:
    precision = (1/n) * Σ_{i=1}^{n} (measureOrig_i - measureCorrupted_i) / SD    (3)

where measureOrig_i and measureCorrupted_i are the values of a specific measure computed on the original (uncorrupted) and corrupted confidence ratings, respectively, n is the number of bins analyzed, and SD is the standard deviation of all measureOrig_i for i = 1, 2, ..., n. Positive values of the variable precision indicate valid measures of metacognition, and higher values indicate more precise measures (i.e., measures more sensitive to corruption in confidence relative to background fluctuations).
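A minimal MATLAB sketch of the corruption procedure and of Equation (3), assuming a generic measure handle measureFun and a hypothetical cell array binIdx of precomputed trial bins, might look like this (illustrative only; the released code implements the full procedure):

    % Precision for one measure (here Delta-Conf as an example) at 4% corruption
    measureFun = @(conf, acc) mean(conf(acc == 1)) - mean(conf(acc == 0));
    for b = 1:numel(binIdx)
        idx = binIdx{b};                      % trial indices belonging to bin b
        measureOrig(b)      = measureFun(conf(idx), acc(idx));
        confCorrupted       = corrupt_confidence(conf(idx), acc(idx), k, 0.04);
        measureCorrupted(b) = measureFun(confCorrupted, acc(idx));
    end
    precision = mean(measureOrig - measureCorrupted) / std(measureOrig);  % Equation (3)

    % Corrupt a target proportion of trials: lower confidence on correct trials,
    % raise it on error trials, skipping trials already at the scale boundary
    function conf = corrupt_confidence(conf, acc, k, prop)
        target = round(prop * numel(conf));
        nCorrupted = 0;
        for t = 1:numel(conf)
            if nCorrupted >= target, break; end
            if acc(t) == 1 && conf(t) > 1
                conf(t) = conf(t) - 1;  nCorrupted = nCorrupted + 1;
            elseif acc(t) == 0 && conf(t) < k
                conf(t) = conf(t) + 1;  nCorrupted = nCorrupted + 1;
            end                               % otherwise the trial is skipped
        end
    end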
I computed the precision of all 17 measures of metacognition for two datasets from the Confidence Database: Maniscalco (1 day; 1000 trials per participant) and Haddara (6 days; 3000 trials per participant). I separately examined the results of altering 2, 4, and 6% of all trials and computed metacognitive scores based on bins of 50, 100, 200, and 400 trials. I split the Maniscalco dataset into 20 bins of 50 trials, 10 bins of 100 trials, five bins of 200 trials, and two bins of 400 trials (by taking into consideration only the first 800 trials in this last case). I split the 500 trials from each of the six days in the Haddara dataset into 10 bins of 50 trials, five bins of 100 trials, two bins of 200 trials, and one bin of 400 trials (by taking into consideration only the first 400 trials for the 200- and 400-trial bins). Across the 6 days, this process resulted in 60 bins of 50 trials, 30 bins of 100 trials, 12 bins of 200 trials, and six bins of 400 trials.
Assessing dependence on task performance
To assess how task performance affects measures of metacognition, I
examined whether each measure of metacognition changed across
different difculty levels in the same experiment. Specically, I tested
whether each of the 17 measures of metacognition increases or
decreases for more difcult conditions. This process requires datasets
with (1) several difculty conditions and (2) a large number of trials.
Consequently, I selected datasets from the Condence Database that
meet these two criteria but do not include any other manipulations.
This resulted in the selection of three datasets: Shekhar (3 difculty
levels, 20 participants, 2800 trials/participant, 56,000 total trials),
Rouault1 (70 difculty levels, 466 participants, 210 trials/participant,
97,860 total trials), and Rouault2 (many difculty levels, 484 partici-
pants, 210 trials/participant, 101,640 total trials). Because the two
Rouault datasets included very few trials from each difculty level, I
instead used a median split to classify them in easy vs. difcult. To
perform statistical analyses and compute Cohensd, I conducted
t-tests comparing the lowest and highest difculty levels in each
dataset. To avoid outlier values, for each difculty level and each
measure of metacognition, I excluded any values that deviated by
more than 3*SD from the mean of that difculty level. Finally, as a
reference, I performed all the above analyses on the measures d,c,and
average condence.
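For a single measure, the easy-vs.-difficult comparison could be sketched as follows in MATLAB (mEasy and mHard are hypothetical per-participant vectors; Cohen's d is computed here as the mean paired difference divided by the standard deviation of the differences, which is one common convention and may differ in detail from the released analysis code):

    % Exclude values deviating by more than 3 SD from the mean of their difficulty level
    keep = abs(mEasy - mean(mEasy)) <= 3 * std(mEasy) & ...
           abs(mHard - mean(mHard)) <= 3 * std(mHard);

    diffs = mEasy(keep) - mHard(keep);
    [~, p] = ttest(mEasy(keep), mHard(keep));   % paired, two-sided t-test
    cohens_d = mean(diffs) / std(diffs);        % effect size of the difficulty effect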
Assessing dependence on metacognitive bias
To assess how metacognitive bias affects measures of metacognition, I applied the method developed by Xue et al.28. In this method, confidence ratings are recoded in two different ways so as to artificially induce a metacognitive bias toward lower or higher confidence ratings. Specifically, an n-point scale is transformed into an (n-1)-point scale in two ways. In the first recoding, the ratings from 2 to n are all decreased by one. In the second recoding, only the rating of n is decreased by one. When compared to each other, the first recoding results in a bias toward lower confidence relative to the second recoding (see mean confidence values in the bottom right of Fig. 3a). A measure of metacognition can then be computed for the newly obtained confidence ratings. Comparing the values obtained under the two recodings allows an assessment of whether each measure of metacognition is independent of metacognitive bias.
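In MATLAB, the two recodings can be sketched as follows for ratings conf on an n-point scale (illustrative variable names):

    % Recoding 1: ratings 2..n are all decreased by 1 (shifts confidence downward)
    confLow = conf;
    confLow(conf >= 2) = conf(conf >= 2) - 1;

    % Recoding 2: only the rating n is decreased by 1 (shifts confidence upward
    % relative to Recoding 1); both recodings yield an (n-1)-point scale
    confHigh = conf;
    confHigh(conf == n) = n - 1;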
This process would ideally be applied to datasets with (1) a single experimental condition and (2) a large number of trials. Consequently, I selected the same two datasets used to quantify precision: Haddara (3000 trials per participant) and Maniscalco (1000 trials per participant). In addition, I also used the Shekhar dataset (3 difficulty levels, 2800 trials per participant) but analyzed each difficulty level in isolation and then averaged the results across the three difficulty levels. The values of each measure of metacognition for the two recodings were compared using a paired t-test.
Assessing dependence on response bias
To assess how response bias affects measures of metacognition, I compared the values of each measure of metacognition across conditions that differed in their decision criterion. To do so, I analyzed the Locke dataset, the only dataset in the Confidence Database where the response criterion is experimentally manipulated. I computed each measure of metacognition for each of the seven conditions in that dataset and conducted repeated-measures ANOVAs to examine whether each measure of metacognition varied with condition. In addition, to estimate an effect size for the relationship between response bias and each measure of metacognition, I computed the correlation between the estimated metacognitive ability and the absolute value of the response bias (i.e., |c|).
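The effect-size computation can be sketched as follows, assuming participant-by-condition matrices of metacognitive scores and decision criteria (this pooling across participant-by-condition cells is one plausible reading; the exact computation is in the released code):

    % metaScore: participants x 7 matrix of a given measure of metacognition
    % c:         participants x 7 matrix of decision criteria
    r_bias = corr(metaScore(:), abs(c(:)));   % correlation with |c| across all cells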
Assessing split-half reliability
To assess split-half reliability, I examined the correlation between the values obtained for different measures of metacognition on odd vs. even trials41. As with assessing precision, I estimated split-half correlations for different sample sizes, so researchers can make informed decisions about the sample sizes needed in future studies. Specifically, I used bin sizes of 50, 100, 200, and 400 trials. Note that a bin size of k here means that 2k trials were examined, with both the odd and even trials having a sample size of k. These computations are best performed using datasets with (1) a single condition and (2) a large number of trials per participant. Consequently, I selected the same three datasets used to examine the dependence of measures of metacognition on metacognitive bias: Haddara (3000 trials per participant), Maniscalco (1000 trials per participant), and Shekhar (3 difficulty levels, 2800 trials per participant). As before, I analyzed each difficulty level in the Shekhar dataset in isolation and then averaged the results across the three difficulty levels. For a bin size of k, the computations were performed on as many non-overlapping bins of 2k trials as possible. The obtained r-values were then z-transformed, averaged, and the resulting average z value was transformed back to an r-value for reporting and plotting purposes.
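A sketch of the split-half computation for one bin, followed by the Fisher z-averaging across bins, is shown below (conf and acc are assumed to be participants-by-trials matrices, binTrials a hypothetical cell array of trial indices, and measureFun stands in for any of the 17 measures):

    for s = 1:nSubjects
        trials = binTrials{s};                      % the 2k trials of this bin
        oddT   = trials(1:2:end);
        evenT  = trials(2:2:end);
        mOdd(s)  = measureFun(conf(s, oddT),  acc(s, oddT));
        mEven(s) = measureFun(conf(s, evenT), acc(s, evenT));
    end
    rBin = corr(mOdd', mEven');                     % split-half correlation for this bin

    % After looping over all non-overlapping bins (r-values collected in rAll):
    rSplitHalf = tanh(mean(atanh(rAll)));           % z-transform, average, back-transform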
Assessing test-retest reliability
To assess test-retest reliability, I examined the intraclass correlation (ICC) coefficients between the values obtained for different measures of metacognition on different days. I report the two-way absolute agreement ICC, named A-1 (ref. 62), computed using the code provided by Salarian (https://www.mathworks.com/matlabcentral/fileexchange/22099-intraclass-correlation-coefficient-icc). For ease of comparison with the results of Guggenmos41, in addition to ICC, I also computed the Pearson correlation. As with split-half reliability, I estimated test-retest reliability for sample sizes of 50, 100, 200, and 400 trials. Because test-retest computations require data from multiple days and a large number of trials per participant per day, I selected the Haddara dataset, as it is the only dataset in the Confidence Database to meet these criteria. I computed test-retest correlations between all pairs of days for as many non-overlapping bins as possible. Note that, unlike for the split-half analyses, analyses with a bin size of k involved the selection of k trials from each day. As with the split-half analyses, the obtained correlation coefficients were then z-transformed, averaged, and the resulting average z value was transformed back to a correlation coefficient (ICC or r-value) for reporting and plotting purposes.
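The Pearson version of this computation can be sketched as follows for one bin size (confDay and accDay are hypothetical per-day data structures; the ICC version applies Salarian's ICC function to the same two columns of scores, with its exact call details omitted here):

    pairs = nchoosek(1:6, 2);                       % all pairs of the six days
    for p = 1:size(pairs, 1)
        for s = 1:nSubjects
            d1 = pairs(p, 1);  d2 = pairs(p, 2);
            m1(s) = measureFun(confDay{d1}(s, 1:binSize), accDay{d1}(s, 1:binSize));
            m2(s) = measureFun(confDay{d2}(s, 1:binSize), accDay{d2}(s, 1:binSize));
        end
        rPairs(p) = corr(m1', m2');
    end
    rTestRetest = tanh(mean(atanh(rPairs)));        % Fisher z-average across day pairs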
Statistical analyses and reporting
All conclusions in the paper are based on effect sizes (Cohen's d, r, and ICC values). However, for completeness, I sometimes refer to the results of null-hypothesis statistical tests. As is standard practice, in cases where I report the results of multiple tests together, I only include the p-values. All remaining information, such as test statistics and degrees of freedom, can be obtained from the provided analysis code. All p-values are based on two-tailed statistical tests. Analyses were performed using MATLAB 2024a (MathWorks).
Reporting summary
Further information on research design is available in the Nature
Portfolio Reporting Summary linked to this article.
Data availability
Raw data for all six experiments were obtained from the Confidence Database (https://osf.io/s46pr). The data come from the following publications: Haddara and Rahnev43, Locke et al.44, Maniscalco et al.45, Rouault et al.46, and Shekhar and Rahnev27. Processed data files are available at https://doi.org/10.17605/osf.io/y5w2d.
Code availability
Code for computing all 17 measures of metacognition, as well as data and analysis code for reproducing all statistical results and plotting all figures, are available at https://doi.org/10.17605/osf.io/y5w2d.
References
1. Metcalfe, J. & Shimamura, A. P. Metacognition: Knowing About Knowing (MIT Press, 1994).
2. Fleming, S. M. & Dolan, R. J. The neural basis of metacognitive ability. Philos. Trans. R. Soc. Lond. B Biol. Sci. 367, 1338–1349 (2012).
3. Rahnev, D. Visual metacognition: measures, models, and neural correlates. Am. Psychol. 76, 1445–1453 (2021).
4. Guggenmos, M., Wilbertz, G., Hebart, M. N. & Sterzer, P. Mesolimbic confidence signals guide perceptual learning in the absence of external feedback. eLife 5, e13388 (2016).
5. Desender, K., Boldt, A. & Yeung, N. Subjective confidence predicts information seeking in decision making. Psychol. Sci. 29, 761–778 (2018).
6. Pescetelli, N. & Yeung, N. The role of decision confidence in advice-taking and trust formation. J. Exp. Psychol. Gen. 150, 507–526 (2021).
7. Fleming, S. M. Know Thyself: The Science of Self-Awareness (Basic Books, 2021).
8. Weil, L. G. et al. The development of metacognitive ability in adolescence. Conscious. Cogn. 22, 264–271 (2013).
9. Kelemen, W. L., Frost, P. J. & Weaver, C. A. Individual differences in metacognition: evidence against a general metacognitive ability. Mem. Cognit. 28, 92–107 (2000).
10. Fleming, S. M., Weil, R. S., Nagy, Z., Dolan, R. J. & Rees, G. Relating introspective accuracy to individual differences in brain structure. Science 329, 1541–1543 (2010).
11. Allen, M. et al. Metacognitive ability correlates with hippocampal and prefrontal microstructure. NeuroImage 149, 415–423 (2017).
12. Rahnev, D., Koizumi, A., McCurdy, L. Y., D'Esposito, M. & Lau, H. Confidence leak in perceptual decision making. Psychol. Sci. 26, 1664–1680 (2015).
13. McCurdy, L. Y. et al. Anatomical coupling between distinct metacognitive systems for memory and visual perception. J. Neurosci. 33, 1897–1906 (2013).
14. Kopcanova, M., Ince, R. A. A. & Benwell, C. S. Y. Two distinct stimulus-locked EEG signatures reliably encode domain-general confidence during decision formation. bioRxiv 155 (2023) https://doi.org/10.1101/2023.04.21.537831.
15. Fitzgerald, L. M., Arvaneh, M. & Dockree, P. M. Domain-specific and domain-general processes underlying metacognitive judgments. Conscious. Cogn. 49, 264–277 (2017).
16. Rouault, M., McWilliams, A., Allen, M. G. & Fleming, S. M. Human metacognition across domains: insights from individual differences and neuroimaging. Personal. Neurosci. 1, e17 (2018).
17. Morales, J., Lau, H. & Fleming, S. M. Domain-general and domain-specific patterns of activity supporting metacognition in human prefrontal cortex. J. Neurosci. 38, 3534–3546 (2018).
18. Fleming, S. M. & Lau, H. C. How to measure metacognition. Front. Hum. Neurosci. 8 (2014).
19. Clark, L. A. & Watson, D. Constructing validity: new developments in creating objective measuring instruments. Psychol. Assess. 31, 1412–1427 (2019).
20. Nebe, S. et al. Enhancing precision in human neuroscience. eLife 12, e85980 (2023).
21. Cumming, G. The new statistics: why and how. Psychol. Sci. 25, 7–29 (2014).
22. Luck, S. J., Stewart, A. X., Simmons, A. M. & Rhemtulla, M. Standardized measurement error: a universal metric of data quality for averaged event-related potentials. Psychophysiology 58, e13793 (2021).
23. Konishi, M., Compain, C., Berberian, B., Sackur, J. & de Gardelle, V. Resilience of perceptual metacognition in a dual-task paradigm. Psychon. Bull. Rev. 27, 110 (2020).
24. Konishi, M., Berberian, B., de Gardelle, V. & Sackur, J. Multitasking costs on metacognition in a triple-task paradigm. Psychon. Bull. Rev. 28, 2075–2084 (2021).
25. Maniscalco, B. & Lau, H. Manipulation of working memory contents selectively impairs metacognitive sensitivity in a concurrent visual discrimination task. Neurosci. Conscious. 2015, niv002 (2015).
26. Rahnev, D. & Denison, R. N. Suboptimality in perceptual decision making. Behav. Brain Sci. 41, 166 (2018).
27. Shekhar, M. & Rahnev, D. The nature of metacognitive inefficiency in perceptual decision making. Psychol. Rev. 128, 45–70 (2021).
28. Xue, K., Shekhar, M. & Rahnev, D. Examining the robustness of the relationship between metacognitive efficiency and metacognitive bias. Conscious. Cogn. 95, 103196 (2021).
29. Green, D. M. & Swets, J. A. Signal Detection Theory and Psychophysics (John Wiley & Sons Ltd, 1966).
30. Desender, K., Vermeylen, L. & Verguts, T. Dynamic influences on static measures of metacognition. Nat. Commun. 13, 4208 (2022).
31. Clarke, F. R., Birdsall, T. G. & Tanner, W. P. Two types of ROC curves and definitions of parameters. J. Acoust. Soc. Am. 31, 629–630 (1959).
32. Nelson, T. O. A comparison of current measures of the accuracy of feeling-of-knowing predictions. Psychol. Bull. 95, 109–133 (1984).
33. Kornell, N., Son, L. K. & Terrace, H. S. Transfer of metacognitive skills and hint seeking in monkeys. Psychol. Sci. 18, 64–71 (2007).
34. Maniscalco, B. & Lau, H. A signal detection theoretic approach for estimating metacognitive sensitivity from confidence ratings. Conscious. Cogn. 21, 422–430 (2012).
35. Boundy-Singer, Z. M., Ziemba, C. M. & Goris, R. L. T. Confidence reflects a noisy decision reliability estimate. Nat. Hum. Behav. 7, 142–154 (2023).
36. Barrett, A. B., Dienes, Z. & Seth, A. K. Measures of metacognition on signal-detection theoretic models. Psychol. Methods 18, 535–552 (2013).
37. Rausch, M., Hellmann, S. & Zehetleitner, M. Measures of metacognitive efficiency across cognitive models of decision confidence. Psychol. Methods https://doi.org/10.1037/met0000634 (2023).
38. Evans, S. & Azzopardi, P. Evaluation of a 'bias-free' measure of awareness. Spat. Vis. 20, 61–77 (2007).
39. Kunimoto, C., Miller, J. & Pashler, H. Confidence and accuracy of near-threshold discrimination responses. Conscious. Cogn. 10, 294–340 (2001).
40. Galvin, S. J., Podd, J. V., Drga, V. & Whitmore, J. Type 2 tasks in the theory of signal detectability: discrimination between correct and incorrect decisions. Psychon. Bull. Rev. 10, 843–876 (2003).
41. Guggenmos, M. Measuring metacognitive performance: type 1 performance dependence and test-retest reliability. Neurosci. Conscious. 2021, niab040 (2021).
42. Adler, W. T. & Ma, W. J. Comparing Bayesian and non-Bayesian accounts of human confidence reports. PLOS Comput. Biol. 14, e1006572 (2018).
43. Haddara, N. & Rahnev, D. The impact of feedback on perceptual decision-making and metacognition: reduction in bias but no change in sensitivity. Psychol. Sci. 33, 259–275 (2022).
44. Locke, S. M., Gaffin-Cahn, E., Hosseinizaveh, N., Mamassian, P. & Landy, M. S. Priors and payoffs in confidence judgments. Atten. Percept. Psychophys. 82, 3158–3175 (2020).
45. Maniscalco, B., McCurdy, L. Y., Odegaard, B. & Lau, H. Limited cognitive resources explain a tradeoff between perceptual and metacognitive vigilance. J. Neurosci. 37, 22712213 (2017).
46. Rouault, M., Seow, T., Gillan, C. M. & Fleming, S. M. Psychiatric symptom dimensions are associated with dissociable shifts in metacognition but not task performance. Biol. Psychiatry 84, 443–451 (2018).
47. Rahnev, D. et al. The Confidence Database. Nat. Hum. Behav. 4, 317–325 (2020).
48. Mueller, R. O. & Knapp, T. R. Reliability and validity. In (eds Hancock, G. R., Stapleton, L. M. & Mueller, R. O.) The Reviewer's Guide to Quantitative Methods in the Social Sciences 5 (Routledge, 2018).
49. Rahnev, D. & Fleming, S. M. How experimental procedures influence estimates of metacognitive ability. Neurosci. Conscious. 2019, niz009 (2019).
50. Zheng, Y. et al. Diffusion property and functional connectivity of superior longitudinal fasciculus underpin human metacognition. Neuropsychologia 156, 107847 (2021).
51. Faivre, N., Filevich, E., Solovey, G., Kühn, S. & Blanke, O. Behavioral, modeling, and electrophysiological evidence for supramodality in human metacognition. J. Neurosci. 38, 263–277 (2018).
52. Mazancieux, A., Fleming, S. M., Souchay, C. & Moulin, C. J. A. Is there a G factor for metacognition? Correlations in retrospective metacognitive sensitivity across tasks. J. Exp. Psychol. Gen. 149, 1788–1799 (2020).
53. Fleming, S. M., van der Putten, E. J. & Daw, N. D. Neural mediators of changes of mind about perceptual decisions. Nat. Neurosci. 21, 617–624 (2018).
54. Elosegi, P., Rahnev, D. & Soto, D. Think twice: re-assessing confidence improves visual metacognition. Atten. Percept. Psychophys. 86, 373–380 (2024).
55. Shekhar, M. & Rahnev, D. How do humans give confidence? A comprehensive comparison of process models of perceptual metacognition. J. Exp. Psychol. Gen. 153, 656–688 (2024).
56. Hellmann, S., Zehetleitner, M. & Rausch, M. Simultaneous modeling of choice, confidence, and response time in visual perception. Psychol. Rev. 130, 1521–1543 (2023).
57. Maniscalco, B. & Lau, H. Signal detection theory analysis of type 1 and type 2 data: meta-d', response-specific meta-d', and the unequal variance SDT model. In (eds Fleming, S. M. & Frith, C. D.) The Cognitive Neuroscience of Metacognition 25–66 https://doi.org/10.1007/978-3-642-45190-4_3 (Springer Berlin Heidelberg, 2014).
58. Bang, J. W., Shekhar, M. & Rahnev, D. Sensory noise increases metacognitive efficiency. J. Exp. Psychol. Gen. 148, 437–452 (2019).
59. Shekhar, M. & Rahnev, D. Sources of metacognitive inefficiency. Trends Cogn. Sci. 25, 12–23 (2021).
60. Mamassian, P. & de Gardelle, V. Modeling perceptual confidence and the confidence forced-choice paradigm. Psychol. Rev. 129, 976–998 (2022).
61. Zheng, Y., Recht, S. & Rahnev, D. Common computations for metacognition and meta-metacognition. Neurosci. Conscious. 2023, niad023 (2023).
62. McGraw, K. O. & Wong, S. P. Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1, 30–46 (1996).
Acknowledgements
This work was supported by the National Institutes of Health (award R01MH119189) and the Office of Naval Research (award N00014-20-1-2622). The author thanks Zoe Boundy-Singer for sharing scripts for estimating meta-uncertainty.
Author contributions
D.R. is the sole author and performed all work related to this paper.
Competing interests
The author declares no competing interests.
Additional information
Supplementary information The online version contains
supplementary material available at
https://doi.org/10.1038/s41467-025-56117-0.
Correspondence and requests for materials should be addressed to
Dobromir Rahnev.
Peer review information Nature Communications thanks the anon-
ymous reviewers for their contribution to the peer review of this work. A
peer review le is available.
Reprints and permissions information is available at
http://www.nature.com/reprints
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2025