Title: Two distinct stimulus-locked EEG signatures reliably encode domain-
general confidence during decision formation
Abbreviated title: Reliability and domain-generality of metacognition.
Martina Kopčanová1, Robin A. A. Ince2, Christopher S. Y. Benwell1
1 Division of Psychology, School of Humanities, Social Sciences, and Law, University of Dundee,
Dundee, DD1 4HN, UK
2 School of Psychology and Neuroscience, University of Glasgow, Glasgow, G12 8QB, UK
Corresponding authors: Christopher S. Y. Benwell & Martina Kopčanová
Division of Psychology, School of Humanities, Social Sciences, and Law, University of Dundee,
Dundee, UK
c.benwell@dundee.ac.uk
m.kopcanova@dundee.ac.uk
Number of pages = 55
Number of figures = 7
Number of tables = 1
Words abstract = 250
Introduction = 689
Discussion = 1549
Conflict of Interest:
The authors declare no competing financial interests.
Acknowledgments:
C.S.Y.B. is supported by the British Academy/Leverhulme Trust and the United Kingdom
Department for Business, Energy and Industrial Strategy [SRG19/191169]. M.K. is supported by
the United Kingdom Economic & Social Research Council Scottish Graduate School of Social
Science [ES/P000681/1]. The authors thank Dewy Nijhof for help with PsychoPy.
bioRxiv preprint doi: https://doi.org/10.1101/2023.04.21.537831; this version posted April 22, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Abstract
Decision confidence, an internal estimate of how accurate our choices are, is essential for
metacognitive self-evaluation and guides behaviour. However, it can be suboptimal and hence
understanding the underlying neurocomputational mechanisms is crucial. To do so, it is essential
to establish the extent to which both behavioural and neurophysiological measures of
metacognition are reliable over time and shared across cognitive domains. The evidence
regarding domain-generality of metacognition has been mixed, while the test-retest reliability of
the most widely used metacognitive measures has not been reported. Here, in human
participants of both sexes, we examined behavioural and electroencephalographic (EEG)
measures of metacognition across two tasks that engage distinct cognitive domains – visual
perception and semantic memory. The test-retest reliability of all measures was additionally
tested across two experimental sessions. The results revealed a dissociation between
metacognitive bias and efficiency, whereby only metacognitive bias showed strong test-retest
reliability and domain-generality whilst metacognitive efficiency (measured by M-ratio) was
neither reliable nor domain-general. Hence, overall confidence calibration (i.e., metacognitive
bias) is a stable trait-like characteristic underpinned by domain-general mechanisms whilst
metacognitive efficiency may rely on more domain-specific computations. Additionally, we found
two distinct stimulus-locked EEG signatures related to the trial-by-trial fluctuations in confidence
ratings during decision formation. A late event-related potential was reliably linked to confidence
across cognitive domains, while evoked spectral power predicted confidence most reliably in the
semantic knowledge domain. Establishing the reliability and domain-generality of neural
predictors of confidence represents an important step in advancing our understanding of the
mechanisms underlying self-evaluation.
Significance Statement
Understanding the mechanisms underlying metacognition is essential for addressing deficits in
self-evaluation. Open questions exist regarding the domain-generality and reliability of both
behavioural and neural measures of metacognition. We show that metacognitive bias is reliable
across cognitive domains and time, whereas the most adopted measure of metacognitive
efficiency is domain-specific and shows poor test-retest reliability. Hence, more reliable
measures of metacognition, tailored to specific domains, are needed. We further show that
decision confidence is linked to two EEG signatures: late event-related potentials and evoked
alpha/beta spectral power. While the former predicts confidence in both perception and
semantic knowledge domains, the latter is only reliably linked to knowledge confidence. These
findings provide crucial insights into the computations underlying metacognition across
domains.
1. Introduction
Human decisions are accompanied by a sense of confidence in their accuracy which informs
metacognitive self-evaluation. However, studies have shown that confidence does not always
appropriately reflect accuracy, leading to distorted metacognition (Shekhar & Rahnev, 2021a,
2021b; Song et al., 2011). Indeed, confidence distortions have been suggested to contribute to
both sub-clinical (Benwell et al., 2022b; Rouault, Seow, et al., 2018) and clinical (Hoven et al.,
2019) psychopathology. Understanding the underlying neurocomputational mechanisms may
help us to understand why metacognition is often sub-optimal and facilitate development of
novel interventional targets.
An important open question concerns the extent to which confidence relies on domain-
general (versus domain-specific) mechanisms. A domain-general account posits that a shared
metacognitive resource is employed to evaluate performance across different cognitive domains
(de Gardelle & Mamassian, 2014). Conversely, domain-specific metacognition might rely on
computations that are unique to each task (Morales et al., 2018). Behavioural evidence to date
has been mixed, with some studies indicating domain-generality (including between sensory
modalities (Faivre et al., 2018); perception and memory (Mazancieux et al., 2020; McCurdy et al.,
2013; Samaha & Postle, 2017)), but others largely suggesting domain-specificity (Ais et al., 2016;
Arbuzova et al., 2022; Baird et al., 2013; Fitzgerald et al., 2017; Kelemen et al., 2000; see Rouault, McWilliams, et al., 2018 for review). These discrepancies may partly stem from variability in the
measures examined, with distinct latent processes underlying metacognitive judgements
(Fleming & Lau, 2014). For instance, metacognitive sensitivity indexes the degree to which
confidence ratings dissociate correct from incorrect decisions, whereas metacognitive bias
indexes the overall level of confidence reported (regardless of task accuracy). Metacognitive bias
shows stronger correlation across domains than sensitivity (Ais et al., 2016; Benwell et al., 2022b;
Mazancieux et al., 2020).
Nevertheless, domain-generality measured at the behavioural level does not rule out the
existence of distinct neural mechanisms. Indeed, functional magnetic resonance imaging studies
suggest that both domain-general and domain-specific confidence signals for perception and
memory coexist in the brain (Baird et al., 2013; McCurdy et al., 2013; Morales et al., 2018; Rouault
et al., 2023). Electroencephalography (EEG) allows for measurement of metacognitively relevant
neural activity with high temporal resolution. Previous studies have shown that the strength of
stimulus-locked EEG responses during decision formation, including the centro-parietal positivity (CPP)/P3 component, scales with the reported and/or implicit level of confidence (Azizi et al.,
2021; Feuerriegel et al., 2022; Fitzgerald et al., 2022; Gherman & Philiastides, 2015, 2018;
Herding et al., 2019; Lim et al., 2020; Rausch et al., 2020; Zakrzewski et al., 2019). Additionally,
fluctuations in confidence have also been shown to be dependent on both spontaneous and
response-locked levels of EEG activity in the alpha-band (8-12Hz) (Faivre et al., 2018; Samaha et
al., 2017; Wöstmann et al., 2019). However, these EEG signatures have rarely been examined
outside the domain of perception and hence the degree to which they relate to either domain-
specific or domain-general mechanisms is unknown.
In addition to debate regarding domain-generality, the test-retest reliability of both
behavioural and neurophysiological measures of metacognition also remains unclear. Though
computational models of behaviour, like those often employed to estimate metacognitive
performance (Fleming, 2017; Maniscalco & Lau, 2012), offer insights into latent processes,
increasing concerns have been raised about their reliability and potential to capture trait-like
characteristics (Brown et al., 2020; Hedge et al., 2018; Shahar et al., 2019). Likewise, the reliability
of neuroimaging-based biomarkers of cognitive function has also been called into question
(Botvinik-Nezer et al., 2019; Botvinik-Nezer & Wager, 2022; Haines et al., 2023; Pavlov et al.,
2021; Poldrack et al., 2017).
We examined the relationships between both pre- and post-stimulus EEG activity and
subjective confidence across separate tasks that engage distinct cognitive processes – visual
perception and semantic memory. Additionally, we investigated the test-retest reliabilities of
both behavioural and neural measures of metacognition across two experimental sessions. We
found that overall confidence calibration (i.e., metacognitive bias) was the most reliable measure
across cognitive domains and time, whereas metacognitive efficiency showed poor test-retest
reliability and low domain-generality. Fluctuations of confidence within-participants were
reliably captured by stimulus-locked event-related potential (ERP) activity across cognitive
domains, and less reliably captured by event-related spectral perturbation (ERSP) activity.
2. Materials and Methods
2.1. Participants
We recruited twenty-seven participants, of whom twenty-five completed two
experimental sessions on two separate days. Only the 25 who completed two sessions were
included in the analyses (15 females, 24 right-handed, 1 ambidextrous, age: M = 23.2, SD = 3.4,
range = 18-34). The mean number of days between testing sessions was 25 (range = 2-85).
Inclusion criteria were age between 18 and 40 years, normal or corrected-to-normal vision, and
no history of neurological or psychiatric disorders. Ten participants wore glasses or contact
lenses. On average, participants reported sleeping M = 7.21 (SD = 1.43) hours the night prior to
testing. All participants gave written informed consent in accordance with the Declaration of
Helsinki and were financially compensated for their time (£30 per recording session). The study
was approved by the School of Social Sciences Ethics Committee at the University of Dundee and
took place within the University’s Psychology department.
2.2. Tasks and Experimental design
Participants completed two separate 2-alternative forced-choice tasks, one perceptual
(Benwell et al., 2022b; Rouault, Seow, et al., 2018) and one semantic knowledge based (Benwell
et al., 2022b; Sanders et al., 2016), in a counterbalanced (blocked) order whilst 64-channel EEG
was simultaneously recorded. Each participant completed two experimental sessions on separate
days thereby allowing us to examine the test-retest reliability of both the behavioural and EEG
results. The order of the tasks was counterbalanced between the experimental sessions. The
perceptual task (PT) involved deciding which of two simultaneously presented boxes
contained a higher number of dots, while in the knowledge task (KT) participants were required
to choose which of two countries had a larger human population. Following each response,
confidence judgements were provided on a six-point rating scale (see Figure 1).
We presented all stimuli using PsychoPy software (Peirce et al., 2019) on a 53x30cm
monitor with 1920x1080 resolution and 60Hz refresh rate. Participants sat in a dimly lit room
approximately 70cm from the computer screen. Each trial started with a black fixation cross
(1.5x1cm) presented on a white background for a duration randomly varying between 3000-
3500ms followed by the experimental stimuli presented for 500ms. For the perceptual task,
stimuli consisted of two black boxes (8.5x9cm) containing white dots, one on the left and the
other on the right side of the screen (with 4.2cm between the boxes). One box (the reference
box) always contained 272 dots (out of 544 possible dot locations), while the other box contained between -40 and +40 dots relative to the reference box (in increments of 8, including an identical condition). For each
difficulty level, the position of the target box (left/right) was randomised. For the knowledge task,
stimuli consisted of two country names whose population ratios were varied across 5 difficulty
levels. We downloaded the national populations from The World Bank
(‘https://data.worldbank.org/indicator/SP.POP.TOTL’) in December 2019. We created five
different evidence discriminability ‘bins’ by grouping country pairs with similar population log
ratios (log10(Country-A Population/Country-B Population)). The log ratio bins amounted to the
following, ranging from least to most discriminable: bin 1 (log10 ratio = 0-0.225), bin 2 (0.225-
0.45), bin 3 (0.45-0.675), bin 4 (0.675-0.9), bin 5 (0.9-1.125). Each bin included 18 different
country pairs. The left/right position of names in each country pair was randomised across
participants. The stimuli were followed by another fixation cross that lasted 500ms. The inclusion
of this time-period prior to response ensured that stimulus induced EEG waveforms were not
contaminated by motor activity. Afterwards, participants were prompted to give their response
using the ‘w’ (left larger) and ‘e’ (right larger) keys. Next, they reported their decision confidence
by pressing one of six keys on the numeric pad (1-not confident at all, 6-certain). The response
and confidence prompts remained on the screen until participants responded. To familiarise
themselves with the tasks, participants completed ten practice trials prior to each. In total, each
experimental session consisted of 400 (knowledge) and 440 (perceptual) task trials, each split
into 5 blocks, with trial order fully randomised within blocks. This resulted in 80 trials per difficulty
level in each task, with forty additional catch trials included in the perceptual task. Overall, each
recording session lasted approximately 2.5-3 hours including EEG set-up. We randomised the
order of task performance across participants.
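The log-ratio binning rule for the knowledge task can be sketched as follows. This is a minimal Python illustration (stimuli in the study were generated with PsychoPy); the population figures below are rough illustrative values, not the December 2019 World Bank data used in the study.

```python
import math

# Hypothetical country populations (illustrative values only, not the
# 2019 World Bank figures used in the study).
populations = {
    "France": 67_000_000,
    "Germany": 83_000_000,
    "Spain": 47_000_000,
    "Netherlands": 17_300_000,
    "Austria": 8_900_000,
}

# Bin edges from the paper: five bins of width 0.225 spanning 0-1.125
# in absolute log10 population ratio (bin 1 = least discriminable).
BIN_EDGES = [0.0, 0.225, 0.45, 0.675, 0.9, 1.125]

def difficulty_bin(country_a: str, country_b: str) -> int:
    """Return the 1-indexed discriminability bin for a country pair."""
    ratio = abs(math.log10(populations[country_a] / populations[country_b]))
    for b in range(5):
        if BIN_EDGES[b] <= ratio < BIN_EDGES[b + 1]:
            return b + 1
    raise ValueError(f"log10 ratio {ratio:.3f} outside the binned range")
```

For example, France vs. Germany falls in bin 1 (near-identical populations, hardest), whereas Austria vs. Germany falls in bin 5 (easiest).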
Figure 1: Perceptual and knowledge tasks. Both experimental tasks started with a black fixation
cross presented on a white background for a randomly varying duration of 3000-3500ms.
Experimental stimuli followed, remaining on the screen for 500ms and consisting of pairs of
country names for the knowledge task and pairs of black boxes with varied number of white dots
for the perceptual task. Next, another fixation cross appeared for 500ms, followed by the
response prompt then the confidence rating scale, which remained on the screen until
participants responded.
2.3. Behavioural analysis
Model-free analyses
To evaluate the effectiveness of the task difficulty manipulation on each task, we
compared the mean accuracy and confidence ratings across all difficulty levels using one-way
repeated-measures Analyses of Variance (ANOVA), with difficulty as the independent variable
and proportion of correct responses/mean confidence as the dependent variables. Where
Mauchly’s Test of Sphericity indicated a violation of sphericity (p < .05), we applied the Greenhouse-Geisser correction (ε). We used eta squared (η²) as a measure of effect size.
Modelling metacognition
We modelled 1st-order decisions and subjective confidence ratings in both tasks within an
extended signal detection theory (SDT) framework (Maniscalco & Lau, 2012). All parameters
were modelled using data collapsed across all difficulty levels. We calculated both type-1
sensitivity (d’, indexing 1st-order decision accuracy) and type-2 (metacognitive) sensitivity (meta-
d’, equal to the value of d required to produce the observed confidence ratings data by an
optimal metacognitive observer with the same type-1 criterion) for each participant. A
metacognitively optimal observer should have d’ equal to meta-d’. We then calculated the
metacognitive ratio (meta-d’/d’) as a measure of metacognitive efficiency (estimate of
metacognitive sensitivity whilst controlling for 1st-order performance), whereby M-ratio = 1
suggests ideal efficiency, M-ratio < 1 suggests not all evidence available for type-1 decisions was
used to make type-2 decisions, and M-ratio > 1 suggests more evidence was available for type-2
than type-1 decisions (Fleming & Lau, 2014). A measure of confidence bias, the type-2
(confidence) criterion (type-2 c’), was also calculated to estimate participants’ tendency to give
high/low confidence ratings regardless of overall accuracy. To separate confidence bias from
type-1 response bias, we computed the absolute difference between the type-1 and type-2
criteria (Benwell et al., 2022b; Sherman et al., 2018). We averaged the type-2 criterion across N-
1 available confidence ratings and then across both possible responses (‘left’/’right’) for all
analyses. All of the measures were estimated using the Maximum Likelihood Estimation method
from Maniscalco and Lau (2012) (http://www.columbia.edu/~bsm2105/type2sdt/) using the
fit_meta_d_MLE function in MATLAB R2021a (Mathworks, USA).
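As a rough illustration of the type-1 SDT quantities underlying these measures, the sketch below computes d’ and the type-1 criterion from hit and false-alarm rates using Python’s standard library (the paper used the MATLAB toolbox above). meta-d’ itself requires the full MLE fit of Maniscalco and Lau (2012); the meta-d’ value below is simply assumed in order to illustrate the M-ratio.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # inverse cumulative normal (probit)

def type1_sdt(hit_rate: float, fa_rate: float):
    """Type-1 sensitivity d' and criterion c from hit/false-alarm rates."""
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion

# Hypothetical observer: 80% hits, 20% false alarms (unbiased).
d_prime, criterion = type1_sdt(0.80, 0.20)

# meta-d' must come from the MLE fit (fit_meta_d_MLE); the value here
# is assumed purely to illustrate the efficiency ratio.
meta_d = 1.2
m_ratio = meta_d / d_prime  # < 1: type-2 judgements use less evidence than type-1
```

In practice hit/false-alarm rates are defined over the two response alternatives of each 2AFC task, and extreme rates (0 or 1) require a correction before applying the probit transform.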
Test-retest reliability & domain-generality
To test the reliability of all behavioural measures, both model-free (accuracy, confidence) and
model-based, we calculated Intraclass correlation (ICC) coefficients both between experimental
sessions and between tasks. Two-way consistency ICC type was used (C-1, based on the McGraw
and Wong (1996) convention) and calculated with the ICC function in MATLAB (Salarian, 2023
(https://www.mathworks.com/matlabcentral/fileexchange/22099-intraclass-correlation-
coefficient-icc)). We also used paired-samples t-tests to test for between-task differences on
each day for each measure of interest.
2.4. Electroencephalography acquisition and pre-processing
Continuous EEG was recorded using a 64-channel ActiveTwo system (Biosemi, The
Netherlands) at a sampling rate of 1024Hz. The scalp electrodes were placed according to the
International 10-20 system. We placed four additional electrooculographic electrodes at the
outer canthi of each eye as well as above and below the participant’s right eye to capture
horizontal and vertical eye movements respectively.
We performed EEG data pre-processing offline using custom-written scripts in MATLAB
R2021a (Mathworks, USA) including EEGLAB (Delorme & Makeig, 2004) and Fieldtrip (Oostenveld
et al., 2011) functions. Low-pass (100Hz) and high-pass (0.5Hz) filters were applied to the data
using a zero-phase second-order Butterworth filter. We then divided the filtered recordings into
4 second epochs, from -2.5 to 1.5s relative to stimulus onset on each trial. We visually inspected
the data and removed faulty or excessively noisy channels without interpolation (Day 1: range = 0-3, M = 0.4 (KT), M = 0.44 (PT); Day 2: range = 0-3, M = 0.56 (KT), M = 0.36 (PT)). The recording was
subsequently re-referenced to the average of all channels (excluding the four non-scalp
electrodes) and excessively noisy trials were removed following a semi-automated artifact
identification procedure in which trials with potential artefacts were identified based on (1)
extreme amplitudes (threshold of ± 75 µV), (2) joint probability of the recorded activity across
electrodes at each time point (probability threshold limit of 3.5 and 3 standard deviations [SD]
for single-channel limit and global limit, respectively; pop_jointprob; Delorme & Makeig, 2004)
and (3) kurtosis (local limit of 5 SD, global limit of 3 SD; pop_rejkurt; Delorme & Makeig, 2004).
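Of the three criteria, the extreme-amplitude check is simple to sketch; the joint-probability and kurtosis criteria rely on EEGLAB routines (pop_jointprob, pop_rejkurt) and are not reproduced here. A hypothetical Python illustration of criterion (1):

```python
THRESHOLD_UV = 75.0  # extreme-amplitude criterion from the pipeline

def flag_extreme_trials(trials, threshold=THRESHOLD_UV):
    """Return indices of trials in which any sample on any channel
    exceeds +/- threshold microvolts. `trials` is a list of trials,
    each a channels x samples nested list."""
    bad = []
    for idx, trial in enumerate(trials):
        if any(abs(v) > threshold for channel in trial for v in channel):
            bad.append(idx)
    return bad

# Toy data: trial 1 contains an 80 uV blink-like deflection.
toy = [
    [[10.0, -12.5, 8.0]],   # trial 0: clean
    [[5.0, 80.0, -3.0]],    # trial 1: exceeds +/- 75 uV
]
```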
We then ran independent component analysis (ICA) using the ‘runica’ function in EEGLAB
(Delorme & Makeig, 2004) and components corresponding to blinks/eye-movements, muscle
activity, or transient channel noise were semi-automatically removed using the Multiple Artifact
Rejection Algorithm (MARA) (Winkler et al., 2011). Next, we removed any remaining noisy epochs; this resulted in a mean of 424 (range = 401-437) perceptual and 386 (range = 369-398)
knowledge trials per participant (across both testing sessions). We excluded the rejected trials
from all behavioural analyses too. Finally, previously removed channels were interpolated using
a spherical spline method.
For ERP analyses only, we applied an additional low-pass (40Hz) filter to the clean time-
series data. The data were then cut into epochs spanning -500:1000ms relative to stimulus onset
on each trial and baseline corrected using the 500ms pre-stimulus period.
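Baseline correction amounts to subtracting each channel’s mean pre-stimulus voltage from the whole epoch. A minimal Python sketch with a hypothetical one-channel epoch (the pipeline itself ran in MATLAB):

```python
def baseline_correct(epoch, n_baseline):
    """Subtract each channel's mean over the first `n_baseline` samples
    (the pre-stimulus window) from the entire epoch."""
    corrected = []
    for channel in epoch:
        base = sum(channel[:n_baseline]) / n_baseline
        corrected.append([v - base for v in channel])
    return corrected

# One channel: 4 pre-stimulus samples, then post-stimulus activity.
epoch = [[2.0, 2.0, 2.0, 2.0, 5.0, 7.0]]
```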
Time-frequency spectral power analysis
To estimate spectral power across time and frequency domains, we performed a Fourier-
based time-frequency transformation on the clean single-trial data for each channel using the
‘ft_freqanalysis’ function (method: ‘mtmconvol’) in the Fieldtrip toolbox (Oostenveld et al., 2011). Overlapping sections of single-trial time-series data were Hanning tapered and decomposed by consecutively shifting a 0.5-second-long window forward in time in 0.02s steps. A frequency resolution of 1Hz across the 1-40Hz range was achieved by zero-padding the data to a length of 1
second. The absolute values of the resulting complex-valued time-frequency estimates were then
squared to obtain single-trial spectral power and converted into decibels (dB). We then used data
epochs spanning -1:1s relative to stimulus onset in the statistical analyses.
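The windowing scheme can be sketched for a single time point and channel. This pure-Python illustration (the analysis used Fieldtrip’s ft_freqanalysis) evaluates a Hanning-tapered DFT of a 0.5s window at integer frequencies of the 1s padded length, which reproduces the 1Hz grid; the sampling rate and signal are hypothetical.

```python
import cmath
import math

def hann(n):
    """Hanning taper of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def windowed_power_db(signal, fs, t_center, freqs, win_dur=0.5, pad_dur=1.0):
    """Spectral power (dB) of one Hanning-tapered window centred at
    t_center seconds. Evaluating the DFT at integer frequencies of the
    padded length is equivalent to zero-padding to `pad_dur` seconds."""
    n_win = int(win_dur * fs)
    start = int(t_center * fs) - n_win // 2
    seg = [s * w for s, w in zip(signal[start:start + n_win], hann(n_win))]
    power_db = {}
    for f in freqs:
        x = sum(s * cmath.exp(-2j * cmath.pi * f * n / (pad_dur * fs))
                for n, s in enumerate(seg))
        power_db[f] = 10 * math.log10(abs(x) ** 2)
    return power_db

# A 2-second 10Hz sinusoid sampled at a hypothetical 256Hz: power
# should peak at 10Hz within the 1-40Hz range.
fs = 256
sig = [math.sin(2 * math.pi * 10 * t / fs) for t in range(2 * fs)]
power_db = windowed_power_db(sig, fs, t_center=1.0, freqs=range(1, 41))
peak_freq = max(power_db, key=power_db.get)
```

In the full analysis this window slides forward in 0.02s steps across every trial and channel, yielding the time-frequency-electrode representation used below.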
2.5. Statistical analyses
We carried out all statistical analyses in MATLAB R2021a and 2022a (Mathworks, USA)
using custom-written code.
Single-trial regression analysis
To elucidate both the ERP and pre- and post-stimulus time-frequency spectral power
correlates of subjective confidence ratings, we adopted a non-parametric single-trial regression
approach (Benwell et al., 2017, 2022a; Samaha et al., 2017). The large number of task trials per
participant allowed us to test for systematic EEG-behaviour relationships in a hierarchical two-
stage estimation model (Friston, 2008), thus incorporating subject-level variation into the group-
level statistics. Prior to the regression analyses, we rank transformed all variables of interest
apart from accuracy to nullify the influence of any outlying trials.
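The rank transformation (cf. MATLAB’s tiedrank, used here only as an assumed analogue) replaces each trial’s value with its rank, averaging ranks over ties, so that outlying trials cannot dominate the regression. A minimal pure-Python sketch:

```python
def rank_transform(values):
    """Replace each value with its 1-based rank, assigning average
    ranks to ties; extreme values become ordinary extreme ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1..j+1
        for k in order[i:j + 1]:
            ranks[k] = avg
        i = j + 1
    return ranks
```

An extreme trial value (e.g., 1000 among values near 3) simply becomes the highest rank, rather than leveraging the regression fit.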
First, we tested within-participant relationships between spectral power and confidence
ratings with separate regression models for each univariate response (i.e., each time-electrode point (for ERPs) or time-frequency-electrode point (for time-frequency power)). The single-trial EEG power (EEG_i), accuracy (Accuracy_i), and difficulty level (Difficulty_i) were entered as predictors of the behavioural (Confidence_i) outcome. This allowed us to test the relationships between EEG activity and confidence whilst controlling for external evidence strength and 1st-order accuracy.
We obtained the regression coefficients for the EEG-confidence relationship from the following multiple linear regression model:
Confidence_i = β0 + β1·EEG_i + β2·Accuracy_i + β3·Difficulty_i + ε_i
where β1 represents the strength and direction of the unique relationship between EEG activity and confidence. The linear regression models were applied using the ‘fitlm’ function with the least-squares criterion (Mathworks, USA).
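The per-datapoint model can be sketched with ordinary least squares on toy single-trial data. This Python illustration (the paper used MATLAB’s ‘fitlm’) solves the normal equations directly; all trial values are hypothetical and constructed so the true coefficients are recovered exactly.

```python
def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting (small systems)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols_betas(y, predictors):
    """Least-squares coefficients for y ~ 1 + predictors
    (returned order: intercept, then one beta per predictor)."""
    x = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)]
           for i in range(k)]
    xty = [sum(x[r][i] * y[r] for r in range(len(y))) for i in range(k)]
    return solve(xtx, xty)

# Toy single-trial data: confidence driven by EEG power while accuracy
# and difficulty are controlled for (all values hypothetical).
eeg        = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
accuracy   = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
difficulty = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
confidence = [2 * e + 0.5 * a - 1 * d + 3
              for e, a, d in zip(eeg, accuracy, difficulty)]
betas = ols_betas(confidence, [eeg, accuracy, difficulty])
# betas recovers [intercept, beta_EEG, beta_accuracy, beta_difficulty]
```

In the actual analysis this fit is repeated independently at every time-electrode (ERP) or time-frequency-electrode (power) point, and only the EEG coefficient is carried forward to the group level.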
Second, we computed group-level statistics with cluster-based permutation testing
(Maris & Oostenveld, 2007) using the ‘ft_freqstatistics’ function in Fieldtrip (method: ‘montecarlo’; Oostenveld et al., 2011). Regression coefficients (β) at each data point (time-
frequency-electrode) were combined across all participants, and a two-tailed dependent-
samples t-test against zero was used to test for systematic group-level effects. We used cluster-
based permutation testing to control the family-wise error rate (Maris & Oostenveld, 2007). First,
clusters were formed by combining all adjacent significant time-frequency-electrode datapoints
based on the initial t-tests, separately for negative and positive values, and summing the t-values
to produce a cluster-level statistic. A minimum of 1 significant neighbouring sample was required
to form a cluster. Electrodes were considered as neighbours based on the ‘biosemi64-neighb.mat’ template in Fieldtrip (Oostenveld et al., 2011), created through symmetric
triangulation and manual editing, and leading to a mean of 6.6 (range=3-8) neighbours per
channel. The cluster t-statistics were then compared against a data-driven null hypothesis
distribution. This was obtained by randomly drawing coefficients from a subset of participants,
multiplying them by -1, thereby abolishing the hypothesised brain-behaviour relationship, and
forming clusters based on significant t-tests against 0 as described above. We repeated the whole
procedure 2000 times while saving the largest cluster t-statistic on each iteration, thus building
a distribution of t-values that would be expected in the absence of an EEG-behaviour relationship.
Consequently, the t-statistics of the true positive and/or negative clusters falling within either
the lower or higher 2.5% of the null distribution (alpha = .05, two-tailed) were considered
significant.
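The sign-flipping permutation scheme can be sketched in simplified form (positive clusters only, spatial dimensions collapsed into one time axis; the simulated coefficients and all names below are ours, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sub, n_time = 25, 100
# Simulated regression coefficients: a genuine effect at timepoints 40-59
coefs = rng.normal(0, 1, (n_sub, n_time))
coefs[:, 40:60] += 1.0

def tvals(x):
    """One-sample t-values against zero, computed across participants (axis 0)."""
    return x.mean(0) / (x.std(0, ddof=1) / np.sqrt(x.shape[0]))

def max_cluster(t, thresh=2.064):  # two-tailed .05 critical t for df = 24
    """Largest sum of t-values over runs of adjacent supra-threshold points."""
    best = cur = 0.0
    for v in t:
        cur = cur + v if v > thresh else 0.0
        best = max(best, cur)
    return best

observed = max_cluster(tvals(coefs))

# Null distribution: randomly flip the sign of each participant's coefficient map
null = []
for _ in range(500):
    flips = rng.choice([-1.0, 1.0], size=(n_sub, 1))
    null.append(max_cluster(tvals(coefs * flips)))
p_value = (np.sum(np.array(null) >= observed) + 1) / (len(null) + 1)
```

Because the largest null cluster is saved on every iteration, comparing each observed cluster against this distribution controls the family-wise error rate across all datapoints.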
Decoding analysis
Finally, to investigate the domain-generality and test-retest reliability of confidence-
related EEG activity, we performed a series of multivariate decoding analyses. First, within each
task on each day separately, we tested whether single-trial confidence could be decoded from
post-stimulus EEG activity (both ERP and spectral power respectively). Crucially, we then
performed a cross-decoding analysis which allowed us to investigate whether a classifier trained
on a task from one cognitive domain could also predict confidence in the other cognitive domain.
A significant cross-classification would indicate that shared neural representations underlie
confidence judgements across cognitive domains, providing evidence for domain-general neural
correlates of metacognition. If cross-classification is not possible or is weaker than within-task
classification, perceptual and knowledge decision confidence may be related to domain-specific
or only partially overlapping neural mechanisms. Similarly, we also investigated the test-retest
reliability of the within-task decoding by testing whether a classifier trained on the data from the
first day could decode confidence on the second day as well.
We performed all multivariate pattern analyses using the MVPA Light toolbox (Treder,
2020) in MATLAB 2022a (Mathworks, USA). Classifiers were trained to decode between high and
low confidence trials which we determined using a binning method that ensured the most even
split between high/low confidence trials within each participant. We calculated the number of
trials for each of the 6 possible confidence ratings and then binned the trials so that the difference
in trial numbers between the high and low confidence bins was minimised (mean difference day 1 =
15.52%, range 0.25-42.65%; day 2 = 14.64%, range 1.52-44.25%). This procedure is
equivalent to a within-participant median split of continuous data. Across datasets from both
days (n = 50), 5 datasets were in a bin split into [1 = low bin] [2 3 4 5 6 = high bin]; 5 in a split of
[1 2] [3 4 5 6]; 10 in [1 2 3] [4 5 6]; 12 in [1 2 3 4] [5 6]; and 18 in [1 2 3 4 5] [6].
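This even-split binning amounts to scanning the five possible cut-points of the 6-point scale; a sketch (a hypothetical helper of our own, not the authors' code):

```python
import numpy as np

def split_confidence(ratings):
    """Bin 1-6 confidence ratings into low/high groups at the cut-point that
    minimises the difference in trial counts (approximate median split)."""
    ratings = np.asarray(ratings)
    best_cut, best_diff = 1, np.inf
    for cut in range(1, 6):  # low bin: ratings <= cut; high bin: ratings > cut
        diff = abs(np.sum(ratings <= cut) - np.sum(ratings > cut))
        if diff < best_diff:
            best_cut, best_diff = cut, diff
    return ratings > best_cut, best_cut

# A participant who mostly used the top of the scale ends up with a
# [1 2 3 4 5] vs [6] split, as in several of the reported datasets:
high, cut = split_confidence([6, 6, 6, 6, 5, 4, 3, 2, 6, 6])
```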
We used a Linear Discriminant Analysis (LDA) classifier and calculated the area under the
receiver operating characteristic curve (AUC) as a classifier performance metric. For within-task
decoding, we used 5-fold cross-validation to avoid overfitting and repeated it 10 times to reduce
the noise following the random assignment of trials into each fold. For analysis of ERP data, we
computed a classifier that identifies the spatial distribution of EEG activity (across all 64
electrodes) which maximally distinguished between high and low confidence at each time point
(0:1s). To assess the temporal generalisation of the classification performance, all classifiers were
trained and tested on each time point of the data, thereby producing a 2D matrix of decoding
performance (train x test time). The shape of the final decoding matrix allows inferences about
the temporal dynamics of mental representations (King & Dehaene, 2014). Similarly, for the
analysis of spectral power, we performed a classification analysis using the average power in the
classic alpha (8-12Hz) and beta (13-30Hz) bands across all post-stimulus time points (0:1s),
training and testing the classifier at each time point. We trained the classifiers using the average
power from individual bands as well as their combination.
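A minimal temporal-generalisation sketch of this procedure (the authors used the MVPA Light toolbox in MATLAB; the Python version below with scikit-learn and simulated data is ours, simplified to a single 5-fold cross-validation run and fewer channels/time points):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(2)
n_trials, n_chan, n_time = 120, 8, 20
X = rng.normal(size=(n_trials, n_chan, n_time))  # trials x channels x time
y = rng.integers(0, 2, n_trials)                 # high (1) vs low (0) confidence
X[y == 1, :, 10:] += 0.8                         # late 'confidence' signal

# Train x test time AUC matrix, averaged over 5 stratified folds
auc = np.zeros((n_time, n_time))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in cv.split(X[:, :, 0], y):
    for t_train in range(n_time):
        clf = LinearDiscriminantAnalysis().fit(X[train, :, t_train], y[train])
        for t_test in range(n_time):
            scores = clf.decision_function(X[test, :, t_test])
            auc[t_train, t_test] += roc_auc_score(y[test], scores) / 5
```

The diagonal of `auc` corresponds to training and testing at the same time point; off-diagonal cells quantify how well a discrimination pattern learned at one time generalises to another.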
We performed the cross-task and cross-day decoding using an LDA classifier with AUC
performance metric as well. In the between-task classification, we first trained the classifier on
the whole dataset (combined across both testing days) from the knowledge task and then tested
it on data from the perceptual task (and vice versa). We examined test-retest reliability by
training the classifier on day 1 data and testing it on day 2 data for each task separately.
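Cross-decoding itself is simply train-on-one-set, test-on-the-other. A sketch with simulated data in which both "tasks" share a common confidence pattern (the simulation, the pattern `w`, and all names are our own assumptions):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
w = np.full(8, 0.6)  # assumed common confidence pattern across 8 channels

def simulate_task(n=150):
    """Trials whose high-confidence mean shifts along the shared pattern w."""
    X = rng.normal(size=(n, 8))
    y = rng.integers(0, 2, n)
    X[y == 1] += w
    return X, y

X_know, y_know = simulate_task()  # 'knowledge' task
X_perc, y_perc = simulate_task()  # 'perceptual' task

# Train on one domain, test on the other: above-chance AUC here indicates
# a shared (domain-general) discriminative pattern
clf = LinearDiscriminantAnalysis().fit(X_know, y_know)
cross_auc = roc_auc_score(y_perc, clf.decision_function(X_perc))
```

The same logic applies to the between-day analysis, with day 1 data as the training set and day 2 data as the test set.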
The statistical significance of all classifiers at the group level was tested using cluster-
based non-parametric permutation testing (Maris & Oostenveld, 2007) similar to that we
employed in the single-trial regression analyses. We compared the actual classification values at
all datapoints (in a 2D matrix for temporal generalisation) to a null classification AUC value (0.5)
with a paired-samples t-test. An element was considered significant if p < .05 (one-tailed). All
neighbouring datapoints that passed the element-level significance threshold were collected into
a cluster. Directly and diagonally adjacent elements were considered as neighbours. We obtained
a cluster-level statistic by summing the individual t-values within each cluster. We then compared
this against a null distribution generated by repeating the whole procedure 2000 times while
randomly permuting the data. Clusters with cluster-level p < .05 were considered significant.
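Collecting directly and diagonally adjacent significant elements corresponds to 8-connectivity labelling of the 2D (train x test time) map; a sketch using SciPy (the threshold and data below are illustrative, not the authors' values):

```python
import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(4)
t_map = rng.normal(size=(20, 20))   # t-values over train x test time
t_map[5:10, 5:10] += 4.0            # a block of genuinely high values

sig = t_map > 1.711  # illustrative one-tailed p < .05 threshold (df = 24)
# 8-connectivity: directly AND diagonally adjacent elements count as neighbours
structure = np.ones((3, 3), dtype=int)
labels, n_clusters = label(sig, structure=structure)

# Cluster-level statistic: summed t-values within each cluster
cluster_stats = [t_map[labels == k].sum() for k in range(1, n_clusters + 1)]
largest = max(cluster_stats)
```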
3. Results
3.1 Behavioural results
First, we evaluated and compared the overall performance on both the perceptual and
knowledge tasks. Figure 2A (day 1) and Figure 2B (day 2) plot the group-averaged proportion of
correct responses as a function of evidence discriminability for both tasks. As expected, in both
tasks (and on both days), the proportion of correct responses increased significantly from the
hardest to the easiest trials (Perceptual task day 1: F(1,24) = 151.71, pGG<.0001,
η2= 0.863, all post-hoc p’s <.05, Perceptual task day 2: F(1,24) = 153.77, pGG<.0001, η2= 0.865, all
post-hoc p’s <.05 apart from bins 4-5 (p = .734); Knowledge task day 1: F(1,24) = 247.55, p <.0001,
η2= 0.912, all post-hoc p’s <.05, Knowledge task day 2: F(1,24) = 161.99, p < .0001, η2= 0.871, all
post-hoc p’s < .05). Comparisons of proportion of correct responses between the tasks,
performed separately at each evidence discriminability level, showed that on both days
participants were significantly more accurate in the perceptual task at all difficulty levels apart
from the easiest condition (all p’s < .05). The scatterplots in Figure 2A-B further show that the
overall proportion of correct responses on the perceptual and knowledge tasks were not
significantly correlated in either experimental session (ICCs day 1: r = .287, p = .078; day 2: r = .289,
p = .076). Hence, 1st-order accuracy did not show strong domain-generality across tasks. However,
it did show strong test-retest reliability from session 1 to session 2 within both tasks (Figure 2C:
perception r = .817, p < .0001; knowledge r = .807, p < .0001). Figure 2C also shows that overall
perceptual task accuracy increased from day 1 (mean proportion correct: M = 0.824) to day 2 (M =
0.842) (t(24) = -2.227, p = .036, Cohen’s d = .445), whereas overall knowledge task accuracy did
not significantly differ from day 1 (M = 0.753) to day 2 (M = 0.743) (t(24) = 1.157, p = .259, d = .231).
Next, we investigated subjective confidence across tasks. Figure 2D (day 1) and Figure 2E
(day 2) plot group-averaged confidence ratings as a function of evidence discriminability for both
tasks. As expected, average confidence ratings increased significantly as task difficulty decreased
(Perceptual task day 1: F(1,24) = 50.36, pGG < .0001, η2 = 0.677, all post-hoc p’s < .05 apart from
comparisons in bins 0-1 (p = .9995) and 0-2 (p = .091); Perceptual task day 2: F(1,24) = 35.21,
pGG < .0001, η2 = 0.595, all post-hoc p’s < .05 apart from bins 0-1 (p = .491); Knowledge task day
1: F(1,24) = 72.71, pGG < .0001, η2 = 0.752, all post-hoc p’s < .05 apart from bins 1-2 (p = .678);
Knowledge task day 2: F(1,24) = 57.00, pGG < .0001, η2 = 0.704, all post-hoc p’s < .05 apart from
bins 1-2 (p = .067)). Despite 1st-order accuracy being higher for the perceptual task across most
difficulty levels (Fig 2A-B), we did not observe any significant between-task differences in average
confidence at any of the individual difficulty levels on either of the testing days (Fig 2D-E: all p’s
> .05). In contrast to 1st-order accuracy, overall average confidence ratings were strongly
correlated between the tasks on both testing days (Figure 2D-E scatterplots; ICCs day 1: r = .550,
p = .0018; day 2: r = .595, p = .0007). Hence, unlike accuracy, confidence ratings showed reliable
and moderately strong domain-generality across the cognitive domains tested (perception and
semantic memory). Average confidence ratings also showed excellent test-retest reliability from
session 1 to session 2 within both tasks (Figure 2F: ICCs perception r = .845, p < .0001; knowledge
r = .737, p < .0001). In line with 1st-order accuracy, Figure 2F shows that overall confidence
increased from day 1 (M = 3.787) to day 2 (M = 4.120) (t(24) = -3.222, p = .004, d = .645) for the
perceptual task, but not for the knowledge task (t(24) = -1.012, p = .322; day 1 M = 3.877, day 2
M = 3.999, d = .202).
Figure 2: Model-free behavioural performance. A-B Group-averaged proportion of correct
responses per evidence discriminability level on each testing day. Error bars represent between-
participant standard error of the mean (SEM). Scatterplots show the relationship between the
proportion of correct responses in each task (with intraclass correlation coefficients (two-way
consistency type)). C Comparisons of objective performance between testing days. The bar chart
shows mean proportion correct for each task on each day. Error bars represent standard
deviation. The scatterplots show the test-retest reliability of objective performance in each task
separately, measured by intraclass correlations (ICCs). D-E Group-averaged confidence ratings
per difficulty level. Error bars represent SEM. Scatterplots show the correlation in confidence
rating between tasks. F Test-retest reliability (scatterplots) and between-day comparison of
confidence ratings (bar chart, error bars = SD) in each task. *** p < .0001, ** p < .001, * p < .05
Overall, in both tasks the difficulty manipulation was effective, and participants used the
confidence scale appropriately. Test-retest reliabilities of both confidence and objective accuracy
were high within-tasks, but only confidence showed a significant between-task correlation,
indicating domain-generality.
Model-based metacognition
We next investigated the reliability and domain-generality of model-based measures of
performance (collapsed across all discriminability levels) derived from an extended signal
detection theory (SDT) framework (Maniscalco & Lau, 2012). Comparison of type-1 and type-2
performance measures between the tasks (Figure 3) showed that type-1 sensitivity (d’) was
significantly higher in perception than knowledge on both days (D1: perception M(SD) =
1.806(0.443), knowledge M(SD) = 1.392(0.416), t(24) = 4.118, p < .001, Cohen’s d = .824; D2:
perception M(SD) = 2.281(0.558), knowledge M(SD) = 1.344(0.509), t(24) = 7.257, p < .0001, d =
1.451), in line with the proportion correct results above. Overall type-2 (confidence) c’ (indexing
confidence bias) did not differ significantly between-tasks on either day (D1: perception M(SD) =
0.959(0.561), knowledge M(SD) = 0.831(0.452), t(24) = 1.407, p = .172, d = .281; D2: perception
M(SD) = 0.891(0.492), knowledge M(SD) = 0.758(0.339), t(24) = 2.047, p = .052, d = .409), whereas
metacognitive efficiency (Meta-d’/d’) was significantly higher for knowledge compared to
perception (D1: perception M(SD) = 0.709(0.225), knowledge M(SD) = 0.927(0.259), t(24) = -
3.914, p < .001, d = .783; D2: perception M(SD) = 0.675(0.244), knowledge M(SD) = 0.990(0.209),
t(24) = -5.754, p < .0001, d = 1.151). Hence, in line with our previous study (Benwell et al., 2022b),
whilst 1st-order performance was significantly better on the perception task, metacognitive
efficiency was significantly higher for the knowledge task. Higher metacognitive efficiency in the
knowledge task may be due to differences in evidence types between tasks. Unlike perceptual
evidence, semantic evidence in the knowledge task is internal and continuously available
throughout the trial for metacognitive evaluation.
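For reference, the type-1 SDT quantities reported here can be computed directly from hit and false-alarm rates using the standard formulas (estimating meta-d’ additionally requires fitting the full Maniscalco & Lau model, which is not reproduced in this sketch):

```python
from statistics import NormalDist

def sdt_type1(hit_rate, fa_rate):
    """Type-1 sensitivity d' and criterion c from hit/false-alarm rates
    (standard SDT formulas; rates must be strictly between 0 and 1)."""
    z = NormalDist().inv_cdf  # inverse standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    c = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, c

# Illustrative rates only (not values from this study)
d, c = sdt_type1(0.85, 0.20)
```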
Figure 3: Between-task comparisons of model-based type-1 and type-2 performance measures.
The top row plots day 1 data and the bottom row plots day 2 data. A-B type-1 sensitivity (d’), C-
D type-2 (confidence) criterion, and E-F metacognitive efficiency (M-ratio). The central boxplot
lines correspond to the median, the upper box edges correspond to the 0.75 quantile and the
lower edges represent the 0.25 quantile. The whiskers represent the non-outlier maximum and
minimum values.
Table 1: Domain-generality and test-retest reliability of Type-1 and Type-2 performance

Measure             Between-task ICC r(p)   95% CI [LB, UB]   Between-day ICC r(p)

Day 1 (between-task) / Perception (between-day)
d’                  0.316 (.058)            [-.083, .627]     0.790 (<.0001)
Type-2 criterion    0.601 (.0006)           [.277, .802]      0.850 (<.0001)
Meta-d’/d’          0.341 (.044)            [-.055, .644]     0.268 (.093)

Day 2 (between-task) / Knowledge (between-day)
d’                  0.270 (.091)            [-.132, .596]     0.837 (<.0001)
Type-2 criterion    0.705 (<.0001)          [.436, .858]      0.738 (<.0001)
Meta-d’/d’          0.277 (.086)            [-.125, .600]     0.180 (.190)

Note: Intraclass correlations (ICCs, two-way consistency type) between tasks (computed within each day) and between days (computed within each task) for each type-1 and metacognitive measure. Lower (LB) and upper (UB) bounds of the 95% confidence intervals are reported for the between-task ICC r-values.
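The two-way consistency ICC used throughout is commonly labelled ICC(3,1) (two-way mixed, consistency, single measures); a self-contained sketch of our own, implementing the standard two-way ANOVA decomposition for two scores per participant:

```python
import numpy as np

def icc_consistency(x, y):
    """ICC(3,1): two-way mixed, consistency, single-measures intraclass
    correlation for two ratings per subject (e.g. day 1 vs day 2 scores)."""
    data = np.column_stack([x, y]).astype(float)
    n, k = data.shape
    grand = data.mean()
    # Mean squares for rows (subjects) and residual error
    ms_rows = k * np.sum((data.mean(1) - grand) ** 2) / (n - 1)
    ss_err = np.sum((data - data.mean(1, keepdims=True)
                     - data.mean(0, keepdims=True) + grand) ** 2)
    ms_err = ss_err / ((n - 1) * (k - 1))
    # Consistency ICC: column (session) effects are removed, so an
    # additive shift between sessions does not lower the coefficient
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
```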
To test the domain-generality of the model-based metacognitive measures, we computed
between-task intra-class correlations for each. We reasoned that significant correlation of a given
measure between tasks suggests that a shared mechanism must contribute across cognitive
domains. As Table 1 shows (see Supplementary Figure S1 for scatterplots), the only measure that
showed a strong and replicable significant between-task correlation was type-2 (confidence) c’,
with r = .601 on day 1, and .705 on day 2. This suggests that overall confidence calibration
(metacognitive bias) strongly contributes to metacognition in a domain-general manner. In
contrast, neither type-1 sensitivity (d’) nor metacognitive efficiency (Meta-d’/d’) showed
replicable significant between-task correlations. Although metacognitive efficiency showed a
weak significant correlation between tasks on day 1, this was not replicated on day 2.
We tested the test-retest reliability of all measures with between-day, within-measure
intra-class correlations (Table 1, see also Supplementary Figure S2). Both d’ and type-2 c’ were
strongly correlated across both testing sessions for both tasks (all p’s < .0001, r range = .738-.850)
suggesting excellent test-retest reliability of type-1 sensitivity and confidence bias. In contrast,
metacognitive efficiency (Meta-d’/d’) did not significantly correlate between testing sessions for
either task, indicating poor test-retest reliability of this measure.
Summarising the results for our metacognitive behavioural measures, overall confidence
calibration (as indexed by both average confidence ratings and type-2 c’) was both domain-
general and highly reliable over time, whereas metacognitive efficiency (as indexed by Meta-
d’/d’) was neither domain-general nor reliable over time. Hence, we focussed our EEG analyses
on identifying reliable signatures associated with single-trial confidence reports across cognitive
domains.
3.2 EEG results
To identify EEG predictors of confidence during decision formation, we investigated
single-trial relationships between confidence ratings and both ERP and time-frequency spectral
power activity for each task separately across both experimental sessions.
Late stimulus-locked ERP activity reliably predicts domain-general subjective confidence
We used cluster-based permutation analysis (Maris & Oostenveld, 2007) to identify
systematic relationships between single-trial ERP amplitude and confidence ratings across all
electrodes and post-stimulus time points (0-1s relative to stimulus presentation). Whilst
controlling for difficulty level and 1st-order accuracy, we observed significant relationships
between a late component of the evoked potential and confidence ratings in both tasks which
replicated across both experimental sessions (Figure 4).
Figure 4: Late ERP activity reliably reflects subjective confidence ratings, independently of
accuracy and evidence discriminability, across cognitive domains. Grand-averaged stimulus-
locked ERP waveforms at electrode P3 for high (red) and low (blue) confidence trials are shown
for the A Perception task (day 1), B Knowledge task (day 1), C Perception task (day 2), and D
Knowledge task (day 2). The high versus low confidence trials were defined in a binning
procedure that minimised the difference in the number of high/low confidence trials within each
participant (i.e., approximate median split of ordinal data, see Methods). Red/blue shaded
waveform areas represent SEM of the respective waveforms. Note that we only performed the
binning here to plot high and low confidence waveforms for illustrative purposes. We performed
the statistical analysis on single-trial (non-binned) EEG and behavioural data. Gray shaded areas
represent the significant time windows for positive clusters, as identified through cluster-based
permutation testing. The topographies plot the mean t-values for all time-points from the
significant cluster across participants, with both significant positive cluster electrodes and
negative cluster electrodes highlighted in white.
For the perceptual task on day 1, we found that two significant clusters predicted
confidence (Figure 4A): one positive (cluster statistic = 17,693, p = .0015; spanning ~490-1000ms
post-stimulus over centroparietal/occipital electrodes with a left parietal maximum (see
topography in Figure 4A)) and one negative (cluster statistic = -18,845, p = .001; spanning ~520-
1000ms post-stimulus over frontal electrodes with a right maximum). We found two highly
similar clusters that predicted confidence on the knowledge task on day 1 as well (Figure 4B): one
positive (cluster statistic = 11,124, p = .0025; ~490-960ms over centroparietal electrodes (see
Figure 4B topography)) and one negative (cluster statistic = -9,953 p= .0015; ~470-890ms over
right frontotemporal electrodes).
These results were largely replicated in the second testing session (Figure 4C-D). On day
two, we found a positive (cluster statistic = 14,267, p = .002, beginning at ~530ms, over left
centroparietal/occipital electrodes, Fig. 4C) and a negative (cluster statistic = -17,997, p = .0005,
starting at ~530ms, over right frontotemporal electrodes) cluster in the perceptual task. In the
knowledge task (day 2, Fig. 4D), we found a positive cluster (cluster statistic = 4,360, p = .0265,
spanning ~640-810ms, over left centroparietal/occipital electrodes) and a negative cluster
(cluster statistic = -6,387, p = .0125, spanning ~480-930ms, over right frontotemporal electrodes).
Based on the topography and timing of the observed clusters, they possibly represent
opposite poles of the dipolar field of the late portion of the P3/centro-parietal positivity (CPP) component
previously implicated in decision-making and subjective measures of perception including
confidence (Azizi et al., 2021; Feuerriegel et al., 2022; Fitzgerald et al., 2022; Gherman &
Philiastides, 2015, 2018; Herding et al., 2019; Lim et al., 2020; Tagliabue et al., 2019). Participants’
confidence tended to be higher on trials with stronger late CPP amplitude irrespective of the
cognitive domain tested, thus highlighting that the CPP reliably reflects domain-general
subjective confidence over and above the external stimulus information (i.e., difficulty) and
objective accuracy.
Post-stimulus, but not pre-stimulus, alpha/beta desynchronisation predicts subjective
confidence
We additionally tested whether single-trial spectral power across time (-1s:1s) and
frequencies (1:40 Hz) is related to confidence ratings. We included the 1s pre-stimulus period in
this analysis because previous studies have suggested that spontaneous EEG power during this
period predicts subsequent confidence in perceptual decisions (Samaha et al., 2017; Wöstmann
et al., 2019).
Figure 5: Single-trial alpha/beta desynchronisation reflects subjective confidence ratings,
independently of accuracy and evidence discriminability. The time-frequency plots show the
mean t-values across all electrodes at each time-frequency point for the A Perception task (day
1), B Knowledge task (day 1), C Perception task (day 2), and D Knowledge task (day 2). We
obtained the T-values by comparing the single-trial regression coefficients (representing the
unique relationship between confidence and spectral power) against zero with cluster-based
permutation testing. The significant clusters are outlined in black contour (dependent-samples t-
test, 2-tailed, cluster p < .025). The black vertical line at 0s denotes stimulus onset. The right inset
of each panel plots topographies of the average t-values for the significant clusters across alpha
(8-12Hz) and beta (13-30Hz) frequencies separately. Electrodes included in each cluster are
highlighted in white.
In Figure 5, the direction and strength of relationships between EEG power and
confidence, controlling for difficulty level and 1st-order accuracy, are represented as t-values
averaged across all electrodes at each time and frequency point. These t-values represent group-
level tests of whether the regression coefficients (EEG power predicting confidence) from the
individual single-trial analyses showed a systematic linear relationship across participants. For
the perceptual task on day 1 (Figure 5A), we found a single negative cluster predicting confidence
ratings, starting ~220ms after stimulus onset and extending to 1s, and spanning the 6-38Hz frequency
range (cluster statistic = -32,772, p = .0025). Similarly, one negative post-stimulus cluster predicted
confidence in the knowledge task on day 1 (Figure 5B), beginning at ~340ms and extending to 1s, spanning
4-32Hz (cluster statistic = -50,344, p = .0005). The negative relationships indicate that
participants’ confidence ratings were higher on trials with greater alpha and beta
desynchronisation (i.e., lower alpha/beta power) following stimulus onset for both tasks. As we
observed this effect after controlling for difficulty level and accuracy in the multiple regression
models, it indicates that post-stimulus alpha and beta power encode confidence over and above
external stimulus information (Griffiths et al., 2019) and objective accuracy. Both the perceptual
and knowledge task effects were globally distributed across the scalp (all 64 electrodes were
included in significant clusters), as shown in the topographical maps of mean t-values (Figure 5A,
5B: right inset). Interestingly, these results were only replicated on day 2 for the knowledge but
not the perceptual task (Figure 5C-D). In the knowledge task, we found a significant negative
cluster extending from ~-200ms to 1s and spanning 7-38Hz, with all 64 electrodes significant
(cluster statistic = -55,987, p = .0005), in line with the day 1 results. However, we did not find any
significant clusters predicting confidence in the perceptual task on day 2 (Figure 5C: all negative
cluster p’s ≥ .359). In contrast to previous literature (Samaha et al., 2017; Wöstmann et al.,
2019), we found no significant pre-stimulus clusters predicting confidence.
Time-resolved decoding of confidence from ERPs
To further investigate the domain-generality and test-retest reliability of the observed
EEG-confidence relationships, we performed additional analyses involving multivariate decoding
of confidence from single-trial EEG activity. We reasoned that if a classifier trained to dissociate
high versus low confidence trials using single-trial activity on one task could successfully decode
confidence on the other (untrained) task, then the neural signature must be engaged during both
perceptual and knowledge-based confidence judgements (i.e., it must be domain-general). In
contrast, if a signature is domain-specific then the classifier will not be able to predict confidence
ratings on the non-trained task. Similarly, if a classifier trained on day 1 could successfully decode
confidence on (untrained) day 2, then this would indicate high test-retest reliability of the EEG-
confidence signature. Additionally, predictions based on cross-validation analyses constitute
stronger evidence than inferential models, as cross-validation directly quantifies the ability to
generalise predictions to new data (Bzdok et al., 2020).
First, to establish whether confidence could be decoded from the ERP data we performed
within-task decoding of confidence on each testing day separately. We trained and tested LDA
classifiers at each time point (0:1s, stimulus-locked), using 5-fold cross-validation, to investigate
how the discrimination pattern generalised over time. In both tasks and on both days, we
found significant decoding of confidence. Figure 6 shows significant within-task, within-day
clusters beginning at ~500ms (Fig 6A: day 1, p < .0001) and ~400ms (Fig 6B: day 2, p = .001) for
the perceptual task. We also trained a classifier on ERP data from day 1 and tested it on data
from day 2 (separately for each task) to examine test-retest reliability of the neural markers of
confidence. We found significant between-day decoding for the perceptual task beginning at
approximately 430ms (Fig 6C: p = .0005). In the knowledge task, there were significant within-
task, within-day clusters beginning at ~340ms (Fig 6D: day 1, p = .0005) and ~450ms (Fig 6E: day
2, p < .0001). In line with the perceptual task, we found significant between-day decoding for the
knowledge task at approximately 430ms (Fig 6F: p = .001). For all classifiers, the best decoding
was centred around the diagonal where the training and testing data came from the same time
points. However, there was also statistically significant off-diagonal decoding suggesting a degree
of temporal generalisation of the classifier discrimination performance. Overall, confidence could
be reliably decoded in both tasks and across separate days from the late ERP, which represents a
temporally stable neural marker of subjective self-evaluation.
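To make the decoding scheme concrete, the time-resolved, temporal-generalisation analysis can be sketched as follows. This is a minimal illustration on simulated data using scikit-learn; the epoch dimensions, channel count, and injected late effect are assumptions chosen for demonstration, not the study's actual parameters or preprocessing.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Simulated single-trial epochs: trials x channels x time points (0-1 s).
n_trials, n_channels, n_times = 200, 32, 20
X = rng.standard_normal((n_trials, n_channels, n_times))
y = rng.integers(0, 2, n_trials)        # 1 = high confidence, 0 = low confidence
X[y == 1, :8, 12:] += 0.8               # inject a late confidence-dependent signal

def temporal_generalisation(X, y, n_splits=5):
    """Train an LDA at each time point and test it at every time point (k-fold CV).

    Returns an (n_train_times x n_test_times) matrix of mean AUC values;
    off-diagonal structure indicates temporal generalisation of the pattern."""
    n_times = X.shape[2]
    auc = np.zeros((n_times, n_times))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(X[:, :, 0], y):
        for t_train in range(n_times):
            clf = LinearDiscriminantAnalysis()
            clf.fit(X[train_idx, :, t_train], y[train_idx])
            for t_test in range(n_times):
                scores = clf.decision_function(X[test_idx, :, t_test])
                auc[t_train, t_test] += roc_auc_score(y[test_idx], scores)
    return auc / n_splits

auc = temporal_generalisation(X, y)
# Late time points (where the signal was injected) decode well above chance;
# early time points stay near AUC = 0.5, as in the within-task analyses.
```

With real epoched EEG, the same loop structure applies once trials are arranged into a trials-by-channels-by-time array; cluster-based significance testing over the AUC matrix would follow separately.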
Figure 6: Time-resolved decoding of decision confidence from single-trial stimulus-locked ERPs.
Linear Discriminant Analysis (LDA) classifiers were trained and tested at all post-stimulus (0-1s)
time points. Mean AUC values across participants are shown. The topographies show group
averaged correlations between the classifier decision values and the ERP voltages at each
participant’s peak AUC time-point. Note that negative decision values correspond to high
confidence trials so that overall negative correlations plotted represent a positive relationship
between confidence and ERP amplitude. A-B Within-task (A day 1, B day 2) decoding of
perceptual confidence. C Cross-day decoding of confidence in perception, where the classifier
was trained on data from day 1 and tested on data from day 2. D-E Within-task (D day 1, E day 2)
decoding of knowledge confidence. F Cross-day decoding in the knowledge task. G-H Cross-task
decoding using data combined across both days. Significant clusters (one-tailed t-test, p < .05)
are highlighted in black.
To further test the domain-generality of the ERP-confidence relationships, we performed
time-resolved cross-task decoding, whereby a classifier was first trained on one task (data
combined across both days) and then tested on the data from the other task (and vice versa). We
found significant cross-task decoding in both decoding directions (knowledge to perception, and
perception to knowledge). For the classifier trained on the knowledge task, we found a significant
cluster beginning at ~580ms (p = .0015) (Figure 6G). There were two significant clusters, spanning
~430-970ms (p = .01) and ~830-1000ms (p = .032), respectively, in the performance of the
classifier trained on the perceptual task (Figure 6H). These results therefore support the late
ERP as a reliable, domain-general predictor of confidence ratings.
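The cross-task test reduces to fitting a classifier on all trials of one task and scoring it on the other. Below is a hedged sketch on simulated data; the shared spatial pattern is an assumption made purely to illustrate why transfer succeeds when a signature is domain-general.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def simulate_task(n_trials, effect_topography, rng):
    """One task's single-trial data at a fixed time point: trials x channels."""
    X = rng.standard_normal((n_trials, effect_topography.size))
    y = rng.integers(0, 2, n_trials)
    X[y == 1] += effect_topography      # confidence modulates a fixed spatial pattern
    return X, y

# A shared "late ERP" topography drives confidence in both simulated tasks
# (an illustrative assumption; domain-specific patterns would break transfer).
topography = np.zeros(32)
topography[:8] = 0.8

X_know, y_know = simulate_task(300, topography, rng)   # e.g. knowledge task
X_perc, y_perc = simulate_task(300, topography, rng)   # e.g. perceptual task

# Train on one task, test on the other (no trials shared between fit and score).
clf = LinearDiscriminantAnalysis().fit(X_know, y_know)
cross_auc = roc_auc_score(y_perc, clf.decision_function(X_perc))
```

If the two tasks were instead generated with unrelated topographies, the same code would yield chance-level cross-task AUC, mirroring the domain-specific case described below for spectral power.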
Time-resolved decoding of confidence from spectral power
Next, we recreated the above analyses using time-frequency spectral power. To improve
the signal-to-noise ratio while also being able to examine the temporal generalisability of any
EEG power-confidence relationships, we performed the decoding using the canonical frequency
bands. We focused on alpha (8-12Hz) and beta (13-30Hz) bands based on previous literature
showing their link to subjective task performance like perceptual awareness (Benwell et al., 2017,
2022) and confidence (for response-locked activity, Faivre et al., 2018). To test whether
confidence can be decoded from alpha + beta activity (combined), we trained and tested LDA
classifiers at each post-stimulus time point (0-1s) using 5-fold cross-validation as described
above. We found significant cluster-level decoding in both tasks and on both days (Figure 7A-B
and 7D-E). In the perceptual task, we found significant within-task, within-day decoding with a
cluster beginning at ~360ms (p < .0001) on day 1 (Figure 7A), and a cluster beginning at ~600ms
(p = .0025) on day 2 (Figure 7B). To test the test-retest reliability of these neural signatures we
trained a classifier on day 1 data and tested it on day 2 data (separately for each task). Figure 7C
shows that we found no significant clusters for cross-day decoding of confidence in the
perceptual task (all p’s > .287). In the knowledge task, we also found significant within-task,
within-day decoding (Figure 7D-E). On day 1 there was a large significant cluster starting at ~440ms (p <
.0001) and a smaller one from ~60 to ~400ms (p = .042, Figure 7D). On day 2 of the knowledge task,
we found a cluster spanning ~440 to ~980ms, p < .0001 (Figure 7E). In contrast to perception, we
found significant cross-day decoding in the knowledge task (in line with the results of single-trial
regression analyses), with a cluster starting at ~540ms (p = .0005, Figure 7F). When we used
power from individual alpha and beta frequency bands to train the classifiers, we obtained similar
results, with significant decoding found in both tasks and on both days (apart from alpha power
decoding in the perceptual task on day 2, which did not reach significance) (see Supplementary
Figures S3 & S4). When we included each frequency band in the between-day classification
models separately, we found no significant cross-day decoding for 8-12Hz power, while for 13-
30Hz power we found significant, though weaker, decoding in the knowledge task. Hence, this
indicates that alpha + beta (8-30Hz) spectral power is only a reliable predictor of semantic knowledge
confidence.
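Band-limited power features of this kind are commonly obtained by zero-phase band-pass filtering followed by a Hilbert envelope. The sketch below illustrates that approach on one simulated trial; the sampling rate, filter order, and signal parameters are assumptions for demonstration, not necessarily the authors' time-frequency method.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 250.0                              # sampling rate in Hz (assumed for illustration)
t = np.arange(0, 1, 1 / fs)             # 0-1 s post-stimulus epoch

def band_power(signal, low, high, fs, order=4):
    """Instantaneous band-limited power: zero-phase band-pass + Hilbert envelope."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, signal)
    return np.abs(hilbert(filtered)) ** 2

rng = np.random.default_rng(2)
# One simulated trial: a 10 Hz (alpha) oscillation that desynchronises after 0.5 s,
# mimicking the post-stimulus power decrease associated with higher confidence.
amplitude = np.where(t < 0.5, 2.0, 0.5)
trial = amplitude * np.sin(2 * np.pi * 10 * t) + 0.2 * rng.standard_normal(t.size)

alpha_power = band_power(trial, 8, 12, fs)    # alpha band (8-12 Hz)
beta_power = band_power(trial, 13, 30, fs)    # beta band (13-30 Hz)
# The drop in alpha_power after ~0.5 s is the desynchronisation a classifier would
# pick up; stacking such features across channels yields the decoding input matrix.
```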
Finally, to establish the extent of domain-generality, we tested whether a classifier trained on
one task could predict confidence ratings in the other task. As Figure 7G-H shows, we found
no significant cross-task decoding in either decoding direction for 8-30Hz power (knowledge to
perception (Fig 7G), nor perception to knowledge (Fig 7H), all p’s > .05). Similarly, we found no
consistent significant decoding when we tested 8-12Hz and 13-30Hz power separately
(Supplementary Figures S3-4).
Overall, these results suggest that spectral power in the alpha + beta band (8-30Hz) predicts
confidence reliably only in a domain-specific manner, relating more closely to confidence
ratings in the knowledge task.
Figure 7: Time-resolved decoding of decision confidence from single-trial α+β (8-30Hz) spectral
power (stimulus-locked, 0-1s). The classifiers were trained and tested on all post-stimulus (0-1s)
time points. Mean AUC values across participants are shown. The topographies show group
averaged correlations between the classifier decision values and the spectral power at each
participant’s peak AUC time-point. Note that negative decision values correspond to high
confidence trials so that overall positive correlations plotted represent a negative relationship
between confidence and 8-30Hz power. A-B Within-task (A day 1, B day 2) decoding of perceptual
confidence. C Cross-day decoding of confidence in perception, where the classifier was trained
on data from day 1 and tested on data from day 2. D-E Within-task (D day 1, E day 2) decoding of
knowledge confidence. F Cross-day decoding in the knowledge task. G-H Cross-task decoding using
data combined across both days. Significant clusters (one-tailed t-test, p < .05) are highlighted in
black.
4. Discussion
To understand the neurocomputational mechanisms underlying metacognition, it is
crucial to establish whether its behavioural and neurophysiological signatures are reliable over
time and generalisable across cognitive domains. Here, we examined the test-retest reliability
and domain-generality of both behavioural and EEG measures of metacognition. Overall
confidence calibration (i.e., metacognitive bias) was highly domain-general across two cognitive
domains (perception and semantic memory) and showed strong test-retest reliability across two
experimental sessions. In contrast, metacognitive efficiency (indexed by meta-d’/d’) showed low
domain-generality and poor test-retest reliability. These results suggest that both domain-
general and domain-specific resources contribute to overall metacognitive performance, with a
dissociation between metacognitive bias and efficiency, wherein the former may be underpinned
by more global metacognitive mechanisms. Trial-by-trial fluctuations of confidence within
participants were predicted by both stimulus-locked ERP and ERSP activity during decision
formation, though the ERP signature was the most reliable predictor across domains and
sessions. The results 1) reveal the test-retest reliability of the most widely adopted measures in
the field, 2) contribute importantly to the debate about domain-generality of metacognition, and
3) show that confidence is encoded in two distinct stimulus-locked EEG signatures during decision
formation.
Test-retest reliability and domain-generality of metacognitive measures
High test-retest reliability of model-based measures is essential to build a robust science
of metacognition and for translating findings into relevant fields such as computational
psychiatry (Brown et al., 2020). To our knowledge, this is the first study to report the test-retest
reliability of one of the most widely adopted measures of metacognitive efficiency (meta-d’/d’,
often called the M-ratio (Fleming & Lau, 2014; Maniscalco & Lau, 2012)). For both the perception
and knowledge tasks, the M-ratio showed poor test-retest reliability. This extends previous
research showing poor split-half reliability of M-ratio within a single testing session (Guggenmos,
2021). The M-ratio was also not strongly correlated between tasks, in line with studies that found
no (or weak) domain-generality of efficiency measures (Ais et al., 2016; Baird et al., 2013; Benwell
et al., 2022b; Fitzgerald et al., 2017). Metacognitive efficiency was significantly higher for
knowledge than for perception, and this effect replicated across days despite the low test-retest
reliability of the individual-participant measures; type-1 sensitivity (d’) showed the opposite
pattern, being higher for perception. It remains to be seen whether the lack of reliability across time and domains
represents inherent variability of metacognitive efficiency itself (Bang et al., 2019; Shekhar &
Rahnev, 2021a) or noisiness of the M-ratio measure. Alternative efficiency measures have
recently been proposed (Desender et al., 2022; Guggenmos, 2022; Miyoshi & Nishida, 2022;
Shekhar & Rahnev, 2021a), and their test-retest reliability and domain-generality should be
investigated in future studies.
In contrast to metacognitive efficiency, confidence bias (type-2 c’) was highly reliable
between testing sessions in both cognitive domains and showed strong domain-generality. This
is in line with previous studies that found strong between-task correlations of metacognitive bias
(Ais et al., 2016; Benwell et al., 2022b; Mazancieux et al., 2020), suggesting overall confidence
level is a domain-general, trait-like characteristic that is less influenced by task-specific
representations and evidence type. This highlights that maladaptive confidence calibration may
have a more pervasive influence on everyday life, in line with its association with symptoms of
psychopathology (Benwell et al., 2022b; Rouault, Seow, et al., 2018) and longitudinal evidence of
confidence bias improvement with mental health treatment (Fox et al., 2023). Hence, identifying
reliable, domain-general EEG predictors of confidence may represent a first step towards
understanding neural mechanisms relevant for mental health.
The CPP/P3 is a reliable and domain-general predictor of confidence
Across both tasks and both sessions, within-participant fluctuations in the level of
confidence from trial to trial were correlated with a late ERP component, likely corresponding to
the centro-parietal positivity (CPP)/P300, based on its latency and topography (Kelly & O’Connell,
2013; O’Connell et al., 2012; Tagliabue et al., 2019). Higher amplitudes of this component during
decision formation preceded higher confidence ratings. This is consistent with previous studies
which found that the CPP scales with both implicit (e.g., statistical) and explicit confidence level
(Feuerriegel et al., 2022; Fitzgerald et al., 2022; Gherman & Philiastides, 2015, 2018; Grogan et
al., 2023; Herding et al., 2019; Rausch et al., 2020; Zakrzewski et al., 2019). However, previous
studies have focused predominantly on perceptual confidence. Here we show that the CPP/P300
also tracks explicit post-decisional reports of confidence, over and above trial difficulty and
accuracy, even when internal (rather than sensory) evidence is evaluated during semantic
knowledge decisions. Moreover, we were able to train a classifier that successfully discriminated
between high/low confidence trials across testing days and cognitive domains, hence providing
strong evidence for test-retest reliability and domain-generality of this neural signature. The test-
retest reliability of the effect is particularly important given increasing concerns regarding the
reliability of scientific findings over the last decade, with numerous replication failures,
particularly within neuroscience (Botvinik-Nezer et al., 2019; Haines et al., 2023; Pavlov et al.,
2021; Poldrack et al., 2017).
The stimulus-locked CPP, which closely matches our confidence discriminating ERP
component, is thought to reflect a supramodal decision signal that tracks accumulating evidence
(Kelly & O’Connell, 2013; O’Connell et al., 2012). Hence, the relationship with confidence
suggests that 1st- and 2nd-order decision-related signals may unfold together, in line with
these processes being coupled in the early stages of decision formation
(Gherman & Philiastides, 2015; Kiani & Shadlen, 2009; van den Berg et al., 2016). Interestingly,
Gherman and Philiastides (2018) identified a confidence-discriminating component, similar in
timing and topographical distribution to ours, which simultaneous fMRI localised to the
ventromedial prefrontal cortex (vmPFC). The vmPFC has also been shown to encode confidence
in other functional neuroimaging studies (De Martino et al., 2013; Lebreton et al., 2015; Morales
et al., 2018). Hence, we speculate that the confidence ERP effect we observed may reflect a
domain-general PFC metacognitive signal.
Alpha/beta desynchronisation also predicts confidence
In addition to the ERP-confidence relationship, greater post-stimulus alpha/beta
desynchronisation also predicted higher confidence ratings, over and above accuracy and trial
difficulty, in both the perceptual and knowledge tasks. Post-stimulus alpha/beta
desynchronisation has previously been shown to correlate with perceptual awareness (Benwell
et al., 2017, 2022a) and perceptual confidence (Faivre et al., 2018). Here we show that alpha/beta
desynchronisation is associated with explicit confidence ratings during decision formation across
both perceptual and semantic knowledge decisions. However, the effect was only reliable over
time in the knowledge task (and not perception), suggesting it is more domain-specific than the
ERP effect. Alpha desynchronisation is proposed to reflect increases in cortical excitability
modulated by global noradrenergic arousal according to attentional demands (Dahl et al., 2022;
Kosciessa et al., 2021). Importantly, higher confidence ratings have also been related to
physiological measures of arousal like increased heart rate (Allen et al., 2016). Hence, it is
possible that the fluctuations in post-stimulus power we observed reflect internal states, such as
attention and arousal, that concurrently influence confidence. This could be tested in the future
by combining EEG recordings with other physiological measures. Additionally, the global
alpha/beta desynchronisation effects we observed could result from oscillatory communication
in widespread neural networks (Buzsáki & Draguhn, 2004). The lack of successful cross-task
decoding may suggest that although alpha/beta mechanisms are linked to confidence in both
tasks, the networks involved, and hence the topographical patterns captured in our decoding
analysis, are relatively domain-specific.
Importantly, both late ERPs and alpha/beta desynchronisation correlated with confidence
independently of evidence strength and 1st-order accuracy. This dissociation between subjective
confidence and objective performance is in line with 2nd-order confidence generation models
(Fleming & Daw, 2016; Pleskac & Busemeyer, 2010) which propose that, unlike accuracy,
confidence may rely on additional computations (Murphy et al., 2015) and be more susceptible
to factors beyond evidence strength, like attention (Denison et al., 2018). Our results suggest
that prior to the 1st-order response these computations are linked to both decision-related ERPs
as well as alpha/beta power, whereby the former may represent a metacognitive readout of
accumulating decision evidence (Gherman & Philiastides, 2018) while the latter may be more
closely linked to additional influences such as arousal and/or attention (Dahl et al., 2022;
Kosciessa et al., 2021).
It is important to note that other EEG signatures have been linked to confidence such as
the response-locked error-related negativity and error positivity (Boldt & Yeung, 2015; Desender
et al., 2019, though see Feuerriegel et al., 2022; Grogan et al., 2023; Rausch et al., 2020). Here,
we focused on stimulus-locked activity and pre-response neural signatures of confidence by
delaying the response period by 1s from stimulus onset. Hence, we were unable to conduct
meaningful reaction time or response-locked analyses. Future studies can investigate the
reliability and domain-generality of these alternative confidence signatures.
In conclusion, confidence calibration (metacognitive bias) is highly reliable across time
and cognitive domains. In contrast, metacognitive efficiency (as indexed by meta-d’/d’) is not
reliable over time, nor between tasks. This emphasises the need for the development of more
reliable model-based measures of metacognition that may need to be tailored according to
different types of decision evidence. We show that two distinct neural signatures encode
confidence judgements: slow ERPs, which are reliable across both time and cognitive domains,
and alpha/beta desynchronisation, which is stronger and more reliable across time in the
semantic knowledge domain (relative to perception). Identifying reliable and domain-general
neural predictors of confidence represents a crucial step towards understanding the neural
mechanisms underlying suboptimal self-evaluation.
References
Ais, J., Zylberberg, A., Barttfeld, P., & Sigman, M. (2016). Individual consistency in the accuracy
and distribution of confidence judgments. Cognition, 146, 377–386.
https://doi.org/10.1016/J.COGNITION.2015.10.006
Allen, M., Frank, D., Schwarzkopf, D. S., Fardo, F., Winston, J. S., Hauser, T. U., & Rees, G. (2016).
Unexpected arousal modulates the influence of sensory noise on confidence. ELife, 5,
e18103. https://doi.org/10.7554/eLife.18103
Arbuzova, P., Maurer, L. K., & Filevich, E. (2022). Metacognitive domains are not aligned along a
dimension of internal-external information source. Psychonomic Bulletin & Review, 1-11.
https://doi.org/10.3758/s13423-022-02201-1
Azizi, Z., Zabbah, S., Jahanitabesh, A., & Ebrahimpour, R. (2021). Improvement of association
between confidence and accuracy after integration of discrete evidence over
time. BioRxiv, 2021-06. https://doi.org/10.1101/2021.06.20.449145
Baird, B., Smallwood, J., Gorgolewski, K. J., & Margulies, D. S. (2013). Medial and lateral
networks in anterior prefrontal cortex support metacognitive ability for memory and
perception. Journal of Neuroscience, 33(42), 16657–16665.
https://doi.org/10.1523/JNEUROSCI.0786-13.2013
Bang, J. W., Shekhar, M., & Rahnev, D. (2019). Sensory noise increases metacognitive efficiency.
Journal of Experimental Psychology: General, 148(3), 437.
https://doi.org/10.1037/xge0000511
Benwell, C. S. Y., Tagliabue, C. F., Veniero, D., Cecere, R., Savazzi, S., & Thut, G. (2017).
Prestimulus EEG power predicts conscious awareness but not objective visual
performance. ENeuro, 4(6). https://doi.org/10.1523/ENEURO.0182-17.2017
Benwell, C. S. Y., Coldea, A., Harvey, M., & Thut, G. (2022a). Low pre-stimulus EEG alpha power
amplifies visual awareness but not visual sensitivity. European Journal of Neuroscience.
https://doi.org/10.1111/ejn.15166
Benwell, C. S. Y., Mohr, G., Wallberg, J., Kouadio, A., & Ince, R. A. A. (2022b). Psychiatrically
relevant signatures of domain-general decision-making and metacognition in the
general population. Npj Mental Health Research, 1(1), Article 1.
https://doi.org/10.1038/s44184-022-00009-4
Boldt, A., & Yeung, N. (2015). Shared neural markers of decision confidence and error
detection. Journal of Neuroscience, 35(8), 3478–3484.
https://doi.org/10.1523/JNEUROSCI.0797-14.2015
Botvinik-Nezer, R., Iwanir, R., Holzmeister, F., Huber, J., Johannesson, M., Kirchler, M., Dreber,
A., Camerer, C. F., Poldrack, R. A., & Schonberg, T. (2019). FMRI data of mixed gambles
from the Neuroimaging Analysis Replication and Prediction Study. Scientific Data, 6(1),
Article 1. https://doi.org/10.1038/s41597-019-0113-7
Botvinik-Nezer, R., & Wager, T. D. (2022). Reproducibility in neuroimaging analysis: challenges
and solutions. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging.
https://doi.org/10.1016/j.bpsc.2022.12.006
Brown, V. M., Chen, J., Gillan, C. M., & Price, R. B. (2020). Improving the reliability of
computational analyses: Model-based planning and its relationship with compulsivity.
Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 5(6), 601–609.
https://doi.org/10.1016/j.bpsc.2019.12.019
Buzsáki, G., & Draguhn, A. (2004). Neuronal oscillations in cortical networks. Science, 304(5679),
1926–1929. https://doi.org/10.1126/science.1099745
Bzdok, D., Engemann, D., & Thirion, B. (2020). Inference and prediction diverge in biomedicine.
Patterns, 1(8), 100119. https://doi.org/10.1016/j.patter.2020.100119
Dahl, M. J., Mather, M., & Werkle-Bergner, M. (2022). Noradrenergic modulation of rhythmic
neural activity shapes selective attention. Trends in Cognitive Sciences, 26(1), 38–52.
https://doi.org/10.1016/J.TICS.2021.10.009
de Gardelle, V., & Mamassian, P. (2014). Does confidence use a common currency across two
visual tasks? Psychological Science, 25(6), 1286–1288.
https://doi.org/10.1177/0956797614528956
De Martino, B., Fleming, S. M., Garrett, N., & Dolan, R. J. (2013). Confidence in value-based
choice. Nature Neuroscience, 16(1), Article 1. https://doi.org/10.1038/nn.3279
Delorme, A., & Makeig, S. (2004). EEGLAB: an open source toolbox for analysis of single-trial
EEG dynamics including independent component analysis. Journal of Neuroscience
Methods, 134(1), 9–21. https://doi.org/10.1016/J.JNEUMETH.2003.10.009
Denison, R. N., Adler, W. T., Carrasco, M., & Ma, W. J. (2018). Humans incorporate attention-
dependent uncertainty into perceptual decisions and confidence. Proceedings of the
National Academy of Sciences of the United States of America, 115(43), 11090–11095.
https://doi.org/10.1073/PNAS.1717720115/-/DCSUPPLEMENTAL
Desender, K., Murphy, P., Boldt, A., Verguts, T., & Yeung, N. (2019). A postdecisional neural
marker of confidence predicts information-seeking in decision-making. The Journal of
Neuroscience, 39(17), 3309–3319. https://doi.org/10.1523/JNEUROSCI.2620-18.2019
Desender, K., Vermeylen, L., & Verguts, T. (2022). Dynamic influences on static measures of
metacognition. Nature Communications, 13(1), 4208. https://doi.org/10.1038/s41467-
022-31727-0
Faivre, N., Filevich, E., Solovey, G., Kühn, S., & Blanke, O. (2018). Behavioral, modeling, and
electrophysiological evidence for supramodality in human metacognition. Journal of
Neuroscience, 38(2), 263–277. https://doi.org/10.1523/JNEUROSCI.0322-17.2017
Feuerriegel, D., Murphy, M., Konski, A., Mepani, V., Sun, J., Hester, R., & Bode, S. (2022).
Electrophysiological correlates of confidence differ across correct and erroneous
perceptual decisions. NeuroImage, 259, 119447.
https://doi.org/10.1016/j.neuroimage.2022.119447
Fitzgerald, L. M., Arvaneh, M., & Dockree, P. M. (2017). Domain-specific and domain-general
processes underlying metacognitive judgments. Consciousness and Cognition, 49, 264–277.
https://doi.org/10.1016/J.CONCOG.2017.01.011
Fitzgerald, L. M., Arvaneh, M., Carton, S., O’Keeffe, F., Delargy, M., & Dockree, P. M. (2022).
Impaired metacognition and reduced neural signals of decision confidence in adults with
traumatic brain injury. Neuropsychology. https://eprints.whiterose.ac.uk/189892/
Fleming, S. M. (2017). HMeta-d: Hierarchical Bayesian estimation of metacognitive efficiency
from confidence ratings. Neuroscience of Consciousness, 2017(1), nix007.
https://doi.org/10.1093/nc/nix007
Fleming, S. M., & Daw, N. D. (2016). Self-evaluation of decision-making: A general Bayesian
framework for metacognitive computation. Psychological Review, 124(1), 91.
https://doi.org/10.1037/REV0000045
Fleming, S. M., & Lau, H. C. (2014). How to measure metacognition. Frontiers in Human
Neuroscience, 8, 443. https://doi.org/10.3389/FNHUM.2014.00443/ABSTRACT
Fox, C. A., Lee, C. T., Hanlon, A., Seow, T., Lynch, K., Harty, S., Richards, D., Palacios, J., O’Keane,
V., Stephan, K. E., & Gillan, C. (2023). Metacognition in anxious-depression is state-
dependent: An observational treatment study. PsyArXiv.
https://doi.org/10.31234/osf.io/uk7hr
Friston, K. (2008). Hierarchical models in the brain. PLOS Computational Biology, 4(11),
e1000211. https://doi.org/10.1371/journal.pcbi.1000211
Gherman, S., & Philiastides, M. G. (2015). Neural representations of confidence emerge from
the process of decision formation during perceptual choices. NeuroImage, 106.
https://doi.org/10.1016/j.neuroimage.2014.11.036
Gherman, S., & Philiastides, M. G. (2018). Human VMPFC encodes early signatures of
confidence in perceptual decisions. ELife, 7, e38293.
https://doi.org/10.7554/eLife.38293
Griffiths, B. J., Mayhew, S. D., Mullinger, K. J., Jorge, J., Charest, I., Wimber, M., & Hanslmayr, S.
(2019). Alpha/beta power decreases track the fidelity of stimulus specific information.
ELife, 8. https://doi.org/10.7554/ELIFE.49562
Grogan, J. P., Rys, W., Kelly, S. P., & O’Connell, R. G. (2023). Confidence is predicted by pre- and
post-choice decision signal dynamics. bioRxiv.
https://doi.org/10.1101/2023.01.19.524702
Guggenmos, M. (2021). Measuring metacognitive performance: Type 1 performance
dependence and test-retest reliability. Neuroscience of Consciousness, 2021(1), niab040.
https://doi.org/10.1093/nc/niab040
Guggenmos, M. (2022). Reverse engineering of metacognition. ELife, 11, e75420.
https://doi.org/10.7554/eLife.75420
Haines, N., Sullivan-Toole, H., & Olino, T. (2023). From classical methods to generative models:
Tackling the unreliability of neuroscientific measures in mental health research.
Biological Psychiatry: Cognitive Neuroscience and Neuroimaging.
https://doi.org/10.1016/j.bpsc.2023.01.001
Hedge, C., Powell, G., & Sumner, P. (2018). The reliability paradox: Why robust cognitive tasks
do not produce reliable individual differences. Behavior Research Methods, 50(3),
1166–1186. https://doi.org/10.3758/s13428-017-0935-1
Herding, J., Ludwig, S., von Lautz, A., Spitzer, B., & Blankenburg, F. (2019). Centro-parietal EEG
potentials index subjective evidence and confidence during perceptual decision making.
NeuroImage, 201, 116011. https://doi.org/10.1016/j.neuroimage.2019.116011
Hoven, M., Lebreton, M., Engelmann, J. B., Denys, D., Luigjes, J., & van Holst, R. J. (2019).
Abnormalities of confidence in psychiatry: An overview and future perspectives.
Translational Psychiatry, 9(1), Article 1. https://doi.org/10.1038/s41398-019-0602-7
Kelemen, W. L., Frost, P. J., & Weaver, C. A. (2000). Individual differences in metacognition:
Evidence against a general metacognitive ability. Memory & Cognition, 28(1), 92–107.
https://doi.org/10.3758/BF03211579
Kelly, S. P., & O’Connell, R. G. (2013). Internal and external influences on the rate of sensory
evidence accumulation in the human brain. Journal of Neuroscience, 33(50).
https://doi.org/10.1523/JNEUROSCI.3355-13.2013
Kiani, R., & Shadlen, M. N. (2009). Representation of confidence associated with a decision by
neurons in the parietal cortex. Science, 324(5928), 759–764.
https://doi.org/10.1126/science.1169405
King, J.-R., & Dehaene, S. (2014). Characterizing the dynamics of mental representations: The
temporal generalization method. Trends in Cognitive Sciences, 18(4), 203–210.
https://doi.org/10.1016/j.tics.2014.01.002
Kosciessa, J. Q., Lindenberger, U., & Garrett, D. D. (2021). Thalamocortical excitability
modulation guides human perception under uncertainty. Nature Communications,
12(1), 1–15. https://doi.org/10.1038/s41467-021-22511-7
Lebreton, M., Abitbol, R., Daunizeau, J., & Pessiglione, M. (2015). Automatic integration of
confidence in the brain valuation signal. Nature Neuroscience, 18(8), Article 8.
https://doi.org/10.1038/nn.4064
Lim, K., Wang, W., & Merfeld, D. M. (2020). Frontal scalp potentials foretell perceptual choice
confidence. Journal of Neurophysiology, 123(4), 1566–1577.
https://doi.org/10.1152/jn.00290.2019
Maniscalco, B., & Lau, H. (2012). A signal detection theoretic approach for estimating
metacognitive sensitivity from confidence ratings. Consciousness and Cognition, 21(1),
422–430. https://doi.org/10.1016/j.concog.2011.09.021
Maris, E., & Oostenveld, R. (2007). Nonparametric statistical testing of EEG- and MEG-data.
Journal of Neuroscience Methods, 164(1), 177–190.
https://doi.org/10.1016/j.jneumeth.2007.03.024
Mazancieux, A., Fleming, S. M., Souchay, C., & Moulin, C. J. A. (2020). Is there a G factor for
metacognition? Correlations in retrospective metacognitive sensitivity across tasks.
Journal of Experimental Psychology: General, 149(9), 1788–1799.
https://doi.org/10.1037/xge0000746
McCurdy, L. Y., Maniscalco, B., Metcalfe, J., Liu, K. Y., de Lange, F. P., & Lau, H. (2013).
Anatomical coupling between distinct metacognitive systems for memory and visual
perception. Journal of Neuroscience, 33(5), 1897–1906.
https://doi.org/10.1523/JNEUROSCI.1890-12.2013
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation
coefficients. Psychological Methods, 1, 30–46.
https://doi.org/10.1037/1082-989X.1.1.30
Miyoshi, K., & Nishida, S. Y. (2022). GGSDT: A unified signal detection framework for confidence
data analysis. bioRxiv. https://doi.org/10.1101/2022.10.28.514329
Morales, J., Lau, H., & Fleming, S. M. (2018). Domain-general and domain-specific patterns of
activity supporting metacognition in human prefrontal cortex. Journal of Neuroscience,
38(14), 3534–3546. https://doi.org/10.1523/JNEUROSCI.2360-17.2018
Murphy, P. R., Robertson, I. H., Harty, S., & O’Connell, R. G. (2015). Neural evidence
accumulation persists after choice to inform metacognitive judgments. ELife, 4, e11946.
https://doi.org/10.7554/ELIFE.11946
O’Connell, R. G., Dockree, P. M., & Kelly, S. P. (2012). A supramodal accumulation-to-bound
signal that determines perceptual decisions in humans. Nature Neuroscience, 15(12),
1729–1735. https://doi.org/10.1038/nn.3248
Oostenveld, R., Fries, P., Maris, E., & Schoffelen, J. M. (2011). FieldTrip: Open source software
for advanced analysis of MEG, EEG, and invasive electrophysiological data.
Computational Intelligence and Neuroscience, 2011, 1–9.
https://doi.org/10.1155/2011/156869
Pavlov, Y. G., et al. (2021). #EEGManyLabs: Investigating the replicability of influential EEG
experiments. Cortex, 144, 213–229. https://doi.org/10.1016/j.cortex.2021.03.013
Peirce, J., Gray, J. R., Simpson, S., MacAskill, M., Höchenberger, R., Sogo, H., Kastman, E., &
Lindeløv, J. K. (2019). PsychoPy2: Experiments in behavior made easy. Behavior Research
Methods, 51(1), 195–203. https://doi.org/10.3758/s13428-018-01193-y
Pleskac, T. J., & Busemeyer, J. R. (2010). Two-stage dynamic signal detection: A theory of
choice, decision time, and confidence. Psychological Review, 117(3), 864–901.
https://doi.org/10.1037/a0019737
Poldrack, R. A., Baker, C. I., Durnez, J., Gorgolewski, K. J., Matthews, P. M., Munafò, M. R.,
Nichols, T. E., Poline, J.-B., Vul, E., & Yarkoni, T. (2017). Scanning the horizon: Towards
transparent and reproducible neuroimaging research. Nature Reviews Neuroscience,
18(2), Article 2. https://doi.org/10.1038/nrn.2016.167
Rausch, M., Zehetleitner, M., Steinhauser, M., & Maier, M. E. (2020). Cognitive modelling
reveals distinct electrophysiological markers of decision confidence and error
monitoring. NeuroImage, 218, 116963.
https://doi.org/10.1016/j.neuroimage.2020.116963
Rouault, M., Mcwilliams, A., Allen, M. G., & Fleming, S. M. (2018). Human metacognition across
domains: Insights from individual differences and neuroimaging. Personality
Neuroscience, 1, 1–13. https://doi.org/10.1017/pen.2018.16
Rouault, M., Seow, T., Gillan, C. M., & Fleming, S. M. (2018). Psychiatric symptom dimensions
are associated with dissociable shifts in metacognition but not task performance.
Biological Psychiatry, 84(6), 443–451. https://doi.org/10.1016/j.biopsych.2017.12.017
Rouault, M., Lebreton, M., & Pessiglione, M. (2023). A shared brain system forming confidence
judgment across cognitive domains. Cerebral Cortex, 33(4), 1426–1439.
https://doi.org/10.1093/cercor/bhac146
Samaha, J., Iemi, L., & Postle, B. R. (2017). Prestimulus alpha-band power biases visual
discrimination confidence, but not accuracy. Consciousness and Cognition, 54, 47–55.
https://doi.org/10.1016/j.concog.2017.02.005
Samaha, J., & Postle, B. R. (2017). Correlated individual differences suggest a common
mechanism underlying metacognition in visual perception and visual short-term
memory. Proceedings of the Royal Society B: Biological Sciences, 284(1867), 20172035.
https://doi.org/10.1098/rspb.2017.2035
Sanders, J. I., Hangya, B., & Kepecs, A. (2016). Signatures of a statistical computation in the
human sense of confidence. Neuron, 90(3), 499–506.
https://doi.org/10.1016/j.neuron.2016.03.025
Shahar, N., Hauser, T. U., Moutoussis, M., Moran, R., Keramati, M., NSPN Consortium, & Dolan, R.
J. (2019). Improving the reliability of model-based decision-making estimates in the two-
stage decision task with reaction-times and drift-diffusion modeling. PLOS
Computational Biology, 15(2), e1006803. https://doi.org/10.1371/journal.pcbi.1006803
Shekhar, M., & Rahnev, D. (2021a). The nature of metacognitive inefficiency in perceptual
decision making. Psychological Review, 128, 45–70. https://doi.org/10.1037/rev0000249
Shekhar, M., & Rahnev, D. (2021b). Sources of metacognitive inefficiency. Trends in Cognitive
Sciences, 25(1), 12–23. https://doi.org/10.1016/j.tics.2020.10.007
Sherman, M. T., Seth, A. K., & Barrett, A. B. (2018). Quantifying metacognitive thresholds using
signal-detection theory. bioRxiv, 361543. https://doi.org/10.1101/361543
Song, C., Kanai, R., Fleming, S. M., Weil, R. S., Schwarzkopf, D. S., & Rees, G. (2011). Relating
inter-individual differences in metacognitive performance on different perceptual tasks.
Consciousness and Cognition, 20(4), 1787–1792.
https://doi.org/10.1016/j.concog.2010.12.011
Tagliabue, C. F., Veniero, D., Benwell, C. S. Y., Cecere, R., Savazzi, S., & Thut, G. (2019). The EEG
signature of sensory evidence accumulation during decision formation closely tracks
subjective perceptual experience. Scientific Reports, 9(1), 1–12.
https://doi.org/10.1038/s41598-019-41024-4
Treder, M. S. (2020). MVPA-Light: A classification and regression toolbox for multi-
dimensional data. Frontiers in Neuroscience, 14, 289.
https://doi.org/10.3389/fnins.2020.00289
van den Berg, R., Anandalingam, K., Zylberberg, A., Kiani, R., Shadlen, M. N., & Wolpert, D. M.
(2016). A common mechanism underlies changes of mind about decisions and
confidence. ELife, 5, e12192. https://doi.org/10.7554/eLife.12192
Winkler, I., Haufe, S., & Tangermann, M. (2011). Automatic classification of artifactual ICA-
components for artifact removal in EEG signals. Behavioral and Brain Functions, 7(1),
1–15. https://doi.org/10.1186/1744-9081-7-30
Wöstmann, M., Waschke, L., & Obleser, J. (2019). Prestimulus neural alpha power predicts
confidence in discriminating identical auditory stimuli. European Journal of
Neuroscience, 49(1). https://doi.org/10.1111/ejn.14226
Zakrzewski, A. C., Wisniewski, M. G., Iyer, N., & Simpson, B. D. (2019). Confidence tracks
sensory- and decision-related ERP dynamics during auditory detection. Brain and
Cognition, 129, 49–58. https://doi.org/10.1016/j.bandc.2018.10.007
.CC-BY 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted April 22, 2023. ; https://doi.org/10.1101/2023.04.21.537831doi: bioRxiv preprint
... Consequently, metacognitive ability has been correlated with other stable individual differences, such as brain structure [10][11][12][13] . While metacognitive ability is often assumed to be domain-general and rely on shared neural substrates, this question remains hotly debated [14][15][16][17] . The construct of metacognitive ability is also thought to be different from other constructs such as task skill or bias, so it is often desirable to find metrics of metacognitive ability unrelated to these other constructs 18 . ...
... Guggenmos 41 examined both the split-half reliability and the across-participant correlation between d' and several measures of metacognition (meta-d', M-Ratio, M-Diff, and AUC2) finding surprisingly low reliability and significant correlations with d' for all measures. Relatedly, Kopcanova et al. 14 examined the test-retest reliability of M-Ratio and also found low-reliability values. Another paper developed a new technique to examine dependence on metacognitive bias and found that meta-d' and M-Ratio are not independent of metacognitive bias 28 . ...
... Similar test-retest correlation coefficients were obtained when Pearson correlation was computed instead of ICC (Fig. 6). These results are in line with the findings of Kopcanova et al. 14 and suggest that correlations between measures of metacognition and measures that do not substantially fluctuate on a day-by-day basis (e.g., structural brain measures) are likely to be particularly noisy such that very large sample sizes may be needed to find reliable results. ...
Article
Full-text available
One of the most important aspects of research on metacognition is the measurement of metacognitive ability. However, the properties of existing measures of metacognition have been mostly assumed rather than empirically established. Here I perform a comprehensive empirical assessment of 17 measures of metacognition. First, I develop a method of determining the validity and precision of a measure of metacognition and find that all 17 measures are valid and most show similar levels of precision. Second, I examine how measures of metacognition depend on task performance, response bias, and metacognitive bias, finding only weak dependences on response and metacognitive bias but many strong dependencies on task performance. Third, I find that all measures have very high split-half reliabilities, but most have poor test-retest reliabilities. This comprehensive assessment paints a complex picture: no measure of metacognition is perfect and different measures may be preferable in different experimental contexts.
... To investigate whether any time-on-task effects on EEG activity affected brainbehaviour relationships we performed a single-trial multiple regression analysis (for similar approach see Benwell et al., 2017Benwell et al., , 2022Kopčanová et al., 2023). We used a hierarchical twostage estimation (Friston, 2008) to incorporate participant level variability into group level statistics. ...
... In addition to time-on-task related EEG effects, we also found that alpha and beta desynchronisation uniquely predicted single-trial decision confidence independently of accuracy, RTs, stimulus contrast, and trial-order. This is in line with previous studies that showed post-stimulus alpha/beta power correlates with subjective judgements like confidence and perceptual awareness Faivre et al., 2018;Kopčanová et al., 2023) and is unrelated to accuracy Kopčanová et al., 2023). Given the independence of this effect from RTs, it is unlikely it can solely be attributed to motor preparation (Faivre et al., 2018). ...
... In addition to time-on-task related EEG effects, we also found that alpha and beta desynchronisation uniquely predicted single-trial decision confidence independently of accuracy, RTs, stimulus contrast, and trial-order. This is in line with previous studies that showed post-stimulus alpha/beta power correlates with subjective judgements like confidence and perceptual awareness Faivre et al., 2018;Kopčanová et al., 2023) and is unrelated to accuracy Kopčanová et al., 2023). Given the independence of this effect from RTs, it is unlikely it can solely be attributed to motor preparation (Faivre et al., 2018). ...
Preprint
Full-text available
Fluctuations in oscillatory brain activity have been shown to co-occur with variations in task performance. More recently, part of these fluctuations has been attributed to long-term (>1hr) monotonous trends in the power and frequency of alpha oscillations (8-13 Hz). Here we tested whether these time-on-task changes in EEG activity are limited to activity in the alpha band and whether they are linked to task performance. Thirty-six participants performed 900 trials of a two-alternative forced choice visual discrimination task with confidence ratings. Pre- and post-stimulus spectral power (1-40Hz) and aperiodic (i.e., non-oscillatory) components were compared across blocks of the experimental session and tested for relationships with behavioural performance. We found that time-on-task effects on oscillatory EEG activity were primarily localised within the alpha band, with alpha power increasing and peak alpha frequency decreasing over time, even when controlling for aperiodic contributions. Aperiodic, broadband activity on the other hand did not show time-on-task effects in our data set. Importantly, time-on-task effects in alpha frequency and power explained variability in single-trial reaction times. Moreover, controlling for time-on-task effectively removed the relationships between alpha activity and reaction times. However, time-on-task effects did not affect other EEG signatures of behavioural performance, including post-stimulus predictors of single-trial decision confidence. Therefore, our results dissociate alpha-band brain-behaviour relationships that can be explained away by time-on-task from those that remain after accounting for it - thereby further specifying the potential functional roles of alpha in human visual perception.
... Note that due to recently reported limitations of the m-ratio [27] as a reliable and unbiased measure of metacognitive ability, particularly when tasks have under 400 trials [12,28,29], we did not include this measure as one of the primary outcomes. However, for completeness and consistency with previous studies [15][16][17], we report regression results between m-ratio and both age and symptom dimensions in Supplementary Fig. S6. ...
Article
Full-text available
When making decisions in everyday life, we often rely on an internally generated sense of confidence to help us revise and direct future behaviours. For instance, confidence directly informs whether further information should be sought prior to commitment to a final decision. Many studies have shown that aging and both clinical and sub-clinical symptoms of psychopathology are associated with systematic alterations in confidence. However, it remains unknown whether these confidence distortions influence information-seeking behaviour. We investigated this question in a large general population sample (N = 908). Participants completed a battery of psychiatric symptom questionnaires and performed a perceptual decision-making task with confidence ratings in which they were offered the option to seek helpful information (at a cost) before committing to a final decision. Replicating previous findings, an ‘anxious-depression’ (AD) symptom dimension was associated with systematically low confidence, despite no detriment in objective task accuracy. Conversely, a ‘compulsive behaviour and intrusive thoughts’ (CIT) dimension was associated with impaired task accuracy but paradoxical over-confidence. However, neither symptom dimension was significantly associated with an increased or decreased tendency to seek information. Hence, participants scoring highly for AD or CIT did not use the option to information seek any more than average to either increase their confidence (AD) or improve the accuracy of their decisions (CIT). In contrast, older age was associated with impaired accuracy and decreased confidence initially, but increased information seeking behaviour mediated increases in both accuracy and confidence for final decisions. Hence, older adults used the information seeking option to overcome initial deficits in objective performance and to increase their confidence accordingly. 
The results show an appropriate use of information seeking to overcome perceptual deficits and low confidence in healthy aging which was not present in transdiagnostic psychopathology.
Article
Full-text available
Research on brain-behaviour relationships often makes the implicit assumption that these derive from a co-variation of stochastic fluctuations in brain activity and performance across trials of an experiment. However, challenging this assumption, oscillatory brain activity, as well as indicators of performance, such as response speed, can show systematic trends with time on task. Here, we tested whether time-on-task trends explain a range of relationships between oscillatory brain activity and response speed, accuracy as well as decision confidence. Thirty-six participants performed 900 trials of a two-alternative forced choice visual discrimination task with confidence ratings. Pre- and post-stimulus spectral power (1–40 Hz) and aperiodic (i.e., non-oscillatory) components were compared across blocks of the experimental session and tested for relationships with behavioural performance. We found that time-on-task effects on oscillatory EEG activity were primarily localised within the alpha band, with alpha power increasing and peak alpha frequency decreasing over time, even when controlling for aperiodic contributions. Aperiodic, broadband activity on the other hand did not show time-on-task effects in our data set. Importantly, time-on-task effects in alpha frequency and power explained variability in single-trial reaction times, and controlling for time-on-task effectively removed these relationships. Time-on-task effects did not affect other EEG signatures of behavioural performance, including post-stimulus predictors of single-trial decision confidence. Our results dissociate alpha-band brain-behaviour relationships that can be explained away by time-on-task from those that remain after accounting for it, thereby further specifying the potential functional roles of alpha in human visual perception.
Preprint
Full-text available
It is well established that one's confidence in a choice can be influenced by new evidence encountered after commitment has been reached, but the processes through which post-choice evidence is sampled remain unclear. To investigate this, we traced the pre- and post-choice dynamics of electrophysiological signatures of evidence accumulation (Centro-parietal Positivity, CPP) and motor preparation (mu/beta band) to determine their sensitivity to participants' confidence in their perceptual discriminations. Pre-choice CPP amplitudes scaled with confidence both when confidence was reported simultaneously with choice, or when reported 1-second after the initial direction decision. When additional evidence was presented during the post-choice delay period, the CPP continued to evolve after the initial choice, with a more prolonged build-up on trials with lower confidence in the alternative that was finally endorsed, irrespective of whether this entailed a change-of-mind. Further investigation established that this pattern was accompanied by earlier post-choice CPP peak latency, earlier lateralisation of motor preparation signals toward the ultimately chosen response, and faster confidence reports when participants indicated high certainty that they had made a correct or incorrect initial choice. These observations are consistent with confidence-dependent stopping theories according to which post-choice evidence accumulation ceases when a criterion level of confidence in a choice alternative has been reached. Our findings have implications for current models of choice confidence, and predictions they may make about EEG signatures.
Article
Full-text available
Advances in computational statistics and corresponding shifts in funding initiatives over the past few decades have led to a proliferation of neuroscientific measures being developed in the context of mental health research. Although such measures have undoubtedly deepened our understanding of neural mechanisms underlying cognitive, affective, and behavioral processes associated with various mental health conditions, the clinical utility of such measures remains underwhelming. Recent commentaries point toward the poor reliability of neuroscientific measures to partially explain this lack of clinical translation. Here, we provide a concise theoretical overview of how unreliability impedes clinical translation of neuroscientific measures; discuss how various modeling principles, including those from hierarchical and structural equation modeling frameworks, can help to improve reliability; and demonstrate how to combine principles of hierarchical and structural modeling within the generative modeling framework to achieve more reliable, generalizable measures of brain-behavior relationships for use in mental health research.
Article
Full-text available
It is still debated whether metacognition, or the ability to monitor our own mental states, relies on processes that are “domain-general” (a single set of processes can account for the monitoring of any mental process) or “domain-specific” (metacognition is accomplished by a collection of multiple monitoring modules, one for each cognitive domain). It has been speculated that two broad categories of metacognitive processes may exist: those that monitor primarily externally generated versus those that monitor primarily internally generated information. To test this proposed division, we measured metacognitive performance (using m-ratio, a signal detection theoretical measure) in four tasks that could be ranked along an internal-external axis of the source of information, namely memory, motor, visuomotor, and visual tasks. We found correlations between m-ratios in visuomotor and motor tasks, but no correlations between m-ratios in visual and visuomotor tasks, or between motor and memory tasks. While we found no correlation in metacognitive ability between visual and memory tasks, and a positive correlation between visuomotor and motor tasks, we found no evidence for a correlation between motor and memory tasks. This pattern of correlations does not support the grouping of domains based on whether the source of information is primarily internal or external. We suggest that other groupings could be more reflective of the nature of metacognition and discuss the need to consider other non-domain task-features when using correlations as a way to test the underlying shared processes between domains.
Preprint
Full-text available
Human decision behavior entails a graded awareness of its certainty, known as a feeling of confidence. Until now, considerable interest has been paid to behavioral and computational dissociations of decision and confidence, which has raised an urgent need for measurement frameworks that can quantify the efficiency of confidence rating relative to decision accuracy (metacognitive efficiency). As a unique addition to such frameworks, we have developed a new signal detection theory paradigm utilizing the generalized gaussian distribution (GGSDT). This framework evaluates the observer's internal standard deviation ratio and metacognitive efficiency through the scale and shape parameters respectively. The shape parameter quantifies the kurtosis of internal distributions and can practically be understood in reference to the proportion of the gaussian ideal observer's confidence being disrupted with random guessing (metacognitive lapse rate). This interpretation holds largely irrespective of the contaminating effects of decision accuracy or operating characteristic asymmetry. Thus, the GGSDT enables hitherto unexplored research protocols (e.g., direct comparison of yes/no versus forced-choice metacognitive efficiency), expected to find applications in various fields of behavioral science. This paper provides a detailed walkthrough of the GGSDT analysis with an accompanying R package (ggsdt).
Article
Full-text available
The human ability to introspect on thoughts, perceptions or actions − metacognitive ability − has become a focal topic of both cognitive basic and clinical research. At the same time it has become increasingly clear that currently available quantitative tools are limited in their ability to make unconfounded inferences about metacognition. As a step forward, the present work introduces a comprehensive modeling framework of metacognition that allows for inferences about metacognitive noise and metacognitive biases during the readout of decision values or at the confidence reporting stage. The model assumes that confidence results from a continuous but noisy and potentially biased transformation of decision values, described by a confidence link function. A canonical set of metacognitive noise distributions is introduced which differ, amongst others, in their predictions about metacognitive sign flips of decision values. Successful recovery of model parameters is demonstrated, and the model is validated on an empirical data set. In particular, it is shown that metacognitive noise and bias parameters correlate with conventional behavioral measures. Crucially, in contrast to these conventional measures, metacognitive noise parameters inferred from the model are shown to be independent of performance. This work is accompanied by a toolbox ( ReMeta ) that allows researchers to estimate key parameters of metacognition in confidence datasets.
Article
Full-text available
Human behaviours are guided by how confident we feel in our abilities. When confidence does not reflect objective performance, this can impact critical adaptive functions and impair life quality. Distorted decision-making and confidence have been associated with mental health problems. Here, utilising advances in computational and transdiagnostic psychiatry, we sought to map relationships between psychopathology and both decision-making and confidence in the general population across two online studies (N’s = 344 and 473, respectively). The results revealed dissociable decision-making and confidence signatures related to distinct symptom dimensions. A dimension characterised by compulsivity and intrusive thoughts was found to be associated with reduced objective accuracy but, paradoxically, increased absolute confidence, whereas a dimension characterized by anxiety and depression was associated with systematically low confidence in the absence of impairments in objective accuracy. These relationships replicated across both studies and distinct cognitive domains (perception and general knowledge), suggesting that they are reliable and domain general. Additionally, whereas Big-5 personality traits also predicted objective task performance, only symptom dimensions related to subjective confidence. Domain-general signatures of decision-making and metacognition characterise distinct psychological dispositions and psychopathology in the general population and implicate confidence as a central component of mental health.
Article
Full-text available
Humans differ in their capability to judge choice accuracy via confidence judgments. Popular signal detection theoretic measures of metacognition, such as M-ratio, do not consider the dynamics of decision making. This can be problematic if response caution is shifted to alter the tradeoff between speed and accuracy. Such shifts could induce unaccounted-for sources of variation in the assessment of metacognition. Instead, evidence accumulation frameworks consider decision making, including the computation of confidence, as a dynamic process unfolding over time. Using simulations, we show a relation between response caution and M-ratio. We then show the same pattern in human participants explicitly instructed to focus on speed or accuracy. Finally, this association between M-ratio and response caution is also present across four datasets without any reference towards speed. In contrast, when data are analyzed with a dynamic measure of metacognition, v-ratio, there is no effect of speed-accuracy tradeoff.
Article
Full-text available
Every decision we make is accompanied by an estimate of the probability that our decision is accurate or appropriate. This probability estimate is termed our degree of decision confidence. Recent work has uncovered event-related potential (ERP) correlates of confidence both during decision formation and after a decision has been made. However, the interpretation of these findings is complicated by methodological issues related to ERP amplitude measurement that are prevalent across existing studies. To more accurately characterise the neural correlates of confidence, we presented participants with a difficult perceptual decision task that elicited a broad range of confidence ratings. We identified a frontal ERP component with an onset prior to the behavioural response, which exhibited more positive-going amplitudes in trials with higher confidence ratings. This frontal effect also biased measures of the centro-parietal positivity (CPP) component at parietal electrodes via volume conduction. Amplitudes of the error positivity (Pe) component that followed each decision were negatively associated with confidence for trials with decision errors, but not for trials with correct decisions, with Bayes factors providing moderate evidence for the null in the latter case. We provide evidence for both pre- and post-decisional neural correlates of decision confidence that are observed in trials with correct and erroneous decisions, respectively. Our findings suggest that certainty in having made a correct response is associated with frontal activity during decision formation, whereas certainty in having committed an error is instead associated with the post-decisional Pe component. These findings also highlight the possibility that some previously reported associations between decision confidence and CPP/Pe component amplitudes may have been a consequence of ERP amplitude measurement-related confounds. Re-analysis of existing datasets may be useful to test this hypothesis more directly.
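The amplitude-measurement concern raised in this abstract can be made concrete with a toy single-trial analysis. Everything below is simulated and hypothetical (the epochs, confidence ratings, sampling rate, and 400–600 ms window are all assumptions for illustration); the point is the approach of averaging over a fixed a-priori window per confidence bin, which avoids the peak-picking biases such studies warn about:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single-trial data: epochs (trials x timepoints) at one
# electrode, sampled at 250 Hz, plus a confidence rating (1-4) per trial.
fs = 250
times = np.arange(-0.2, 0.8, 1 / fs)            # seconds, stimulus-locked
n_trials = 300
confidence = rng.integers(1, 5, size=n_trials)
# Toy epochs: a late positivity whose amplitude scales with confidence,
# embedded in unit-variance noise.
component = np.exp(-0.5 * ((times - 0.5) / 0.08) ** 2)
epochs = (confidence[:, None] * component[None, :]
          + rng.standard_normal((n_trials, times.size)))

# Mean amplitude in a fixed a-priori window (400-600 ms), averaged within
# each confidence bin; measuring a fixed window rather than picking peaks
# avoids noise-driven amplitude inflation.
window = (times >= 0.4) & (times <= 0.6)
amp = epochs[:, window].mean(axis=1)
binned = {c: amp[confidence == c].mean() for c in np.unique(confidence)}
print(binned)  # mean amplitude increases with confidence in this toy data
```

In this constructed example the binned amplitudes rise monotonically with confidence because the signal was built that way; with real EEG the same binning procedure provides an unbiased amplitude estimate against which such confidence effects can be tested.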
Preprint
Objective: Prior studies have found metacognitive biases are linked to a transdiagnostic dimension of anxious-depression, manifesting as reduced confidence in performance. However, previous work has been cross-sectional and so it is unclear if under-confidence is a trait-like marker of anxious-depression vulnerability, or if it resolves when anxious-depression improves. Methods: Data were collected as part of a large-scale transdiagnostic, four-week observational study of individuals initiating internet-based cognitive behavioural therapy (iCBT) or antidepressant medication. Self-reported clinical questionnaires and perceptual task performance were gathered to assess anxious-depression and metacognitive bias at baseline and four-week follow-up. Primary analyses were conducted for individuals who received iCBT (n=649), with comparisons between smaller samples that received antidepressant medication (n=82) and a control group receiving no intervention (n=88). Results: Prior to receiving treatment, anxious-depression severity was associated with under-confidence in performance in the iCBT arm, replicating previous work. From baseline to follow-up, levels of anxious-depression were significantly reduced, and this was accompanied by a significant increase in metacognitive confidence in the iCBT arm (B=0.17, SE=0.02, p<0.001). These changes were correlated (r(647)=-0.12, p=0.002); those with the greatest reductions in anxious-depression levels had the largest increase in confidence. While the three-way interaction effect of group and time on confidence was not significant (F(2, 1632)=0.60, p=0.550), confidence increased in the antidepressant group (B=0.31, SE=0.08, p<0.001), but not among controls (B=0.11, SE=0.07, p=0.103). Conclusions: Metacognitive biases in anxious-depression are state-dependent; when symptoms improve with treatment, so does confidence in performance. Our results suggest this is not specific to the type of intervention.
Article
Recent years have marked a renaissance in efforts to increase research reproducibility in psychology, neuroscience, and related fields. Reproducibility is the cornerstone of a solid foundation of fundamental research: one that will support new theories built on valid findings and technological innovation that works. The increased focus on reproducibility has made the barriers to it increasingly apparent, along with the development of new tools and practices to overcome these barriers. Here, we review challenges, solutions, and emerging best practices with a particular emphasis on neuroimaging studies. We distinguish three main types of reproducibility, discussing each in turn. Analytical reproducibility is the ability to reproduce findings using the same data and methods. Replicability is the ability to find an effect in new datasets, using the same or similar methods. Finally, robustness to analytical variability refers to the ability to identify a finding consistently across variation in methods. The incorporation of these tools and practices will result in more reproducible, replicable, and robust psychological and brain research and a stronger scientific foundation across fields of inquiry.