J. Neural Eng. 20 (2023) 026024 https://doi.org/10.1088/1741-2552/acc613
Journal of Neural Engineering
OPEN ACCESS
RECEIVED
27 January 2021
REVISED
27 February 2023
ACCEPTED FOR PUBLICATION
21 March 2023
PUBLISHED
31 March 2023
Original content from
this work may be used
under the terms of the
Creative Commons
Attribution 4.0 licence.
Any further distribution
of this work must
maintain attribution to
the author(s) and the title
of the work, journal
citation and DOI.
PAPER
Decoding study-independent mind-wandering from EEG using
convolutional neural networks
Christina Yi Jin1,2,∗, Jelmer P Borst1 and Marieke K van Vugt1
1Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, Groningen 9747AG, The
Netherlands
2Research Center for Augmented Intelligence, Zhejiang Lab, Hangzhou 310000, People’s Republic of China
∗Author to whom any correspondence should be addressed.
E-mail: cyj.sci@gmail.com
Keywords: EEG, convolutional neural network, mind-wandering, generalizability, meta-learner, classifier, machine learning
Supplementary material for this article is available online
Abstract
Objective. Mind-wandering is a mental phenomenon where the internal thought process
disengages from the external environment periodically. In the current study, we trained EEG
classifiers using convolutional neural networks (CNNs) to track mind-wandering across studies.
Approach. We transformed the input from raw EEG to band-frequency information (power),
single-trial ERP (stERP) patterns, and connectivity matrices between channels (based on inter-site
phase clustering). We trained CNN models for each input type from each EEG channel as the input
model for the meta-learner. To verify the generalizability, we used leave-N-participant-out
cross-validations (N=6) and tested the meta-learner on the data from an independent study for
across-study predictions. Main results. The current results show limited generalizability across
participants and tasks. Nevertheless, our meta-learner trained with the stERPs performed the best
among the state-of-the-art neural networks. The mapping of each input model to the output of the
meta-learner indicates the importance of each EEG channel. Significance. Our study makes the first
attempt to train study-independent mind-wandering classifiers. The results indicate that this
remains challenging. The stacking neural network design we used allows an easy inspection of
channel importance and feature maps.
1. Introduction
Mind-wandering is a thought process that is charac-
terized by being not directly relevant to the primary
goals in the current context (Smallwood and Schooler
2015). Mind-wandering tends to manifest itself as
attentional lapses which often contribute to mak-
ing errors in a task (Cheyne et al 2006). However,
mind-wandering does not result in errors in all cases.
Sometimes people can handle their primary tasks
well when they start enjoying these periods of self-
distraction as a temporary escape from the current
situation (Schooler et al 2011). This positive effect
of mind-wandering is particularly common when the
current task is associated with low cognitive load—
in other words, it is a task in which performance
can be achieved with little executive control involved
(Randall et al 2019) and therefore low performance
is not a fool-proof indicator of mind-wandering.
Another behavioral measure that has been proposed
to characterize mind-wandering is increased response
time variability. Several studies have shown that even
when no obvious mistakes are observed, participants
display increased variance in their response times
when their mind wanders (Bastian and Sackur 2013,
Seli et al 2013, Zheng et al 2019, Zanesco et al 2021b).
In addition to using behavior, mind-wandering
can also be detected by means of physiological and
neural measures. For example, researchers found that
the pupil diameter in reaction to stimuli becomes
smaller in an off-task state (Huijser et al 2018), pos-
sibly related to a vigilance decrement that tends to co-
occur with mind-wandering (Unsworth and Robison
2016). On the level of the cerebral cortex, mind-
wandering appears to be associated with inhibited
sensory processing to the visual stimuli, referred to
as ‘perceptual decoupling’ (Schooler et al 2011). This
perceptual decoupling manifests itself in the electro-
encephalogram (EEG) as a reduced P1 and increased
alpha power (frequency range 8.5–12 Hz) observed
© 2023 The Author(s). Published by IOP Publishing Ltd
J. Neural Eng. 20 (2023) 026024 C Y Jin et al
at the parietal-occipital regions (Kam and Handy
2013, Compton et al 2019, Jin et al 2019). In func-
tional magnetic resonance imaging (fMRI) studies,
mind-wandering is associated with increased activa-
tion of the default mode network (DMN), together
with changes in the connectivity between the DMN
and other networks (Christoff et al 2009, Ho et al
2019). Through indicating a memory retrieval pro-
cess, the involvement of the DMN supports the func-
tional role of mind-wandering as ‘spontaneous future
cognition’ (Cole and Kvavilashvili 2019), potentially
aiding problem-solving and creativity (Schooler et al
2011).
Given this set of neural and physiological correl-
ates of mind-wandering, several studies have explored
the possibility of predicting mind-wandering on a
single-trial level using machine learning. Mittner
and colleagues used multimodal signatures of the
(co)activations in the DMN, anti-correlated net-
work, and the pupil diameter as the features for a
machine learning model to predict mind wandering
(Mittner et al 2014). They found that neural data
could reliably predict mind-wandering with a median
accuracy of 79.7% using leave-one-participant-out
cross-validation (LOPOCV). They also found that
mind-wandering was not linked to the DMN, the anti-correlated network (ACN) or
pupil diameter alone, but that instead all the fea-
tures were necessary for the optimal predictive per-
formance (Mittner et al 2014). This seems sens-
ible, given that the contents of mind-wandering
can vary considerably. In more recent work, they
achieved 65% accuracy across participants by train-
ing another multimodality machine learning model
(Groot et al 2021). The authors attributed the dif-
ferent performance level achieved across studies to
individual biases in self-reports, differences in levels
of meta-awareness, and heterogeneity in the thought
content being reported (Groot et al 2021).
Kawashima and Kumano (2017) found that train-
ing an EEG-based mind-wandering classifier with
a non-linear support vector machine (SVM) per-
formed better than a linear SVM. They also found
that training with a selected subset of electrodes from
frontal and parietal-occipital regions performed bet-
ter than using all the electrodes. This suggests that
the association between EEG and mental state fluc-
tuations is more complex than a simple linear rela-
tionship and a subset of electrodes at key positions
might contain sufficient information for discrim-
inating mind-wandering. Jin et al (2019) explored
whether it was possible to predict mind wandering
on the basis of various features of EEG data, ranging
from power and inter-site phase clustering (ISPC)
in the alpha and theta bands, to single-trial event-
related potentials (stERPs). They were able to pre-
dict mind-wandering with an average accuracy of
64% for a sustained attention to response task, and
69% for a visual search task. Additionally, they per-
formed across-task predictions between the sustained
attention to response and visual search tasks, with an
accuracy of 59% and 60% (Jin et al 2019). In a related
study, the researchers used ICA-decomposed EEG in
the alpha band to predict the occurrence of mind-
wandering—albeit this time in a model that general-
ized across participants. They reported again an aver-
age accuracy of around 60% when testing the models
on a left-out dataset (LOPOCV), for both within- and
across-task predictions (Jin et al 2020).
Several conclusions can be drawn from the stud-
ies above. The accuracy with EEG seemed to hit a
ceiling at 60% when making generalization predic-
tions across individuals or across tasks. One possib-
ility is the relatively low signal-noise ratio of scalp
EEG compared to other neural imaging techniques,
such as intracranial EEG or fMRI. However, scalp
EEG is still one of the main signals to be used
for Brain-Computer Interfaces (BCIs) with healthy
users. Another possible cause for low accuracy is
that labels come from thought probes, which depend
on the accuracy of each individual’s introspection.
Other setups, for example facial video recording, eye-
tracking measures and pupillometry can potentially
help to validate and revise the subjective thought
probes. A third possible cause of low classifier per-
formance is that learning is performed with pre-
computed EEG features (e.g. P3 or alpha power).
These features are selected based on previous studies,
but they do not represent all the temporal-frequency
information from EEG sensors across the whole scalp.
Also, the studies discussed above trained SVMs to
learn the relationship between pre-computed fea-
tures. SVM—when used with a nonlinear kernel—
is a powerful tool for learning features based on
non-linear relationships, but its computational cost
is relatively high (O(n^3), Abdiansah and Wardoyo
(2015)3). To handle a large quantity of the frequency
information and/or evoked potentials from all the
sensors, we need more powerful machine learning
models.
The convolutional neural network (CNN) is a
good candidate for addressing this type of problem.
A CNN uses a kernel to detect the features of the
input signals. With deeper CNN layers, the learned
features become more abstract, while at the same time
the dimensions of the input decrease. CNN layers use far fewer parameters than
fully connected (FC) layers, which reduces the computational
and storage cost. In practice, multiple CNN
layers are designed to detect useful features. Then the
FC layers learn the relationship of the features for
the final prediction. Hosseini and Guo (2019) trained
a CNN architecture to detect mind-wandering in a
3 We tested the computation time with the current dataset. Raw
EEG signal from one channel as input to the CNN took 0.433 s to
finish one epoch of training. The same input to SVM took 1.4 s for
the training.
meditation task4. They trained on datasets from two
participants and achieved the highest performance
of 91.78% during ten-fold cross-validation (CV)
within-individuals. The authors also tried to predict
across individuals by training a classifier with each
individual only and using it to predict the other data-
set. The performance dropped to 66% for across-
individual predictions. Their results indicate that the
generalizability of mind-wandering classifiers is chal-
lenging, even with state-of-the-art neural networks.
Nevertheless, Hosseini and Guo (2019) only used
the raw EEG signal to train their CNN classifiers,
which may have limited their performance. Even
though all information is necessarily present in the
raw EEG signal, not all classifier architectures can
extract all informative features for decoding. For
example, it is possible that pre-analyses such as single
trial ERPs and temporal-frequency analyses reveal
information that otherwise remains hidden to the
classifier. In addition, they had a limited sample
size. In the current study, we will include band-
frequency information and stERPs as input types for
the CNN models as well as the raw EEG. Further-
more, we endeavor to train study-independent mind-
wandering classifiers: we trained and validated clas-
sifiers with data from Jin et al (2019, referred to as
Study A) and tested the classifiers on the data from
Jin et al (2020, referred to as Study B). In addition,
each study contained two independent tasks and dif-
ferent groups of participants participated in each of
these studies. The design and the probes of both stud-
ies can be found in figure 1.
For developing the CNN model, we first transformed
the raw EEG into the frequency power spectrum
and stERP contour maps. We also included
the ISPC reflecting connectivity between EEG channels.
Raw EEG, power and stERP from each channel,
and the ISPC from each channel pair, were each trained
with an independent CNN classifier. The output for
each model represents the activations of one on-task
neuron and one mind-wandering neuron using one
input type from one spatial point (channel or channel
pair). To figure out the relationship between the input
types and their spatial sampling points, we trained
a meta-learner with the concatenated binary out-
comes of each input model (figure 2). This allowed
us to evaluate the contribution of each input model
by mapping their weights to the final output lay-
ers of the meta-learner. To compare the performance
of our classifier to the other state-of-the-art neural
networks, we trained two other CNN models—the
aforementioned Hosseini and Guo (2019) model and
EEGnet (Lawhern et al 2018). EEGnet is superior
4 Mind-wandering in meditation tasks is defined as decoupling
from a focus on internal sensations (e.g. focusing on the breath)
instead of decoupling from processing external stimuli in a cognitive
task situation such as discussed in the current paper, which most
likely requires a different classifier.
in learning frequency patterns with only raw EEG
as the input. Its architecture also allows the contribution
of each feature to be mapped to its spatial pattern.
EEGnet has been proven effective at solving classic
BCI problems such as motor imagery and P300 speller
(Lawhern et al 2018).
Our study makes the first attempt to train study-
independent mind-wandering classifiers. With the
meta-learner, our network will allow for an easy
inspection of channel importance and feature maps.
2. Method
2.1. Datasets
The research was conducted in accordance with the
Declaration of Helsinki and approved by the Research
Ethics Committee of the Faculty of Arts (CETO),
University of Groningen. Participants gave writ-
ten informed consent. They were financially com-
pensated for their participation during the initial data
collection stage. They were debriefed with the main
goal of the task after the experiment.
2.1.1. Training dataset
The training dataset is derived from Jin et al (2019).
Thirty participants (13 females, ages 18–30 years,
M=23.33, SD =2.81) took part in the study.
They performed a visual search task and a sustained
attention to response task (SART) for six blocks each
in two sessions (figure 1(a)). The main stimuli in
the SART were English words that occurred in lower-
case for 89% of the time and in uppercase for 11%
of the time. Participants were required to press ‘m’
whenever they saw a lowercase word and to withhold
their response when an uppercase word appeared. In
the visual search task, participants were given a target
shape to search for at the beginning of each block. They
were required to look for the target in each trial of
that block and indicated if the target was in the search
panel (yes/no) by pressing the left or right arrow key
corresponding to their response. There was an equal
probability of the target-present and the target-absent
trials. An SART block had 135 trials and a visual
search block had 140 trials. Details can be found in
the original study (Jin et al 2019).
Participants were interrupted by probe questions
asking them about the content of their thinking at
that moment (figure 1(a)). They could respond to this
question with one of six options: (1) I was entirely
concentrated on the ongoing task; (2) I evaluated
aspects of the task (e.g. my performance or how long
it takes); (3) I thought about personal matters; (4) I
was distracted by my surroundings (e.g. noise, tem-
perature, my physical condition); (5) I was daydream-
ing, thinking of task-unrelated things; (6) I was not
paying attention, but my thought was not anywhere
specifically. Each task had 54 probes that were inter-
spersed with the trials. Two consecutive probes were
Figure 1. Trial sequences of the SART and the visual search (VS) task, and the probe illustrations for the (a) training and
(b) testing dataset. The orange frame indicates the stimuli around which the EEG epochs were extracted (each epoch comprised
the period starting at stimulus onset).
separated by 7–24 trials, corresponding to roughly 34–144 s.
In the analysis, responses 1 and 2 were labeled
as an on-task state; responses 3 and 5 were labeled as a
mind-wandering state, and the other responses were
ignored, following the same classification rule as the
original work.
2.1.2. Testing dataset
Performance of the classifier was tested on an entirely
different experiment from the one on which the clas-
sifier was trained. The testing dataset is derived from
Jin et al (2020). Thirty participants (16 females, age
18–31 years, M=23.73, SD =3.47) took part in
the study (figure 1(b)). In the visual search task, par-
ticipants either counted the specified target in the
following search panel and indicated their response
by pressing the number key (counting condition),
or passively viewed the search panel and pressed ‘j’
as a standard response (non-counting condition). In
the SART, participants viewed single digits ranging
from 1 to 9, drawn from a uniform distribution.
They pressed ‘j’ whenever they saw a digit other than
‘3’; if ‘3’ appeared (11%), they were to withhold their
response. The visual search task had 21 trials in each
block and 20 blocks in total. The SART had twelve
blocks with a block length that varied from two
to seven repetitions of the nine digits (18–63 trials).
Probes were shown at the end of each block in
both tasks. Block length varied from one to three
minutes. Further details can be found in the original
paper (Jin et al 2020).
In this dataset, the probes appeared at the end
of each block (which were much shorter than in
the training dataset). Participants indicated their
momentary attentional state on a rating scale of
−5 to 5 with anchors ‘−5’ for ‘totally mind-wandering’,
‘−2’ for ‘mind-wandering’, ‘0’ for ‘uncertain’, ‘2’ for
‘focused’, and ‘5’ for ‘highly focused’. Thus, positive
ratings were classified as on-task and negative ratings
as mind-wandering.
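The labeling rules for the two datasets can be summarized in a short sketch. The function names and the 'OT'/'MW' strings are illustrative choices, not taken from the original analysis scripts. Leaving a rating of 0 ('uncertain') unlabeled is an assumption, since the text only specifies how positive and negative ratings were classified.

```python
from typing import Optional

# Study A: six-option thought probe, numbered as in the text.
# Options 4 and 6 are absent from the mapping because those trials were ignored.
STUDY_A_LABELS = {1: "OT", 2: "OT", 3: "MW", 5: "MW"}


def label_study_a(response: int) -> Optional[str]:
    """Map a Study A probe option (1-6) to 'OT', 'MW', or None (ignored)."""
    return STUDY_A_LABELS.get(response)


def label_study_b(rating: int) -> Optional[str]:
    """Map a Study B rating (-5..5) to 'OT', 'MW', or None."""
    if rating > 0:
        return "OT"
    if rating < 0:
        return "MW"
    return None  # assumption: a rating of 0 ('uncertain') is left unlabeled
```

With these rules, probes answered with option 4 or 6 in Study A contribute no trials to either class.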
2.2. EEG preparation
2.2.1. Recording hardware
The training dataset had 128 channels, while the test-
ing dataset had 32 channels. All electrode locations
were within the International 10-10 System. In the
current analysis, we only considered the 32 chan-
nels that overlapped between the training and test-
ing studies (figure 2). The data were recorded using
the Biosemi ActiveTwo recording system. The online
sampling rate was 512 Hz. The Biosemi hardware
does not apply any high-pass filtering. Anti-aliasing
filtering is performed in the ADC’s decimation filter
(www.biosemi.com/faq/adjust_filter.htm).
2.2.2. Preprocessing
Data had already been preprocessed in the original
studies, and we reused these preprocessed data here.
The offline EEG preprocessing was done with the
EEGLAB toolbox in MATLAB. Continuous EEG was
re-referenced to the average signal of both mast-
oids. The band-pass filtering was set to be 0.5–40 Hz
and 0.1–42 Hz for the training and testing datasets,
respectively. Both datasets were down-sampled to
256 Hz. The original segmentation was [−400 1200]
ms with respect to stimulus onset for Study A, and
[−1000 3000] ms for Study B. In the current study,
Figure 2. Channel locations (32-channel 10-10 system). EEG from each channel is transformed into power spectrum and
single-trial ERP (stERP) contour maps. The 16 channels in red font were also used to compute the inter-site phase clustering
(ISPC). Each input type from each spatial location (channel or channel pair) was trained with three convolutional layers (of which
neuron numbers are 16, 32 and 32) with a max-pooling layer after each CNN. For raw EEG, we used a 1D CNN with a kernel
length of 5 to account for the pattern of every five temporal sampling points. For other types of input, we used a 2D CNN with a
kernel size of (5,3) to learn the pattern of 5 temporal sampling points and 3 frequency sampling points. The obtained model with
each input type from one spatial location serves as the input model for the meta-learner, which learned the relationship of the
binary outcomes between each input model with two fully connected (FC) layers and generated the final prediction.
we took the overlapping [−400 1000] ms as the time
window for analysis. The 200 ms before the stimulus
onset was used as the baseline. Ocular artifacts were
detected and removed using the infomax independ-
ent component analysis.
2.2.3. Transformation
Apart from the raw EEG data, three kinds of trans-
formations were performed on the original EEG
matrix—power and ISPC derived from a time-
frequency decomposition using the complex Morlet
wavelets (Cohen 2014), and stERP analysis (Bostanov
2004, Bostanov and Kotchoubey 2006, Jin et al 2019).
These transformations were performed in MATLAB
with an in-house script.
A complex Morlet wavelet is created using the
equation
$$\mathrm{cmw} = e^{-t^{2}/(2s^{2})}\, e^{i 2\pi f t}, \qquad s = \frac{n}{2\pi f}$$
where f denotes the frequency in Hz, and n refers to
the number of wavelet cycles. Convolving the com-
plex wavelet with the original signal gives complex dot
products, from which the amplitude can be extrac-
ted through the vector length of each complex data
point (power is obtained from the squared amp-
litudes, figure 2(c), and the phase angle is obtained
from the angle with respect to the positive real
axis).
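As a minimal illustration of this step (the original transformation was done with an in-house MATLAB script that is not shown here), the sketch below builds one complex Morlet wavelet, convolves it with a signal, and reads power and phase off the complex result. It is written in plain Python to stay self-contained. The function names, the 0.5 s wavelet half-width and the toy 10 Hz test signal are assumptions made for the example.

```python
import cmath
import math


def morlet_wavelet(freq, n_cycles, srate, half_width_s=0.5):
    """Complex Morlet wavelet exp(-t^2 / (2 s^2)) * exp(i 2 pi f t), s = n / (2 pi f)."""
    s = n_cycles / (2 * math.pi * freq)
    half = int(round(half_width_s * srate))  # assumed wavelet support: +/- 0.5 s
    times = [k / srate for k in range(-half, half + 1)]
    return [math.exp(-t * t / (2 * s * s)) * cmath.exp(2j * math.pi * freq * t)
            for t in times]


def convolve_same(signal, kernel):
    """Direct convolution returning len(signal) samples (zero-padded edges)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0j
        for k, w in enumerate(kernel):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += signal[j] * w
        out.append(acc)
    return out


def power_and_phase(signal, freq, n_cycles, srate):
    """Power = squared vector length; phase = angle to the positive real axis."""
    analytic = convolve_same(signal, morlet_wavelet(freq, n_cycles, srate))
    power = [abs(z) ** 2 for z in analytic]
    phase = [cmath.phase(z) for z in analytic]
    return power, phase


srate = 256
# Toy signal: a pure 10 Hz sinusoid lasting about 1.4 s (358 samples at 256 Hz).
epoch = [math.sin(2 * math.pi * 10 * k / srate) for k in range(358)]
power_10hz, phase_10hz = power_and_phase(epoch, freq=10, n_cycles=5, srate=srate)
power_30hz, _ = power_and_phase(epoch, freq=30, n_cycles=5, srate=srate)
```

For this toy signal, the power in the middle of the epoch is far larger at 10 Hz than at 30 Hz, which is the frequency selectivity the wavelet is meant to provide.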
A set of wavelets was created with frequencies ran-
ging from 4 to 40 Hz with 35 frequency sampling
points in logarithmic space, covering multiple fre-
quencies in each of the theta/alpha/beta/gamma
bands. Delta oscillations (1–3 Hz) were excluded
because our time window of 1400 ms was not long
enough to estimate delta power with sufficient accur-
acy. The upper boundary of the analyzed frequency
was limited by the bandpass filtering during the pre-
processing. The number of cycles used for the wavelet
increased from 3 to 7 in a logarithmic spacing on the
frequency axis.
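The frequency and cycle grids described above can be sketched as follows. The helper `logspace` is written out here for self-containment. Pairing the k-th frequency with the k-th cycle count is an assumption about how the two logarithmic grids were combined.

```python
import math


def logspace(start, stop, num):
    """Return num values from start to stop (inclusive), evenly spaced on a log axis."""
    log_start, log_stop = math.log10(start), math.log10(stop)
    step = (log_stop - log_start) / (num - 1)
    return [10 ** (log_start + k * step) for k in range(num)]


freqs = logspace(4, 40, 35)    # wavelet peak frequencies in Hz
cycles = logspace(3, 7, 35)    # number of wavelet cycles per frequency
# Gaussian width s = n / (2 pi f) of each wavelet, in seconds.
widths = [n / (2 * math.pi * f) for f, n in zip(freqs, cycles)]
```

Because the frequency grows faster than the number of cycles, the Gaussian width shrinks towards higher frequencies, so temporal precision improves as frequency increases.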
ISPC was computed as a measure of the con-
nectivity between channels (Cohen 2014):
$$\mathrm{ISPC}_{f} = \left| n^{-1} \sum_{t=1}^{n} e^{i(\Phi_{xt} - \Phi_{yt})} \right|$$
in which Φx and Φy are the phase angles from electrodes
x and y. ISPC is computed as the averaged phase
angle differences (which were also mapped to the
complex plane) in a moving time window. The length
of the averaged angle difference denotes the cluster-
ing (i.e. the more the phase difference remains con-
stant, the more their length adds up). The window
length increased from 3 to 5 cycles linearly with the
frequency. As shown in figure 2, there is an empty
region of half the wave length at both sides of each
frequency with no averaging data. Those regions were
set to be zero during machine learning. The ISPC
was computed in a 16-channel layout resulting in 120
channel pairs (Channel layout in red font, figure 2) to
reduce the number of channel pairs.
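A minimal sketch of the ISPC computation is given below, assuming the phase angle time series of two channels at one frequency are already available. The function names and channel labels are hypothetical. The window length is passed in samples, so converting a window of 3 to 5 cycles into samples (cycles × sampling rate / frequency) is left to the caller.

```python
import cmath
from itertools import combinations


def ispc(phase_x, phase_y):
    """Length of the mean unit vector of the phase differences (0 to 1)."""
    total = sum(cmath.exp(1j * (px - py)) for px, py in zip(phase_x, phase_y))
    return abs(total / len(phase_x))


def sliding_ispc(phase_x, phase_y, win_len):
    """ISPC in a moving window of 2 * (win_len // 2) + 1 samples.

    Time points without a full window keep the value zero, mirroring the
    zero-filled edge regions described in the text.
    """
    half = win_len // 2
    out = [0.0] * len(phase_x)
    for c in range(half, len(phase_x) - half):
        out[c] = ispc(phase_x[c - half:c + half + 1],
                      phase_y[c - half:c + half + 1])
    return out


# Hypothetical channel labels; only the count of 16 comes from the text.
channels = [f"ch{k:02d}" for k in range(16)]
channel_pairs = list(combinations(channels, 2))  # 16 * 15 / 2 = 120 pairs
```

Two signals with a constant phase lag give an ISPC of 1 regardless of the size of the lag, whereas phase differences spread evenly around the circle give a value near 0.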
Both the power and ISPC matrices were down-
sampled to 50 Hz given the fact that time-frequency
analysis smears out the signal over time.
The stERP analysis, which quantifies the bumps
in the EEG in terms of temporal location and amplitude,
was performed by computing the cross-covariance
between the signal and the kernel ψ(t):
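$$W(s,t) = \frac{1}{\sqrt{s}} \int_{-\infty}^{\infty} f(\tau)\, \psi\!\left(\frac{\tau - t}{s}\right) d\tau$$
$$\psi(t) = \left(1 - 16t^{2}\right) e^{-8t^{2}}.$$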
With two varying parameters, time lag t and
scale s (an indication of wavelength), the cross-covariances
can be mapped to a contour graph
(figure 2(e)). In the current study, the time lag (t)
ranges from 0 ms (stimulus onset) to 1000 ms (end
of the EEG epoch) following the same temporal res-
olution as the raw EEG. The scale (s) ranges from 1
to 2500 ms in logarithmic space with 300 sampling
points. The obtained matrix is down-sampled to 65
sampling points in the time lag and 30 points in the
scale (frequency dimension).
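The cross-covariance W(s, t) can be approximated on sampled data by replacing the integral with a sum over samples. The sketch below does this in plain Python for a toy signal. The function names, the 4 ms sampling step and the Gaussian test bump are assumptions made for illustration, and the subsequent down-sampling to 65 × 30 points is not shown.

```python
import math


def mexican_hat(x):
    """Kernel psi(x) = (1 - 16 x^2) * exp(-8 x^2)."""
    return (1 - 16 * x * x) * math.exp(-8 * x * x)


def sterp_map(signal, times_ms, scales_ms):
    """Discrete W(s, t): one row per scale s, one column per time shift t."""
    dt = times_ms[1] - times_ms[0]  # assumes evenly spaced samples
    rows = []
    for s in scales_ms:
        row = []
        for t in times_ms:
            acc = sum(v * mexican_hat((tau - t) / s)
                      for v, tau in zip(signal, times_ms))
            row.append(acc * dt / math.sqrt(s))
        rows.append(row)
    return rows


# Toy epoch: 0-1000 ms in 4 ms steps with one positive bump centred at 300 ms.
times_ms = [4 * k for k in range(251)]
bump = [math.exp(-((t - 300) ** 2) / (2 * 40 ** 2)) for t in times_ms]
w_map = sterp_map(bump, times_ms, scales_ms=[100, 200, 400])
peak_time = times_ms[w_map[1].index(max(w_map[1]))]  # peak of the 200 ms row
```

For this toy bump, each row of the map peaks at the latency of the bump, and the scale at which the peak is largest reflects the width of the bump. This is how the contour map encodes the temporal location and extent of single-trial ERP components.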
Together, this creates four types of inputs for the
CNNs: raw EEG, power, ISPC, and stERP (figure 2).
2.3. Neural network
2.3.1. Architecture of the neural network
The CNN input model and the meta-learner struc-
ture are shown in figure 2. Its design was based on
prior work reviewed by Roy et al (2019), complemen-
ted with an extensive parameter exploration. Specific-
ally, for the raw EEG input, we used three layers of a
1D CNN with a kernel size of five temporal points.
The neuron numbers are 16, 32 and 32 sequentially5.
Each convolutional layer is followed by a max-pooling
layer with a pooling size of 2. The output of the third
max-pooling layer is flattened and connected to two
FC layers with 200 and 50 neurons. The activation
function is ReLU for all the hidden layers. The out-
put layer uses a softmax activation function to cal-
culate the activation of one on-task neuron and one
mind-wandering neuron. The CNN structure for the
power, ISPC or stERP is similar to that of the raw EEG
except that a 2D CNN replaced the 1D CNN. The ker-
nel size of the 2D CNN is (3,5) to account for the
pattern of three frequency points and five temporal
points. Each 2D CNN layer is followed by a max-
pooling layer with a pooling size of (2,2). The learn-
ing was performed with categorical cross-entropy as
the loss function and optimized by Adam at a learning
rate of 0.0001. The training batch size is 120, iterated
over 200 epochs.
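To make the shrinking of the feature maps concrete, the following sketch tracks the size of the data through the three convolution and max-pooling stages. It assumes stride-1 convolutions without padding ('valid') and non-overlapping pooling with floor division; these settings are not stated in the text, so the resulting sizes are illustrative only. The raw EEG length of 180 samples is taken from the data size (32, 180) reported in footnote 6, and the stERP size of 30 × 65 from the down-sampling described above.

```python
def conv_out(size, kernel):
    """Output length of a stride-1 convolution without padding (assumption)."""
    return size - kernel + 1


def pool_out(size, pool):
    """Output length of non-overlapping max-pooling (floor division)."""
    return size // pool


def feature_map_shape(input_shape, kernel, pool, n_stages=3):
    """Spatial shape after n_stages of convolution followed by max-pooling."""
    shape = tuple(input_shape)
    for _ in range(n_stages):
        shape = tuple(pool_out(conv_out(d, k), p)
                      for d, k, p in zip(shape, kernel, pool))
    return shape


FILTERS = (16, 32, 32)  # number of filters in the three convolutional layers

# Raw EEG from one channel: 1D CNN, kernel length 5, pooling size 2.
raw_shape = feature_map_shape((180,), kernel=(5,), pool=(2,))
raw_flat = raw_shape[0] * FILTERS[-1]

# stERP map from one channel: 2D CNN, kernel (3, 5) over (scale, time), pooling (2, 2).
sterp_shape = feature_map_shape((30, 65), kernel=(3, 5), pool=(2, 2))
sterp_flat = sterp_shape[0] * sterp_shape[1] * FILTERS[-1]
```

Under these assumptions the raw EEG is reduced from 180 to 19 time points and the stERP map from 30 × 65 to 2 × 4 before flattening, so the first FC layer of 200 neurons receives 608 and 256 values, respectively.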
The meta-learner is trained with the concaten-
ated binary outcomes from each input model. For
the raw EEG, power and stERP, we trained 32 input
models (using data from each channel). For the ISPC,
we trained 120 input models (using phase cluster-
ing from each channel pair). Thus, altogether, we had
2×32 ×3+2×120 =432 outputs from all the
input models to form a (1, 432) vector as the input
to the meta-learner (Meta_Full). We also trained two
other meta-learners using a subset of those input
models. One is without the power but with the other
three input types (Meta_NoPower), because power
was the most overfitting input according to a pre-
liminary analysis (suppl. II). The third meta-learner
used only stERP (Meta_stERP), as this input type
performed best during validations in the preliminary
study (suppl. II).
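The input size of each meta-learner follows directly from the number of input models it stacks. The sketch below reproduces the arithmetic; the dictionary keys and function name are illustrative, and the sizes for Meta_NoPower and Meta_stERP are derived from the description rather than quoted from the text.

```python
N_CHANNELS = 32          # channels shared by both studies
N_CHANNEL_PAIRS = 120    # channel pairs used for the ISPC
OUTPUTS_PER_MODEL = 2    # one on-task and one mind-wandering neuron

# Number of input models trained per input type.
INPUT_MODELS = {
    "raw": N_CHANNELS,
    "power": N_CHANNELS,
    "stERP": N_CHANNELS,
    "ISPC": N_CHANNEL_PAIRS,
}


def meta_input_size(input_types):
    """Length of the concatenated output vector fed to the meta-learner."""
    return OUTPUTS_PER_MODEL * sum(INPUT_MODELS[t] for t in input_types)


meta_full = meta_input_size(["raw", "power", "stERP", "ISPC"])
meta_nopower = meta_input_size(["raw", "stERP", "ISPC"])
meta_sterp = meta_input_size(["stERP"])
```

This gives 432 inputs for Meta_Full, as stated above, 368 for the variant without power, and 64 for the stERP-only variant.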
The meta-learner used two FC layers consisting
of 100 and 20 neurons. The final output consisted of
two neurons that provide the prediction of on-task
or mind-wandering. The hidden layers of the meta-
learner used standard ReLU activation functions and
the output layer used softmax. The meta-learner was
trained with batches of 200 trials and iterated over 50
epochs, performed by categorical cross-entropy as the
loss function and optimized by Adam at a learning
rate of 0.0005.
We set the dropout to be 0.4 to train the input
models and 0.2 to train the meta-learner in order to
prevent overfitting. The proposed CNN models and
meta-learner were implemented on a workstation with
5 The number of neurons was chosen according to preliminary
results, in which we tested CNN designs with [16, 16, 8], [16, 32,
32] and [64, 64, 32] neurons. The results (Suppl. III) indicate an
improvement in ROCAUC during validation when increasing the neurons
from [16, 16, 8] to [16, 32, 32] (.513 and .519, respectively). However,
adding more neurons to obtain a [64, 64, 32] design did not
increase the validation ROCAUC (.519). Therefore, we chose
[16, 32, 32] as the proposed network design, following Occam’s
razor.
an Intel Xeon 2.2 GHz, 512 GB RAM, and a TITAN X,
a TITAN Xp, and two GeForce RTX 2080 graphics
cards with CUDA V10.1.243 (one or two of the four
GPUs were used, depending on availability) using Python 3.7
and the Keras machine learning library.
2.3.2. Validation and testing
A leave-N-participant-out cross-validation
(LNPOCV) was used to assess the performance. N was set
to 6, which corresponds to 20% of the datasets
given that the total sample consisted of 30 participants.
In that sense, we performed five-fold CV
across individuals. In each fold, we set aside 6 indi-
vidual datasets for validation purposes and trained
the classifiers with the other 24 individual datasets.
Each individual dataset was used once for the val-
idation. The performance was indicated by both the
accuracy and the area under the curve (AUC) of the
Receiver Operating Characteristic (ROC).
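The participant-wise split can be sketched as a small generator. The function name, the shuffling of participant IDs and the fixed seed are assumptions; the text only specifies that six participants are held out per fold and that each participant is used for validation once.

```python
import random


def leave_n_participants_out(participants, n_out, seed=0):
    """Yield (train_ids, validation_ids) so that each ID is validated exactly once."""
    ids = list(participants)
    random.Random(seed).shuffle(ids)  # assumption: random assignment to folds
    for start in range(0, len(ids), n_out):
        validation = ids[start:start + n_out]
        training = [p for p in ids if p not in validation]
        yield training, validation


# 30 participants with N = 6 give five folds of 24 training and 6 validation datasets.
folds = list(leave_n_participants_out(range(1, 31), n_out=6))
```

Splitting by participant rather than by trial keeps all trials of a held-out person out of the training set, which is what makes the validation score an estimate of across-individual generalization.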
2.3.3. Comparison models
We trained a CNN using the architecture by Hosseini
and Guo (2019)6 and an EEGnet7 for comparison
purposes. Both used only raw EEG from multiple
channels as input. The validation and testing proced-
ures remained the same.
3. Results
3.1. Network performance
We used trials within 12 s before each probe for both
the training and testing samples. This time window
results in 3169 on-task (OT) trials and 1976 mind-
wandering (MW) trials from Study A, and 1266 OT
and 424 MW trials from Study B.
We first examined the best normalization
approach for each input type in a preliminary ana-
lysis. Given that the classes were not balanced origin-
ally, we adjusted the weight of the cross-entropy loss
of the MW class while keeping the weight of the loss
of OT constant at 1 (by setting the class_weight argument of the
Keras Model.fit() method). The best normalization and class
weights for each input type can be found in Suppl.
I and II. Input types were normalized with the best
normalization and trained with the best class weights
to be the input models for the meta-learner.
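The effect of the class weight on the loss can be illustrated with a hand-written weighted cross-entropy. This is a sketch of the general idea (each trial's loss is multiplied by the weight of its true class before averaging), not a reproduction of the Keras internals. The weight of 1.6 for the MW class is a hypothetical value close to the OT/MW trial ratio in Study A; the weights actually used are reported in the supplementary material.

```python
import math


def weighted_cross_entropy(probs, labels, class_weight):
    """Mean categorical cross-entropy with a per-class weight on each trial.

    probs:  list of (p_OT, p_MW) softmax outputs, one pair per trial
    labels: list of true classes, 0 = on-task (OT), 1 = mind-wandering (MW)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -class_weight[y] * math.log(p[y])
    return total / len(labels)


# Toy predictions for three trials (hypothetical values).
probs = [(0.8, 0.2), (0.6, 0.4), (0.3, 0.7)]
labels = [0, 1, 1]

unweighted = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.6})
```

Raising the MW weight above 1 makes misclassified MW trials more costly, which counteracts the tendency of a classifier trained on imbalanced data to favor the majority (OT) class.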
The performance of the meta-learners and of
the comparison models is listed in table 1. All
models learned well during training, with a training
AUC above .75. However, the classification
performance dropped when predicting the
left-out datasets. The meta-learner with stERP performed
best during validation, achieving
59% accuracy and an AUC of .57, indicating a
6 The architecture of Hosseini and Guo (2019) accommodates an input size of (64, 8192),
while the current data size is (32, 180). We therefore reduced the kernel
length to fit our data size. The architecture and
other hyperparameters are the same as in Hosseini and Guo (2019).
7 EEGnet-8,2.
mild level of generalizability across individuals. The
meta-learner with power as the input (Meta_Full)
and the Hosseini2019 model showed strong signs of
overfitting. They both correctly identify almost 100%
of the training labels. However, their performance
on the validation datasets was almost at chance level
(ROCAUC .505 and .509 during LNPOCV).
Finally, we tested all the models by predicting
the data of Study B. The performance decreased fur-
ther, as expected. Only two of our meta-learners
(Meta_Full and Meta_stERP) can predict above
chance level in both accuracy and AUC. The meta-
learner with stERP (accuracy .519 and .506) is slightly
better than the meta-learner with all input types.
Interestingly, Hosseini2019 achieved the best accur-
acy (.545) during the testing while the ROCAUC
was .492. EEGnet2018 had relatively high ROCAUC
(.539), while the accuracy was lower (.457). Our
Meta_stERP model achieved balanced performance
in both accuracy and ROCAUC.
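Accuracy and ROC AUC measure different things: accuracy depends on the thresholded predictions, whereas AUC depends only on how the scores rank MW trials relative to OT trials. The toy example below, with made-up labels and scores, shows one way the two metrics can diverge when classes are imbalanced. It is an illustration of the metrics, not an analysis of the models in table 1.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of trials whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (equivalent to the Mann-Whitney U statistic).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 0 = on-task, 1 = mind-wandering; scores are P(MW).
labels = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.40, 0.30, 0.45, 0.35, 0.20, 0.25, 0.10, 0.30]

toy_accuracy = accuracy(labels, scores)
toy_auc = roc_auc(labels, scores)
```

In this toy case every trial is predicted as on-task, so the accuracy equals the share of the majority class (.75), while the AUC is well below .5 because the MW trials receive lower scores than most OT trials.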
3.2. Learned features
We investigated the importance of the input locations
by analyzing the weights of the input model in the
meta-learner. Given that Meta_stERP performed the
best, we mapped location importance using weights
derived from this model. The output of the meta-
learner consisted of one neuron responding to the
OT state and another neuron responding to the MW
state. As each input model also represented one OT
neuron and one MW neuron, the mapping gives two
weight topoplots for the meta-OT neuron (OT-OT,
MW-OT) and two weight topoplots for the meta-
MW neuron (OT-MW, MW-MW). To simplify the
interpretation, we performed PCA on the two weight
topoplots for each meta-output neuron and obtained
the largest PC accounting for 100% of the variance
(figure 3(a)). The importance of channel locations in the PC topoplots looks similar between the meta-OT and the meta-MW neurons. Some channels are important for both OT and MW activations (e.g. P4). Other channels contribute more to one of the two classes: for example, C3 drives the activation of the meta-OT neuron more than that of the meta-MW neuron, whereas O2 drives the activation of the meta-MW neuron more than that of the meta-OT neuron.
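The weight mapping and PCA step can be illustrated with a small sketch. It assumes that the two weight topoplots of one meta-output neuron are treated as two observations over channels. Under that assumption the centered data have rank one, which is why a single PC can account for 100% of the variance. The channel subset and weight values below are hypothetical and are not the fitted weights of Meta_stERP.

```python
import math


def first_pc_of_two_maps(map_a, map_b):
    """PCA with two topoplots as the two observations (channels = features).

    After centering, the two rows are +d/2 and -d/2 with d = map_a - map_b,
    so all variance lies along the single direction d / ||d||.
    Returns the channel loadings and the explained-variance ratio.
    """
    diff = [a - b for a, b in zip(map_a, map_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("maps are identical; there is no variance to decompose")
    loadings = [d / norm for d in diff]  # the sign of a PC is arbitrary
    total_variance = sum(2 * (d / 2) ** 2 for d in diff)  # summed over channels
    pc_variance = 2 * (norm / 2) ** 2  # projections are +norm/2 and -norm/2
    return loadings, pc_variance / total_variance


channels = ["C3", "P4", "O2"]  # hypothetical subset of channels
# Hypothetical weights from each channel's (OT, MW) input neurons
# to the meta-OT neuron.
weights_to_meta_ot = {"C3": (0.9, -0.4), "P4": (0.7, -0.6), "O2": (0.1, -0.1)}

ot_ot = [weights_to_meta_ot[c][0] for c in channels]  # OT-OT topoplot
mw_ot = [weights_to_meta_ot[c][1] for c in channels]  # MW-OT topoplot

loadings, explained = first_pc_of_two_maps(ot_ot, mw_ot)
for channel, loading in zip(channels, loadings):
    print(channel, round(loading, 3))
print("explained variance ratio:", round(explained, 3))
```

Channels with a large absolute loading are those whose OT and MW input weights differ most for the given meta-output neuron.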
We chose P4 as an example to plot the feature
maps because it is informative in activating both the
meta-OT and meta-MW neurons (figure 3(a)). Feature maps are based on the output of each convolutional layer (figure 3(b)). Comparing the feature maps of one OT trial and one MW trial, especially those of the third convolutional layer, we found that features were identified throughout the whole time series and across the whole scale range of the stERP, indicating that all the ERP components are likely to be predictive. The activation of the middle scale range seems to be more pronounced in the MW trial than in the OT
trial. Given that the middle scale range corresponds to the wavelength range of early sensory evoked potentials, this suggests that early sensory processing is likely to be informative about mind-wandering.

Table 1. Network performance as indicated by accuracy and ROC area under the curve (AUC). The three meta models are the meta-learner with all the input types (Meta_Full), without power as input (Meta_noPower) and with single-trial ERP only (Meta_stERP).

                                Meta_Full  Meta_noPower  Meta_stERP  Hosseini2019  EEGnet2018
Training             Accuracy   .980       .828          .806        1             .754
                     ROCAUC     .980       .828          .806        1             .754
Validation (LNPOCV)  Accuracy   .534       .579          .587        .539          .468
                     ROCAUC     .505       .565          .569        .509          .450
Testing              Accuracy   .511       .492          .519        .545          .457
                     ROCAUC     .501       .485          .506        .492          .539

Note: Meta_stERP performed best during the leave-N-participant-out cross-validation (LNPOCV).

Figure 3. (a) Mapping of channel importance of the meta-learner trained with stERP. The weights of each input model to the meta-learner outputs are marked in the matrices. Each matrix is further mapped to two weight topoplots for the input of OT and MW activation, respectively. To simplify the interpretation, the weight topoplots are decomposed by means of PCA to obtain the largest PC for each meta-output neuron. (b) Feature maps of each convolutional layer with one OT and one MW trial as examples.
4. Discussion
In the current study, we trained meta-learning neural
networks with multiple CNN input models to clas-
sify mind-wandering. Each input model was trained
with one EEG input type from one spatial sampling
point to predict mind-wandering separately. Thus,
the meta-learner not only learned a prediction based
on combining all the input models but also allowed us
to map the importance of each channel by examining
the weights of the input models to the meta-learner
outputs.
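The stacking arrangement described above can be sketched as follows. The sketch assumes that the meta-learner is a single linear layer with a softmax over the OT and MW outputs of the input models; this is an illustrative simplification and not necessarily the architecture used in the study. All weights and input probabilities are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def meta_forward(input_outputs, weights, biases):
    """Combine per-channel input models into one OT/MW prediction.

    input_outputs: one (p_OT, p_MW) pair per input model (one per channel).
    weights[k]:    one (w_from_OT, w_from_MW) pair per input model for
                   meta neuron k, where k = 0 is meta-OT and k = 1 is meta-MW.
    """
    logits = []
    for k in range(2):
        z = biases[k]
        for (p_ot, p_mw), (w_ot, w_mw) in zip(input_outputs, weights[k]):
            z += w_ot * p_ot + w_mw * p_mw
        logits.append(z)
    return softmax(logits)


# Two hypothetical input models (e.g. two channels), both leaning towards OT.
input_outputs = [(0.8, 0.2), (0.6, 0.4)]
weights = [
    [(1.0, -1.0), (0.5, -0.5)],  # to the meta-OT neuron
    [(-1.0, 1.0), (-0.5, 0.5)],  # to the meta-MW neuron
]
biases = [0.0, 0.0]

probs = meta_forward(input_outputs, weights, biases)
print("P(OT), P(MW):", [round(p, 3) for p in probs])
```

Because every input model contributes through its own pair of weights, reading these weights per channel yields the importance maps discussed in section 3.2.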
The current results demonstrate the difficulty of
achieving mind-wandering detection that is gener-
alizable across individuals or studies. The general-
ization across individuals within the same study is
shown during the LNPOCV. While the CNN classifiers and the meta-learners performed well on the training datasets (above .75 for the ROCAUC), their performance on datasets they had never seen dropped below .6 in both accuracy and AUC. Furthermore, the performance on the testing datasets derived from an independent study indicated that across-study prediction was hardly achieved.
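The across-individual validation relies on splitting the data by participant rather than by trial. Below is a minimal sketch of a leave-N-participant-out split with N = 6. The participant identifiers, their number and the helper name are hypothetical; the released analysis code may organise the folds differently.

```python
import random


def leave_n_participants_out(participant_ids, n_out=6, seed=0):
    """Yield (train_ids, validation_ids) pairs.

    Every participant is held out exactly once. The last fold is smaller
    when the number of participants is not a multiple of n_out.
    """
    ids = sorted(set(participant_ids))
    rng = random.Random(seed)  # fixed seed keeps the folds reproducible
    rng.shuffle(ids)
    for start in range(0, len(ids), n_out):
        held_out = ids[start:start + n_out]
        held_out_set = set(held_out)
        train = [p for p in ids if p not in held_out_set]
        yield train, held_out


# Hypothetical cohort of 30 participants.
participants = [f"sub-{i:02d}" for i in range(1, 31)]
folds = list(leave_n_participants_out(participants, n_out=6))

for train_ids, validation_ids in folds:
    # All trials of a validation participant stay out of training,
    # so the validation score reflects generalization to unseen people.
    print(len(train_ids), "train /", len(validation_ids), "held out")
```

Splitting at the participant level prevents trials of the same person from appearing in both the training and the validation data, which would otherwise inflate the validation performance.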
Nevertheless, the currently proposed meta-learner with stERP as the input achieved the best performance during validation and the most balanced performance across accuracy and ROCAUC during testing.
We examined the channel importance by mapping
the weights of each input model to the outcomes of
the meta-learner and found a similar channel importance map between the final OT and MW neurons as learnt by the meta-learner. Finally, by looking at the
feature maps between classes, we could understand
how the CNN learned the patterns from the stERP
contour maps.
The largest limitation of the current study is the limited generalizability of the CNN classifiers, even though we addressed the problem with state-of-the-art neural networks. We attribute the main cause to the heterogeneity of mind-wandering. On the one hand, individuals differ in their mind-wandering thoughts as well as in the patterns of neural activation associated with mind-wandering (Christoff et al 2016, Wang et al 2018, Zanesco et al 2021a). On the other hand, mind-wandering while performing another task is essentially a dual-tasking process: individuals are likely to keep working on their primary task without the performance being interrupted if the primary task is
undemanding or habitual (van Vugt et al 2015). In that case, ‘free’ cognitive resources are available to be used by mind-wandering (Taatgen et al 2021), making mind-wandering generation ‘hidden’ and difficult to discriminate in neural data. How to improve the accuracy of self-reports, or more generally the precision of mind-wandering data collection, is a methodological issue in experimental psychology that is outside the scope of the current EEG decoding study.
Interestingly, when we trained the current neural network with the same datasets used in Hosseini and Guo (2019) and tested the modeling performance during similar across-individual predictions, we achieved accuracies of 76.0% and 70.4%, which are higher than the 67.63% and 65.26% reported in the original study. This indicates that the current neural network is also suitable for learning EEG signals in BCI studies. It seems that the modeling performance varies according to the classification goal: inter-individual generalizability is easier to achieve with a model trained on a single task than with one trained on multiple tasks.
As indicated, the current architecture can also
be used to detect mind-wandering in an online set-
ting (i.e. for BCI). Based on the current results, for
such applications we recommend considering indi-
vidual classifiers instead of inter-individual classifi-
ers to detect task-general mind-wandering, or altern-
atively to use inter-individual classifiers to study
mind-wandering within the same task. At this point, task-general inter-individual mind-wandering classifiers do not yet reach sufficiently high performance.
5. Conclusion
The current study indicates that a generalizable clas-
sifier to detect study-independent mind-wandering
episodes with scalp EEG remains challenging. Nev-
ertheless, we found that the meta-learner with input
models trained with stERP contour maps performed
the best. We also showed how this work can con-
tribute to explainable artificial intelligence by giv-
ing an example of how channel contributions and
the learned features can be examined by means of
the weights of the input models and the feature
maps.
Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://unishare.nl/index.php/s/T94LXPQqw5FEA4J.
Code availability statement
Analysis code is available at https://github.com/christina109/MW_EEG_CNN.
ORCID iD
Christina Yi Jin https://orcid.org/0000-0002-1482-4444
References
Abdiansah A and Wardoyo R 2015 Time complexity analysis of
support vector machines (SVM) in LibSVM Int. J. Comput.
Appl. 128 28–34
Bastian M and Sackur J 2013 Mind wandering at the fingertips:
automatic parsing of subjective states based on response
time variability Front. Psychol. 4 573
Bostanov V 2004 BCI competition 2003-data sets Ib and IIb:
feature extraction from event-related brain potentials with
the continuous wavelet transform and the t-value scalogram
IEEE Trans. Biomed. Eng. 51 1057–61
Bostanov V and Kotchoubey B 2006 The t-CWT: a new ERP
detection and quantification method based on the
continuous wavelet transform and student’s t-statistics Clin.
Neurophysiol. 117 2627–44
Cheyne J A, Carriere J S A and Smilek D 2006
Absent-mindedness: lapses of conscious awareness and
everyday cognitive failures Conscious. Cogn. 15 578–92
Christoff K, Gordon A M, Smallwood J, Smith R and Schooler J W
2009 Experience sampling during fMRI reveals default
network and executive system contributions to mind
wandering Proc. Natl Acad. Sci. USA 106 8719–24
Christoff K, Irving Z C, Fox K C R, Spreng R N and
Andrews-Hanna J R 2016 Mind-wandering as spontaneous
thought: a dynamic framework Nat. Rev. Neurosci.
17 718–31
Cohen M X 2014 Analyzing Neural Time Series Data: Theory and
Practice (Cambridge, MA: MIT Press)
Cole S and Kvavilashvili L 2019 Spontaneous future cognition: the
past, present and future of an emerging topic Psychol. Res.
83 631–50
Compton R J, Gearinger D and Wild H 2019 The wandering mind
oscillates: EEG alpha power is enhanced during moments of
mind-wandering Cogn. Affect. Behav. Neurosci. 19 1184–91
Groot J M, Boayue N M, Csifcsák G, Boekel W, Huster R,
Forstmann B U and Mittner M 2021 Probing the neural
signature of mind wandering with simultaneous fMRI-EEG
and pupillometry NeuroImage 224 117412
Ho N S P, Wang X, Vatansever D, Margulies D S, Bernhardt B,
Jefferies E and Smallwood J 2019 Individual variation in
patterns of task focused, and detailed, thought are uniquely
associated within the architecture of the medial temporal
lobe NeuroImage 202 116045
Hosseini S and Guo X 2019 Deep convolutional neural network
for automated detection of mind wandering using EEG
signals Proc. 10th ACM Int. Conf. on Bioinformatics,
Computational Biology and Health Informatics (https://doi.
org/10.1145/3307339.3342176)
Huijser S, van Vugt M K and Taatgen N A 2018 The wandering
self: tracking distracting self-generated thought in a
cognitively demanding context Conscious. Cogn. 58 170–85
Jin C Y, Borst J P and van Vugt M K 2019 Predicting task-general
mind-wandering with EEG Cogn. Affect. Behav. Neurosci.
19 1059–73
Jin C Y, Borst J P and van Vugt M K 2020 Distinguishing vigilance
decrement and low task demands from mind-wandering: a
machine learning analysis of EEG Eur. J. Neurosci.
52 4147–64
Kam J W Y and Handy T C 2013 The neurocognitive
consequences of the wandering mind: a mechanistic account
of sensory-motor decoupling Front. Psychol. 4 725
Kawashima I and Kumano H 2017 Prediction of mind-wandering
with electroencephalogram and non-linear regression
modeling Front. Hum. Neurosci. 11 365
Lawhern V J, Solon A J, Waytowich N R, Gordon S M, Hung C P
and Lance B J 2018 EEGNet: a compact convolutional neural
network for EEG-based brain–computer interfaces J. Neural
Eng. 15 056013
Mittner M, Boekel W, Tucker A M, Turner B M, Heathcote A and
Forstmann B U 2014 When the brain takes a break: a
model-based analysis of mind wandering J. Neurosci.
34 16286–95
Randall J G, Beier M E and Villado A J 2019 Multiple routes to
mind wandering: predicting mind wandering with resource
theories Conscious. Cogn. 67 26–43
Roy Y, Banville H, Albuquerque I, Gramfort A, Falk T H and
Faubert J 2019 Deep learning-based electroencephalography
analysis: a systematic review J. Neural Eng. 16 051001
Schooler J W, Smallwood J, Christoff K, Handy T C, Reichle E D
and Sayette M A 2011 Meta-awareness, perceptual
decoupling and the wandering mind Trends Cogn. Sci.
15 319–26
Seli P, Cheyne J A and Smilek D 2013 Wandering minds and
wavering rhythms: linking mind wandering and
behavioral variability J. Exp. Psychol. Hum. Percept.
Perform. 39 1–5
Smallwood J and Schooler J W 2015 The science of mind
wandering: empirically navigating the stream of
consciousness Ann. Rev. Psychol. 66 487–518
Taatgen N A, van Vugt M K, Daamen J, Katidioti I, Huijser S and
Borst J P 2021 The resource-availability model of distraction
and mind-wandering Cogn. Syst. Res. 68 84–104
Unsworth N and Robison M K 2016 Pupillary correlates of lapses
of sustained attention Cogn. Affect. Behav. Neurosci.
16 601–15
van Vugt M K, Taatgen N A, Sackur J and Bastian M 2015
Modeling mind-wandering: a tool to better understand
distraction Proc. 13th Int. Conf. on Cognitive Modeling
(ICCM) (Groningen, the Netherlands)
Wang H T, Poerio G, Murphy C, Bzdok D, Jefferies E and
Smallwood J 2018 Dimensions of experience: exploring the
heterogeneity of the wandering mind Psychol. Sci. 29 56–71
Zanesco A P, Denkova E and Jha A P 2021a Associations between
self-reported spontaneous thought and temporal sequences
of EEG microstates Brain Cogn. 150 105696
Zanesco A P, Denkova E and Jha A P 2021b Self-reported mind
wandering and response time variability differentiate
prestimulus electroencephalogram microstate dynamics
during a sustained attention task J. Cogn. Neurosci. 33 28–45
Zheng Y, Wang D, Zhang Y and Xu W 2019 Detecting mind
wandering: an objective method via simultaneous control of
respiration and fingertip pressure Front. Psychol. 10 216