Predicting Secondary Task Performance: A Directly
Actionable Metric for Cognitive Overload Detection
Pierluigi Vito Amadori, Member, IEEE, Tobias Fischer, Member, IEEE, Ruohan Wang, Member, IEEE,
Yiannis Demiris, Senior Member, IEEE
Abstract—In this paper, we address cognitive overload detection from unobtrusive physiological signals for users in dual-tasking scenarios. Anticipating cognitive overload is a pivotal challenge in interactive cognitive systems and could lead to safer shared-control between users and assistance systems. Our framework builds on the assumption that decision mistakes on the cognitive secondary task of dual-tasking users correspond to cognitive overload events, wherein the cognitive resources required to perform the task exceed the ones available to the users. We propose DecNet, an end-to-end sequence-to-sequence deep learning model that infers in real-time the likelihood of user mistakes on the secondary task, i.e., the practical impact of cognitive overload, from eye-gaze and head-pose data. We train and test DecNet on a dataset collected in a simulated driving setup from a cohort of 20 users on two dual-tasking decision-making scenarios, with either visual or auditory decision stimuli. DecNet anticipates cognitive overload events in both scenarios and can perform in time-constrained scenarios, anticipating cognitive overload events up to 2s before they occur. We show that DecNet's performance gap between audio and visual scenarios is consistent with user perceived difficulty. This suggests that single modality stimulation induces higher cognitive load on users, hindering their decision-making abilities.

Index Terms—Cognitive Workload, User Monitoring, Decision Anticipation, Simulated Driving.
I. INTRODUCTION

Cognitive load modeling has received significant research interest in recent years thanks to its wide range of applications, spanning from human-robot interaction [1] to human-computer interaction [2] and intelligent vehicles [3], [4], [5]. Accurate inference of a user's cognitive state in real-time could lead to disruptive benefits towards optimized interface designs and adaptive user interfaces [4], [6], more effective and situation-aware robots [7], [8], as well as safer and smarter vehicles [9], [10], [11]. Modeling and inferring human cognitive states is also an inherently multidisciplinary task, where numerous fields, such as psychology [12], neuroscience [13], engineering and artificial intelligence [14], [15], overlap.
Manuscript received May 17, 2021; revised August 13, 2021; accepted September 5, 2021. This work was supported in part by UK DSTL/EPSRC Grant EP/P008461/1, and a Royal Academy of Engineering Chair in Emerging Technologies to Yiannis Demiris.
All authors are with the Personal Robotics Lab, Department of Electrical and Electronic Engineering, Imperial College London, SW7 2BT, U.K. (e-mail: {p.amadori, r.wang16, y.demiris}@imperial.ac.uk; info@tobiasfischer.info).
Research presented in this paper is a continuation of Amadori et al. [16].

Despite the evident applications in human-robot interaction and intelligent vehicles, cognitive state inference has not yet reached a level suitable for real-world applications, compared to other machine learning domains, e.g., computer vision [17], [18], [19] and natural language processing [20], [21]. Moreover, cognitive load inference does not provide a directly actionable feedback signal for real-world assistance systems, as humans may still perform well under stress [22]. For this reason, we propose in this paper a paradigm shift from cognitive load inference towards cognitive overload detection. We identify as cognitive overload the instances in which the amount of cognitive resources required to perform a task exceeds those currently available to a user, leading to a severe decrease in both task performance and safety.
Driving is a popular scenario for cognitive state modeling [10], [23], [15], as it is a highly cognitively demanding task that requires drivers to be constantly aware of the surrounding environment while continuously making decisions and taking actions [24]. Also, the possible causes of cognitive overload and distraction in drivers are numerous [18]: visual, e.g., eyes off-road due to the use of mobile phones or in-vehicle information systems, or auditory, e.g., mind off-road due to holding hands-free cellphone conversations or even using e-mail systems. While visual distractions have a clear and observable effect, e.g., the driver is not looking at the road, auditory/cognitive distractions have more subtle effects, e.g., driving performance degrades and hazard perception is hindered [18]. Thus, the design of systems that can infer cognitive distraction is critical to improve safety, albeit particularly challenging [25].
In this paper, we focus our attention on a dual-task human-in-the-loop simulated virtual reality (VR) driving scenario. Here, human participants are tasked to drive and avoid obstacles (primary task) while performing a cognitively demanding "n-back task" (secondary task) [26]. Leveraging the known relationship between task performance and cognitive overload [27], we assume that mistakes on the secondary task correspond to cognitive overload instances in the participant. Differently from previous cognitive load inference methods [15], [25], the proposed approach provides a directly actionable and unambiguous feedback signal for assistance systems, as it anticipates the practical effects of cognitive overload. Given the focus on cognitive overload detection in simulated VR driving, we use the terms user and driver interchangeably throughout the paper.

We investigate whether we can predict the practical impacts of cognitive overload events and distraction on secondary task decision making from unobtrusive physiological signals from the driver, namely eye gaze and head pose. To do this, we exploit the widespread availability of affordable and unobtrusive sensors and improvements in algorithms [28], [29] to collect physiological data from drivers. Our work may be interpreted as an extension of conventional cognitive load classification [4], wherein we demonstrate that predicting the correctness of cognitive state-dependent decisions is feasible.
The contributions of the paper are:
1) We present an end-to-end long short-term memory (LSTM; [30])-based model, namely DecNet, for anticipating cognitive overload events in humans. The proposed approach can reliably infer in real-time the likelihood of a user's mistake on an imminent secondary task decision;
2) We collect a dataset containing physiological and behavioral data from a cohort of twenty participants in a realistic driver-in-the-loop virtual reality simulation. Participants were instructed to drive along the road while avoiding obstacles (primary task) and to make cognitive-based decisions (secondary task) in two separate scenarios with visual and auditory decision stimuli, respectively;
3) We analyze DecNet's performance on these scenarios and investigate the effects that combined visual-auditory and visual-visual stimuli have on cognitive stimulation and decision-making in the driver;
4) We demonstrate that DecNet estimates the task difficulty in the visual-visual scenario to be higher than that of the visual-auditory scenario, which is in line with the task's perceived difficulty obtained from questionnaires, as well as several models of multitasking [31], [32].
The rest of the paper is organized as follows: Section II provides a detailed overview of related works. Section III formalizes the problem of decision anticipation and introduces the proposed model, DecNet, to solve it. Section IV explains in detail both the experimental protocol and the data collection/pre-processing procedure. Section V provides an in-depth presentation of the experimental setup used to evaluate and test DecNet performance on the collected dataset. Section VI analyzes the results and performance achieved by DecNet against both classic methods and comparable recurrent neural network architectures, and investigates the impact of multimodal stimuli on driver performance. Finally, Section VII summarizes the contributions of the paper and its main limitations, and outlines future research directions.
II. RELATED WORKS
Our focus on cognitive overload detection during simulated
driving is closely related to cognitive load classification, human-
machine interaction, and the role of secondary tasks during
driving. This section overviews related literature.
A. Cognitive Load Classification
Cognitive load classification is inherently a very complex task, as the different cognitive load levels experienced by humans are not directly measurable [33]. When addressing cognitive load classification, sophisticated feature engineering is often required to improve data quality and extract useful features from raw sensor signals, e.g., [4], [10], [23], [25]. These studies have demonstrated statistical correlations between cognitive load and physiological signals, although these correlations can vary significantly among experiment participants.
Fig. 1. Block diagram of the DecNet framework for decision correctness anticipation. We frame the problem as sequence-to-sequence supervised learning. We use a Virtual Reality (VR) headset with integrated head pose and eye gaze tracking to monitor the user (blue). The simulated environment prompts the user to make decisions according to a specific policy. The correctness of the decisions identifies the labels (orange), while head pose and eye gaze data represent the inputs of DecNet. The final goal of DecNet (green) is to anticipate the correctness likelihood of future secondary task decisions, which is indicative of events of cognitive overload in users.
Personalized models can tackle the problem, as seen in [34], [35]; however, such models may become impractical, as data collection and model training are required for every new user. Toward this end, [15] introduced a novel end-to-end framework for real-time cognitive load classification, where the network is capable of learning useful feature representations directly from data.
B. Gaze Patterns in Cognitive Human-Machine Interaction

Gaze patterns have been widely used in cognitive human-machine interaction. For example, [36] used gaze patterns to infer a user's level of domain knowledge in the domain of genomics, while [37] focused on knowledgeability prediction using a noninvasive eye-tracking method on mobile devices with Support Vector Machines (SVMs).

Gaze patterns also allow for the detection of cognitive-behavioral patterns [38] and internal thought (directing attention away from a primary visual task) [39] in intelligent user interfaces. Interestingly, it was also shown that a human's gaze is a requirement for perspective-taking in human-robot interactions, which allows a robot to infer the world's characteristics from the human's viewpoint [40], [41]. In a similar manner, studies have shown that there are clear correlations between gaze patterns and cognitive load [33], [42].
C. Multitasking in Driving

Multitasking scenarios have been extensively employed by the assistive and intelligent vehicles research communities [1], [43], [44], [45], [46]. In [43], the authors investigated the impact of secondary tasks on driving performance and showed that they lead to clear safety-related issues, such as off-road glances and unplanned lane deviations. Recently, [45] investigated how engaged drivers are in secondary tasks in vehicles with different degrees of automation. In [46], the authors investigated the role of secondary tasks in highly automated vehicles and studied how drivers regulate their resources to complete primary and secondary tasks and how they react during take-over requests. In [47], various secondary tasks were classified based on EEG dynamics. While very good accuracy was achieved, EEG is intrusive and subject-dependent, whereas our method generalizes across subjects and is based on easy-to-access signals.

Our work is also related to that of Ersal et al. [48], who proposed a radial-basis neural network to predict the actions that a driver would have taken had there not been a secondary task present. In [49], based on the finding that secondary tasks impact the driver's driving abilities, the take-over readiness of drivers was modeled by explicitly taking the secondary task into account. Also, Engström et al. [50] introduced a framework that predicts the effects of cognitive load on driver performance, and argued that secondary tasks hinder driving tasks that rely on cognitive control, while automatic performance is unaffected.
D. Final Remarks

In the sections above, we have shown how the proposed study builds on past literature both for its experimental assumptions and in its model design. We use decisions on a cognitive secondary task as a proxy for cognitive overload instances, as numerous studies have shown that secondary tasks have direct effects on driver behavior, safety and cognitive states. Also, our choice of head pose and eye gaze as input signals to DecNet is supported by findings in cognitive human-machine interaction studies, which have shown that gaze patterns strongly correlate with human cognitive states.

Differently from the cognitive load inference literature, we propose a paradigm shift from cognitive load classification towards cognitive overload detection via secondary task decision correctness anticipation. The proposed paradigm shift offers a directly actionable feedback metric that assistive systems can use to intervene and prevent the practical impacts of cognitive overload instances in users. Also, we propose a novel generalized user-agnostic model that can operate in real-time and that employs a sequence-to-sequence learning paradigm to encourage feature extraction from early observations.
III. PROPOSED SOLUTION
We frame cognitive overload detection as a supervised classification problem where the correctness of secondary task decisions is used as a label, as shown in Fig. 2. Specifically, we consider a dataset

$$\mathcal{D} = \left\{ \mathbf{X}_j,\, y_j \right\}_{j=1}^{N}, \tag{1}$$

where $\mathbf{X}_j$ is a temporal sequence of size $N_\text{steps}$, $y_j$ represents the corresponding label, and $N$ identifies the total number of samples. We omit the time dependency of $\mathbf{X}_j$ to simplify notation. In our study, $\mathbf{X}_j$ denotes a sequence of physiological signals, and $y_j$ the correctness of the secondary task decision, as a proxy for cognitive overload events. Our supervised classification problem uses binary labels, which are derived by evaluating whether the secondary task decision performed by the driver is correct ($y_j = 1$) or wrong ($y_j = 0$). Please see Section IV for a detailed presentation of the experimental procedure and data collection.

Fig. 2. Overview of DecNet for secondary task decision correctness anticipation during training (left) and inference (right). Sequential sensor readings from the driver $\mathbf{s}(t)$ are collected by a sliding window of length $t_\text{win}$ to extract the sequence input to DecNet, $\mathbf{X}_j$. The RNN stage first processes the input into a sequence of features/hidden states $\dot{\mathbf{h}}_t$ via Eq. (3), which is then used as input to the LSTM stage according to Eq. (4). Finally, the hidden states of the LSTM are projected to decision correctness likelihood via Eq. (9).
Since we framed cognitive overload detection as a classification problem, the final goal is to identify a model $p_\theta$ that minimizes the cross-entropy loss $\mathcal{L}$ as

$$\mathcal{L} = -\sum_{j=1}^{N} \log p_\theta(y_j \mid \mathbf{X}_j). \tag{2}$$

In this paper, we parametrize $p_\theta$ with DecNet, which comprises a cascade of two sequential models: a recurrent neural network (RNN) and a long short-term memory network (LSTM) [30].
A. Decision Anticipation Network (DecNet)

DecNet is a two-stage end-to-end sequential model that jointly learns to extract the most relevant features via an RNN module and to exploit them via an LSTM network in order to infer cognitive overload events by anticipating the correctness likelihood of an imminent decision, as shown in Fig. 2. In other words, the hidden states of the RNN, see Eq. (3), are used as input to the LSTM module. Finally, we project the hidden states of the LSTM stage with a multilayer perceptron (MLP) with a Rectified Linear Unit (ReLU) nonlinearity, followed by a softmax layer to predict the decision correctness probability.

Given a sequence of observations $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{N_\text{steps}})$, where $\mathbf{x}_t \in \mathbb{R}^{N_x}\ \forall t$, the initial RNN stage operates as

$$\dot{\mathbf{h}}_t = \tanh(\mathbf{W}_\text{rnn}\mathbf{x}_t + \mathbf{H}_\text{rnn}\dot{\mathbf{h}}_{t-1} + \mathbf{b}_\text{rnn}), \tag{3}$$

where $\dot{\mathbf{h}}_t \in \mathbb{R}^{N_\text{rnn}}$ identifies the hidden state/feature vector at time step $t$ and $\tanh(\cdot)$ represents the hyperbolic tangent function. The output of the RNN stage is a sequence of hidden states/feature vectors $(\dot{\mathbf{h}}_1, \dot{\mathbf{h}}_2, \ldots, \dot{\mathbf{h}}_{N_\text{steps}})$, which is used as input to the LSTM network for cognitive overload detection via secondary task decision correctness anticipation. The parameters to be learned at this stage are $\mathbf{W}_\text{rnn} \in \mathbb{R}^{N_\text{rnn} \times N_x}$, $\mathbf{H}_\text{rnn} \in \mathbb{R}^{N_\text{rnn} \times N_\text{rnn}}$ and $\mathbf{b}_\text{rnn} \in \mathbb{R}^{N_\text{rnn}}$.

The LSTM network stage operates on the sequence of feature vectors and outputs a sequence of hidden states $\mathbf{h}_t \in \mathbb{R}^{N_\text{lstm}}$, as follows:

$$\mathbf{i}_t = \sigma(\mathbf{W}_i\mathbf{h}_{t-1} + \mathbf{I}_i\dot{\mathbf{h}}_t + \mathbf{b}_i) \tag{4}$$
$$\mathbf{f}_t = \sigma(\mathbf{W}_f\mathbf{h}_{t-1} + \mathbf{I}_f\dot{\mathbf{h}}_t + \mathbf{b}_f) \tag{5}$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_o\mathbf{h}_{t-1} + \mathbf{I}_o\dot{\mathbf{h}}_t + \mathbf{b}_o) \tag{6}$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}_c\mathbf{h}_{t-1} + \mathbf{I}_c\dot{\mathbf{h}}_t + \mathbf{b}_c) \tag{7}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t), \tag{8}$$

where $\mathbf{i}_t$, $\mathbf{f}_t$, $\mathbf{o}_t$, and $\mathbf{c}_t$ identify the input gate, forget gate, output gate, and memory cell, respectively. The parameters to be learned are $\mathbf{W}_\star \in \mathbb{R}^{N_\text{lstm} \times N_\text{lstm}}$, $\mathbf{I}_\star \in \mathbb{R}^{N_\text{lstm} \times N_\text{rnn}}$ and $\mathbf{b}_\star \in \mathbb{R}^{N_\text{lstm}}$, where $\star$ is used to represent $\{i, f, o, c\}$.

Finally, after computing the hidden states of the LSTM stage, DecNet performs a probability projection via a fully connected layer followed by a softmax activation, as

$$\hat{\mathbf{y}}_t = \operatorname{softmax}(\mathbf{W}_y\mathbf{h}_t + \mathbf{b}_y), \tag{9}$$

where $\mathbf{W}_y \in \mathbb{R}^{2 \times N_\text{lstm}}$ and $\mathbf{b}_y \in \mathbb{R}^2$ are the parameters to be learned for the projection stage.
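To make the two-stage architecture concrete, the following is a minimal sketch of Eqs. (3)-(9) in Keras, which Section V-B reports as the implementation framework. The sequence length, feature dimensionality and hidden sizes below are illustrative assumptions, not the values used in the paper.

```python
# Minimal Keras sketch of the two-stage DecNet architecture (assumed sizes).
from tensorflow.keras import layers, models

N_STEPS, N_X = 30, 88    # sequence length and per-step feature size (assumed)
N_RNN, N_LSTM = 64, 64   # hidden sizes of the two stages (assumed)

inputs = layers.Input(shape=(N_STEPS, N_X))
# Stage 1, Eq. (3): vanilla RNN with tanh, returning the full hidden-state sequence.
feats = layers.SimpleRNN(N_RNN, activation="tanh", return_sequences=True)(inputs)
# Stage 2, Eqs. (4)-(8): LSTM over the RNN features; one hidden state per step,
# so a prediction can be emitted at every timestep (sequence-to-sequence).
hidden = layers.LSTM(N_LSTM, return_sequences=True)(feats)
# Projection, Eq. (9): per-step MLP with ReLU, then a 2-way softmax.
mlp = layers.TimeDistributed(layers.Dense(N_LSTM, activation="relu"))(hidden)
outputs = layers.TimeDistributed(layers.Dense(2, activation="softmax"))(mlp)

decnet = models.Model(inputs, outputs)
```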
B. Model Training

When training DecNet, we employ a sequence-to-sequence learning paradigm, similar to [15], [9], [16]. For the training process, we adopt a label smoothing technique [51] by replacing the binary labels of decision correctness, i.e., correct ($y_j = 1$) and wrong ($y_j = 0$), with soft labels, i.e., correct ($y_j = 0.9$) and wrong ($y_j = 0.1$), as these have been shown to be a particularly effective strategy to improve learning stabilization and model generalization. We assume a weighted cross-entropy loss across each input sequence $\mathbf{X}_j$:

$$\text{loss} = -\sum_{j=1}^{N} \sum_{t=1}^{N_\text{steps}} e^{(t - N_\text{steps})} \log p(y_j \mid \mathbf{x}_{1 \to t}), \tag{10}$$

where $\mathbf{x}_{1 \to t} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_t)$ identifies the sub-sequence of observations up to timestamp $t$.

Eq. (10) defines a weighted loss across the entire sequence and builds on the assumption that longer sequences contain extra information for correct inference. It may also be interpreted as a form of auxiliary loss similar to [52], whereby the network is encouraged to extract relevant features from early observations, increase the gradient signal that gets propagated back, and provide additional regularization.

The exponential weights in the loss, i.e., $e^{(t - N_\text{steps})}$, reduce the impact of earlier decisions on performance, as they are made when less context is available for error anticipation [9], and incentivize the role of later decisions. This loss has also been shown to have positive effects as a regularizer that prevents early overfitting [15].
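As an illustration, the label-smoothed, exponentially weighted sequence loss of Eq. (10) could be implemented along the following lines in TensorFlow/Keras; the per-step one-hot target shape and the numerical epsilon are our assumptions.

```python
# Sketch of the weighted cross-entropy of Eq. (10) with 0.9/0.1 label smoothing.
import tensorflow as tf

def weighted_sequence_loss(y_true, y_pred):
    """y_true: (batch, n_steps, 2) one-hot; y_pred: (batch, n_steps, 2) softmax."""
    n_steps = tf.shape(y_pred)[1]
    y_smooth = y_true * 0.8 + 0.1                 # label smoothing: 1 -> 0.9, 0 -> 0.1
    # Per-step cross-entropy against the smoothed targets.
    ce = -tf.reduce_sum(y_smooth * tf.math.log(y_pred + 1e-8), axis=-1)
    # Exponential weights e^(t - N_steps): largest (= 1) at the final step.
    t = tf.cast(tf.range(1, n_steps + 1), tf.float32)
    w = tf.exp(t - tf.cast(n_steps, tf.float32))
    # Mean over batch and steps, i.e., a scaled version of the double sum in Eq. (10).
    return tf.reduce_mean(ce * w)
```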
IV. EXPERIMENTS
In this section, we introduce the experimental protocol adopted, the simulated scenarios, the dataset collection procedure and the data pre-processing. During each recording session, we collected both physiological signals and behavioral data from the driver in two separate simulated scenarios, i.e., one where decision stimuli are provided via audio cues and one where they are visually presented. For physiological signals, we collect gaze and head pose, due to their unobtrusive nature and their known correlation with human cognitive states [15]. While driving, participants were given a series of instructions to follow to complete a cognitive secondary task, and their decisions were recorded as behavioral data.
A. Participants
Twenty participants (mean age 26.4, standard deviation 3.3) with normal or corrected-to-normal vision consented to participate in our experiments. Before beginning, each participant was introduced to the sensors and the experimental protocol. Participants were given a chance to do a test drive with the simulator. This allowed them to familiarize themselves
with the driving task and the simulated environment, and
to reduce learning factors on data collection. During each
participant’s drive, an observer monitored physiological signals
and behavioral data integrity. This study has been approved
by the Ministry of Defence Research Ethics Committee
(MoDREC).
B. Setup
We set up a realistic dual-tasking driver-in-the-loop virtual reality (VR) simulation for the experiment (see Fig. 3). The setup included a physical simulator, a VR headset, and a screen for monitoring purposes. The VR headset has integrated eye gaze and head pose tracking, and requires an infra-red camera mounted above the steering wheel to operate. We developed and designed the simulated driving environments using the Unreal Engine (https://www.unrealengine.com/). The use of a simulated environment allows complete control over both the environment, i.e., driving maneuvers and speed, and the tasks that participants experience during the experiment, i.e., the frequency and the number of decisions, in addition to providing a safe environment for the participants.
C. Experimental Protocol

During each drive, participants were instructed to jointly perform two tasks. The primary task was to drive along a straight highway and avoid stationary rectangular obstacles. The secondary task was an "n-back"-based task [26], which required participants to make a cognitive-based decision whenever the simulator prompted them to do so. Since cognitive load is inherently not a measurable metric, this task has often been used in the literature as a proxy to modulate different levels of cognitive load on the driver [23], [25], [24]. The paradigm builds on the core assumption that a participant's cognitive load while performing a task is strongly correlated with the working memory required to perform that task [26]. The task allows us to easily modulate different levels of cognitive load by increasing/decreasing the "n", and it has also been shown to be an effective tool to predict individual fluid intelligence and higher cognitive functions, especially when used to induce higher levels of load, such as 3-back [53].

Fig. 3. Driver-in-the-loop simulation. Top: The two simulated scenarios. For the audio stimuli scenario, obstacles are simple boxes, while in the visual stimuli case, numbers are displayed directly on the obstacles. Bottom: The participant wears a VR headset with integrated eye gaze tracker and head pose estimation. The screen displays the scene observed by the participant and sensor readings in real-time for monitoring during the trial.
We designed the simulator to prompt the secondary task numbers to the participants at regular intervals. Participants were instructed that each number corresponded to a specific category and that their task was to iteratively remember the category of the number they were presented three steps before. The numbers spanned from 1 to 12 and the corresponding categories were as follows: category $a$ corresponded to the set of numbers $\{1, 2, 3\}$, category $b$ corresponded to the set of numbers $\{4, 5, 6\}$, and so on. Participants were instructed that four buttons on the steering wheel were dedicated to the task, with each button corresponding to a different category.

To illustrate the experiment condition, let us consider a participant in the audio stimuli scenario, presented with the sequence of numbers 3, 5, 8, 7, 6, 9, 10. Given a 3-back task, the participant is not required to make any decision until prompted with the number 7. In fact, when provided with the number 7, the participant is storing three numbers in their memory, i.e., 3, 5, 8, and is therefore required to "make a decision" based on the number they were presented 3 steps before, i.e., 3, and press the button that corresponds to category $a$. From this moment on, every time the participant is presented with a new number, the participant is asked to decide/remember to which category the last number in their memory buffer corresponded.
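For illustration, the category mapping and the 3-back bookkeeping described above can be sketched in a few lines of Python; the helper names are ours, not the paper's.

```python
def category(n: int) -> str:
    """Map numbers 1-3 -> 'a', 4-6 -> 'b', 7-9 -> 'c', 10-12 -> 'd'."""
    return "abcd"[(n - 1) // 3]

def expected_answers(stimuli, n_back=3):
    """Yield (stimulus, expected category) pairs once the memory buffer is full."""
    for t, number in enumerate(stimuli):
        if t >= n_back:
            yield number, category(stimuli[t - n_back])

# The worked example from the text: the first decision is prompted by 7,
# and the correct answer is the category of 3, i.e., 'a'.
for stimulus, answer in expected_answers([3, 5, 8, 7, 6, 9, 10]):
    print(f"on stimulus {stimulus}: press category '{answer}'")
```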
Secondary tasks can have disruptive effects on primary task performance [54], [55]. To avoid this, when describing the secondary task, participants were informed that, although it was important for them to correctly perform the secondary task, their main focus must always be to safely perform the primary task, i.e., driving and avoiding obstacles. We enforced this by reminding the participants before each drive that driving safety was of utmost importance. Their compliance was reflected in the fact that not a single crash was recorded amongst all participants and all driving scenarios.

The experimental protocol required each participant to drive the simulator in two separate scenarios of 180s each. In each scenario, a different modality of stimuli for the secondary task was in place: one auditory and one visual. For the visual stimuli, numbers were displayed on the obstacles, as shown in Fig. 3 middle. In the scenario with the auditory stimuli, numbers were announced to the participants at every obstacle, as shown in Fig. 3 top. The two scenarios are designed to induce a constant level of cognitive load by prompting constant decisions from the participant. The 3-back task was also chosen so that the secondary task would be challenging and cognitively demanding to perform. In fact, since we build on the assumption that decision mistakes from drivers are indicative of cognitive overload occurrences, the secondary task needed to be complex enough to induce events of cognitive overload in the participants.

Collecting the data in these two scenarios opens the possibility of investigating two separate cognitively demanding cases: one where multiple modalities are stimulated and one where a single modality is engaged. We can also investigate whether a task engaging a single modality could lead to sensory overload, as numerous works have shown that concurrent tasks using the same modality lead to performance decreases [31], [32], and whether cognitive load and its effects on driver decisions can be distributed across different stimuli.
D. Simulated Environment
The two simulated environments, as shown in Fig. 3 top
and middle, both assumed a daylight scenario in good weather
conditions. This ensured that obstacles were clearly visible
to the participants and also limited the possible sources of
distractions during the experiment. We designed the road to
be 10m wide, enough to be divided into three 3.3m wide
virtual lanes, and the obstacles to be 3.5m wide in order to
entirely block a virtual lane and to ensure that drivers had to
steer to avoid them. Before each drive, obstacles are randomly
placed along one of three lanes and 100 meters apart. We
implemented a proportional–integral–derivative controller on
the simulator to have a consistent cruise speed of 120 km/h.
This helps ensure a consistent cognitive load throughout the
experiment, an equal number of decisions for all participants
and a consistent driving scenario. More specifically, participants
were prompted to make a decision approximately every 3
seconds and for a total of 55 decisions per drive in both
scenarios. We also designed the obstacles so that they did not exert any effect on the vehicle upon contact, while sending a notification to the monitoring researcher. This reduces the potential loss of focus on the primary and secondary tasks caused by a crash with an obstacle, while still tracking participant performance. All participants successfully complied with the instructions on safety, as not a single crash was recorded amongst all participants and all driving scenarios.

Fig. 4. Sequence generation. We frame the sequence of observations that precede a secondary decision as a "decision instance" of variable length $t_\text{instance}$. Within each decision instance, we extract inputs of length $t_\text{frame}$ with a sliding window approach with fixed overlap $t_\text{overlap}$. For sequential models, each element of the sequence was computed over windows of duration $t_\text{win}$.
Finally, to ensure that no memory effect could occur between
trials, numbers used for the secondary task were automatically
regenerated for each drive.
The location of the obstacles on the lanes is determined according to a custom-defined discrete distribution, which allows us to reduce the probability of cases where a virtual lane is free of obstacles for extended periods. Let $c_i$ be the distance between the current obstacle location and the previous obstacle location in lane $i$. We then define the obstacle placement probability distribution for lane $i$ as

$$p(i) = \frac{e^{c_i / \text{IntervalSize}}}{\sum_i e^{c_i / \text{IntervalSize}}}, \tag{11}$$

where IntervalSize represents the distance between two adjacent obstacles. For instance, consider a case where the $i$-th lane has not been blocked for the past 5 obstacles: the custom distribution ensures that the probability for that lane to be blocked is $e^5$ times higher than that of the most recently blocked lane.
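A minimal sketch of sampling from Eq. (11) follows; the per-lane distance bookkeeping is our assumption about one way to implement the rule.

```python
import math
import random

def place_next_obstacle(gaps, interval_size=100.0):
    """gaps[i]: distance since lane i was last blocked; returns the chosen lane."""
    # Numerator of Eq. (11); random.choices normalizes the weights internally.
    weights = [math.exp(c / interval_size) for c in gaps]
    lane = random.choices(range(len(gaps)), weights=weights)[0]
    # The chosen lane resets; every other lane grows by one interval.
    for i in range(len(gaps)):
        gaps[i] = 0.0 if i == lane else gaps[i] + interval_size
    return lane

gaps = [0.0, 200.0, 500.0]        # lane 2 has been free for five intervals
print(place_next_obstacle(gaps))  # lane 2 is e^5 times more likely than lane 0
```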
E. Dataset Collection

During each drive, we collect: 1) instantaneous two-dimensional gaze locations for the left and right eye at 60 Hz, 2) three-dimensional head pose at 60 Hz, and 3) driver decisions. At time $t$, the integrated eye-tracker in the VR headset provides two-dimensional vectors with the gaze position on the screens, as seen through the headset lenses, for both eyes as follows:

$$\mathbf{e}^l(t) = [e^l_x(t),\, e^l_y(t)], \quad \mathbf{e}^r(t) = [e^r_x(t),\, e^r_y(t)], \tag{12}$$

where superscripts $l$ and $r$ differentiate between the left and right eye, respectively, and subscripts specify the axis of the data. Vectors are normalized in the range $[-1, 1]$ along both the x-axis and y-axis, so that the center is $(0, 0)$, bottom-left is $(-1, -1)$ and top-right is $(1, 1)$. Head pose information is directly inferred from the position and rotation of the headset, with the position and rotation during calibration being considered as reference. The headset position is specified in Cartesian coordinates, and the rotation is described in Euler angles (roll-pitch-yaw notation):

$$\mathbf{h}(t) = [h_x(t),\, h_y(t),\, h_z(t),\, h_\phi(t),\, h_\theta(t),\, h_\psi(t)], \tag{13}$$

where subscripts $\phi$, $\theta$ and $\psi$ specify yaw, pitch and roll data, respectively. All the physiological data was collected in a vector $\hat{\mathbf{s}}(t)$ as follows:

$$\hat{\mathbf{s}}(t) = \left[ e^r_x,\, e^r_y,\, e^l_x,\, e^l_y,\, h_x,\, h_y,\, h_z,\, h_\phi,\, h_\theta,\, h_\psi \right], \tag{14}$$

where the time dependency of the single elements of $\hat{\mathbf{s}}(t)$ has been omitted to simplify notation.

The secondary task decisions were recorded in a vector $\mathbf{u}$:

$$\mathbf{u}(t_d) = [n_s,\, i_d,\, r_t], \tag{15}$$

where $t_d$ identifies the time at which the decision occurred, $n_s$ the number that was provided as stimulus, $i_d$ the category chosen by the participant, and $r_t$ the reaction time of the participant.
F. Pre-processing and Dataset Split

After data collection, each gaze pattern data sample was processed to provide the distance $\delta$ from the previous sample on both axes and the absolute distance from the center of the field of view ($d_\text{fov}$). This procedure ensures that the network does not learn to associate mistakes and correct decisions with specific gaze locations, but with the dynamics of the eye movements. For the head pose, we keep the absolute values of position and rotation, as head movements are characterized by slower shifts than eye gaze. This led to a sample for each time step with 11 raw features, as follows:

$$\mathbf{s}(t) = \left[ \delta^r_x,\, \delta^r_y,\, \delta^l_x,\, \delta^l_y,\, d_\text{fov},\, h_x,\, h_y,\, h_z,\, h_\phi,\, h_\theta,\, h_\psi \right]. \tag{16}$$
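A sketch of this per-sample preprocessing, assuming the raw vector layout of Eq. (14), is shown below; the text does not specify from which eye (or combination) $d_\text{fov}$ is computed, so using the right-eye gaze here is our assumption.

```python
import numpy as np

def preprocess(s_hat):
    """s_hat: (T, 10) array [e_rx, e_ry, e_lx, e_ly, h_x, ..., h_psi] -> (T-1, 11)."""
    gaze, head = s_hat[:, :4], s_hat[:, 4:]
    delta = np.diff(gaze, axis=0)   # per-axis gaze deltas, both eyes
    # Absolute distance of the gaze from the center (0, 0) of the field of view;
    # computed from the right-eye gaze here (assumption).
    d_fov = np.linalg.norm(gaze[1:, :2], axis=1, keepdims=True)
    # 4 deltas + d_fov + 6 absolute head-pose values = 11 raw features, Eq. (16).
    return np.hstack([delta, d_fov, head[1:]])
```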
We process the dataset for classification by splitting each participant's data into decision instances of variable length $t_\text{instance}$. Secondary task decision instances are bounded by the timestamp at which the driver made a decision, $t_d$, and the timestamp immediately after the previous decision, $t_{d-1}$, as shown in Fig. 4.
Dataset splitting into train, validation and test sets is only performed after processing the data into a sequence of decision instances. This procedure ensures that data from decisions in the train set and the validation/test sets are entirely separated and not correlated. In other words, we always perform training, validation and testing on separate decisions.
We pre-process the data within each decision instance $j$ with a sliding window approach. We extract the input sequences for our models from sequences of raw sensor data of length $t_\text{frame}$. The window of raw data is processed into an $N_\text{steps}$-long sequence $\mathbf{X}_j(t)$ of fixed-size feature vectors $\mathbf{x}_{j,t}$:

$$\mathbf{X}_j(t) = [\mathbf{x}_{j,t-N_\text{steps}}, \ldots, \mathbf{x}_{j,t}]. \tag{17}$$

We compute each feature vector $\mathbf{x}_{j,t}$ from a sliding window of length $t_\text{win}$, as follows:

$$\mathbf{x}_{j,t} = f\left[ \mathbf{s}(t - t_\text{win}), \ldots, \mathbf{s}(t) \right], \tag{18}$$
where the operator $f(\cdot)$ computes the mean, standard deviation, median, 25th and 75th percentiles, maximum, minimum and range of its argument. We chose this set of features to capture the central tendencies, variability, and extremes of each physiological signal. For non-sequential models, i.e., logistic regression and SVM, we directly compute the aforementioned statistical features over data windows of length $t_\text{frame}$.
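As an illustration, the operator $f(\cdot)$ can be sketched as follows; stacking the eight statistics over the 11 raw channels yields an 88-dimensional feature vector, which matches the input size assumed in the earlier DecNet sketch.

```python
import numpy as np

def f(window):
    """window: (n_samples, 11) raw features within one t_win -> (88,) vector."""
    stats = [
        window.mean(axis=0),
        window.std(axis=0),
        np.median(window, axis=0),
        np.percentile(window, 25, axis=0),
        np.percentile(window, 75, axis=0),
        window.max(axis=0),
        window.min(axis=0),
        window.max(axis=0) - window.min(axis=0),   # range
    ]
    return np.concatenate(stats)
```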
We normalize the features to have zero mean and unit variance, and we uniformly sample the input sequences of length $t_\text{frame}$ via a sliding window approach with overlap $t_\text{overlap} = \xi \cdot t_\text{frame}$. The parameter $\xi$ represents the overlap ratio, which is fixed to 95% for all the models considered in the paper.
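A sketch of this windowing procedure, reusing the $f(\cdot)$ helper above, could look as follows; the 60 Hz rate comes from the paper, while the $t_\text{win}$ value is our assumption (the paper varies $t_\text{frame}$ between 0.5s and 1.5s).

```python
import numpy as np

FS = 60                               # sensor sampling rate in Hz (from the paper)
T_FRAME, T_WIN, XI = 1.5, 0.2, 0.95   # frame/window lengths in s (t_win assumed)

def make_sequences(s):
    """s: (T, 11) preprocessed samples of one decision instance -> list of (N_steps, 88)."""
    frame, win = int(T_FRAME * FS), int(T_WIN * FS)
    stride = max(1, round((1 - XI) * frame))   # t_overlap = xi * t_frame
    sequences = []
    for start in range(0, len(s) - frame + 1, stride):
        chunk = s[start:start + frame]
        # N_steps feature vectors from consecutive t_win sub-windows, Eqs. (17)-(18).
        steps = [f(chunk[i:i + win]) for i in range(0, frame - win + 1, win)]
        sequences.append(np.stack(steps))
    return sequences
```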
As we frame the problem as supervised learning, we need to identify the binary labels for the cognitive overload events. Our main assumption is that mistakes on the secondary task are representative of cognitive overload events; therefore, we assign labels according to the following policy:

$$y_j = \begin{cases} 0 & \text{if } i_d \neq \text{category}(n_s) \\ 1 & \text{if } i_d = \text{category}(n_s), \end{cases} \tag{19}$$

where $\text{category}(\cdot)$ is the operator that extracts the category of the number that was given as stimulus. In other words, if the participant could not recall the correct category for the number stored in their memory, the data corresponding to that decision is assigned to an event of cognitive overload, i.e., $y_j = 0$. On the other hand, a correct decision corresponds to a level of cognitive workload that the participant could sustain.
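The labeling policy of Eq. (19) reduces to a one-line check; the sketch below reuses the hypothetical category() helper from the 3-back sketch in Section IV-C.

```python
def label(n_s: int, i_d: str) -> int:
    """n_s: stimulus number from 3 steps back; i_d: category chosen by the participant."""
    return 1 if i_d == category(n_s) else 0

assert label(3, "a") == 1   # correct recall
assert label(3, "b") == 0   # mistake, i.e., a cognitive overload event
```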
V. EXPERIMENTAL SETUP

In this section, we present the experimental setup we used to address the following research questions:
1) Do gaze patterns and head movements correlate with driver secondary task decision-making processes?
2) Can these correlations be exploited to anticipate the likelihood of making a mistake on the secondary task, i.e., a cognitive overload event?
3) How far in advance can we anticipate a cognitive overload event so that a closed-loop assistance system can intervene?
4) What is the impact of different stimuli on cognitive stimulation and decision-making of the driver?
To answer these questions, we evaluate the performance of DecNet on the collected dataset and compare it with various models.
A. Classification Scenarios
For safety-critical applications, we focus on the model's ability to anticipate the likelihood of future secondary task decision mistakes of the driver (wrong decision classification), as they relate to cognitive overload events which might lead to dangerous maneuvers. However, the model must also be able to robustly infer the likelihood of future correct decisions (correct decision classification): if the assistance system takes over too often, even when it would not have been necessary, it may cause discomfort and distrust in the driver.
Consequently, we evaluate DecNet performance on three separate classification scenarios: correct decision, wrong decision, and normalized decision classification. Correct and wrong decision classification scenarios focus on evaluating whether the model can anticipate correct or wrong decisions, respectively. Instead, the normalized classification scenario evaluates DecNet's ability to anticipate the general correctness of the next decision. In this scenario, we first compute performance metrics for both correct and wrong decisions. Then, their average is weighted according to the support, i.e., the number of true instances for each decision label.
B. Evaluation Setup

We evaluate classification performance in terms of precision, recall, and F1-score:

$$P = \frac{tp}{tp + fp}, \quad R = \frac{tp}{tp + fn}, \quad F_1 = 2 \cdot \frac{P \cdot R}{P + R}, \tag{20}$$

where $tp$, $fp$ and $fn$ identify true positives, false positives and false negatives, respectively.
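These metrics can be computed per class as sketched below; treating the wrong-decision class as the positive class mirrors the wrong decision classification scenario of Section V-A.

```python
def prf1(y_true, y_pred, positive=0):
    """Precision, recall and F1 of Eq. (20); positive=0 evaluates wrong decisions."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```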
While we focus on the classification performance of DecNet in terms of precision, recall, and $F_1$-score, it is important to stress that the output of the proposed model is the correctness likelihood of the next secondary/cognitive task decision. We map the likelihood value to a binary class, i.e., correct or wrong decision, via a classification threshold $\eta$. In other words, if the correctness likelihood is above the threshold $\eta$, we classify the next decision as correct, while if the value falls below it, we classify it as a wrong decision.
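A sketch of this binarization, together with the validation-based selection of $\eta$ used in Section VI (best validation $F_1$), follows; the threshold grid is our assumption, and the sketch reuses the prf1() helper above.

```python
import numpy as np

def classify(likelihoods, eta):
    """Map correctness likelihoods to classes: 1 (correct) if >= eta, else 0 (wrong)."""
    return [1 if p >= eta else 0 for p in likelihoods]

def best_threshold(y_val, p_val, positive=0):
    """Pick the eta that maximizes validation F1 for the chosen class."""
    etas = np.linspace(0.05, 0.95, 19)   # assumed search grid
    scores = [prf1(y_val, classify(p_val, eta), positive)[2] for eta in etas]
    return etas[int(np.argmax(scores))]
```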
Classification performance is computed in an offline test setting, where we use 80% of the data for training, 10% for validation, and the remaining 10% for testing. When comparing model performances, we report the mean and standard deviation of each metric for all algorithms using 5-fold cross-validation. All models were implemented in Python with Keras (https://keras.io/) and trained with the Adam optimizer [56]. For the training of the networks, we set the learning rate to 0.0001 and the total number of epochs to 100, and we performed early stopping on the validation set. Model training and testing were deployed on an Intel Core i7-6800K 3.40GHz CPU and an NVIDIA GeForce GTX 1080 8GB GPU.
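A sketch of this training configuration in Keras follows, reusing the decnet model and weighted_sequence_loss from the earlier sketches; the early-stopping patience and the placeholder data shapes are our assumptions.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# Placeholder tensors standing in for the real pre-processed sequences/labels.
x_train, y_train = np.zeros((512, 30, 88)), np.zeros((512, 30, 2))
x_val, y_val = np.zeros((64, 30, 88)), np.zeros((64, 30, 2))

decnet.compile(optimizer=Adam(learning_rate=1e-4), loss=weighted_sequence_loss)
decnet.fit(x_train, y_train,
           validation_data=(x_val, y_val),
           epochs=100,
           callbacks=[EarlyStopping(monitor="val_loss", patience=10,  # patience assumed
                                    restore_best_weights=True)])
```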
VI. RESULTS

In this section, we compare DecNet's performance with our previous sequence-to-sequence model (Seq2Seq; [16]) and with two sequential neural network baselines, i.e., a standard recurrent neural network (RNN) and a long short-term memory network (LSTM). In addition to these, we also compare against non-sequential baselines, i.e., Logistic Regression (LogReg) and Support Vector Machines (SVMs). The above baseline models are in line with the cognitive load classification literature [15], [25], [57], [2], which has focused on sequential modeling, e.g., LSTMs and RNNs [15] and Hidden Markov Models (HMMs) [25], as well as Logistic Regression and SVMs [57] and Naive Bayes classifiers [2].
TABLE I
AUDIO TASK: CLASSIFICATION PERFORMANCE.

Method    tframe = 0.5s   tframe = 1s    tframe = 1.5s
          F1-Score        F1-Score       F1-Score

Normalized Decision Classification
LogReg    0.60±0.02    0.60±0.03    0.62±0.02
SVM       0.70±0.02    0.69±0.03    0.68±0.02
RNN       0.75±0.03    0.70±0.02    0.69±0.03
LSTM      0.73±0.03    0.71±0.03    0.70±0.04
Seq2Seq   0.74±0.01    0.75±0.02    0.73±0.03
DecNet    0.75±0.01    0.75±0.02    0.78±0.01

Wrong Decision Classification
LogReg    0.50±0.03    0.50±0.03    0.51±0.03
SVM       0.59±0.03    0.57±0.03    0.54±0.03
RNN       0.60±0.03    0.61±0.03    0.60±0.03
LSTM      0.62±0.02    0.62±0.02    0.61±0.04
Seq2Seq   0.64±0.02    0.64±0.02    0.63±0.03
DecNet    0.65±0.01    0.67±0.02    0.68±0.02

Correct Decision Classification
LogReg    0.64±0.02    0.64±0.03    0.68±0.03
SVM       0.75±0.02    0.75±0.02    0.75±0.02
RNN       0.75±0.04    0.74±0.02    0.73±0.02
LSTM      0.77±0.04    0.75±0.03    0.75±0.03
Seq2Seq   0.83±0.02    0.83±0.02    0.77±0.02
DecNet    0.84±0.01    0.82±0.01    0.84±0.01
A. Classification Performance

In this section, we list the classification performance as a function of the length of the input $t_\text{frame}$ used for training and testing, for both the audio and the visual stimuli experiments. In Table I and Table II, we report $F_1$-scores under the three classification settings, i.e., normalized classification, correct and wrong decision discovery, for the audio and the visual stimuli experiment, respectively. For each classification scenario, we compute the classification threshold $\eta$ according to the best $F_1$ performance achieved on the validation set.

DecNet outperforms all other models on all classification tasks. The performance gap with other models becomes more accentuated when longer inputs are provided to the models. At frame length $t_\text{frame} = 1.5$s for the audio stimuli experiment, DecNet shows an overall $F_1$-score performance improvement of 8%, more specifically of 0.05, 0.05, and 0.07 for the normalized, wrong and correct decision classification, respectively. A similar behavior appears on the visual task in the wrong decision classification scenario, where DecNet shows a 5% performance increase over the second best performing model, Seq2Seq. In general, we can see that sequence models better capture the secondary task decision-making process than simpler non-sequential models, with DecNet providing a performance increase especially in the wrong decision classification scenario.
In Table III, we evaluate $F_1$-scores for DecNet with $t_\text{frame} = 1.5$s for the three classification scenarios as a function of the physiological signals used for training and testing. Here, $F_1$ Gaze identifies the performance of DecNet when only gaze information is used, $F_1$ Head when only head pose information is used, and finally $F_1$ Comb lists the performance when both gaze and head pose information are combined.

TABLE II
VISUAL TASK: CLASSIFICATION PERFORMANCE.

Method    tframe = 0.5s   tframe = 1s    tframe = 1.5s
          F1-Score        F1-Score       F1-Score

Normalized Decision Classification
LogReg    0.56±0.02    0.56±0.03    0.56±0.01
SVM       0.61±0.02    0.61±0.02    0.61±0.02
RNN       0.64±0.02    0.63±0.02    0.62±0.04
LSTM      0.64±0.04    0.65±0.03    0.63±0.01
Seq2Seq   0.65±0.02    0.64±0.02    0.65±0.01
DecNet    0.66±0.02    0.66±0.02    0.66±0.03

Wrong Decision Classification
LogReg    0.49±0.03    0.47±0.04    0.47±0.02
SVM       0.53±0.01    0.53±0.03    0.53±0.03
RNN       0.61±0.02    0.61±0.02    0.60±0.03
LSTM      0.61±0.02    0.61±0.01    0.60±0.02
Seq2Seq   0.62±0.01    0.61±0.01    0.60±0.05
DecNet    0.61±0.01    0.62±0.01    0.63±0.02

Correct Decision Classification
LogReg    0.61±0.02    0.62±0.02    0.62±0.02
SVM       0.66±0.03    0.66±0.02    0.67±0.02
RNN       0.75±0.01    0.75±0.01    0.75±0.01
LSTM      0.75±0.01    0.75±0.01    0.76±0.01
Seq2Seq   0.76±0.02    0.75±0.01    0.75±0.01
DecNet    0.76±0.03    0.76±0.01    0.75±0.01

TABLE III
DECNET: CLASSIFICATION PERFORMANCE AGAINST FEATURES.

Scenario          F1 Gaze      F1 Head      F1 Comb
Audio Norm.       0.67±0.01    0.75±0.01    0.78±0.01
Audio Wrong       0.54±0.02    0.61±0.02    0.68±0.02
Audio Correct     0.81±0.01    0.81±0.02    0.84±0.01
Visual Norm.      0.63±0.01    0.63±0.01    0.66±0.03
Visual Wrong      0.59±0.02    0.56±0.03    0.63±0.02
Visual Correct    0.75±0.01    0.75±0.02    0.75±0.01

As we can see, both streams of information contribute to the performance of DecNet Comb, with head pose data achieving better performance when considered alone. The performance of DecNet Gaze on the audio task, however, is not surprising, as it is comparable to previous studies on knowledgeability anticipation from gaze information alone [37]. On the visual task, instead, we can see that gaze data is more relevant to the final $F_1$-score performance of DecNet, as it identifies the main resource used by the participants to capture the information required to complete the task.
In Fig. 5, we collect the precision-recall curves for DecNet with $t_\text{frame} = 1.5$s for the audio-stimuli and visual-stimuli tasks. The plots show the precision/recall performance curves for both correct and wrong decision classification as a function of the decision threshold in the audio task scenario, Fig. 5a, and the visual task scenario, Fig. 5b. We can see that in both scenarios and for both classes the precision-recall curves are well above the random baseline, despite the complexity of the task. The plots indicate that DecNet can effectively separate the two classes of correct, i.e., high load, and wrong decisions, i.e., cognitive overload, on the secondary task, as performance as a function of the classification threshold $\eta$ is consistent.

(a) Audio Stimuli Experiment:
Scenario               P     R     F1-Score
Norm. Dec. Class.      0.81  0.77  0.79
Wrong Dec. Class.      0.58  0.83  0.68
Correct Dec. Class.    0.91  0.74  0.82

(b) Visual Stimuli Experiment:
Scenario               P     R     F1-Score
Norm. Dec. Class.      0.67  0.65  0.66
Wrong Dec. Class.      0.55  0.66  0.60
Correct Dec. Class.    0.75  0.65  0.70

Fig. 5. Precision-recall curves for the wrong decision (solid orange line) and correct decision (dotted green line) classes. The left plot represents the audio task, while the right plot shows the visual task. All classification curves are clearly above the random baseline. Note that the random baselines differ depending on the task and decision class, as the decision classes are (slightly) imbalanced in our dataset. Shaded areas represent the standard deviation across 10 runs. Green and orange dots depict the best decision threshold in terms of $F_1$-score for correct and wrong decisions respectively, while the blue dots depict the best performance when balancing correct and wrong decisions. The table below each plot shows the performance in terms of precision, recall and $F_1$-score for the three classification scenarios when using the normalized classification threshold. All thresholds have been selected based on the validation set.

Fig. 6. Performance as a function of the time available to anticipate the correctness of an incoming decision, for (a) the audio stimuli experiment and (b) the visual stimuli experiment. The end ratio indicates how early DecNet is required to anticipate the next decision. For instance, assuming a decision instance of $t_\text{instance} = 3.5$s, an input of $t_\text{frame} = 1$s and an end ratio of $ER = 0.8$, DecNet would produce a prediction $ER \cdot (t_\text{instance} - t_\text{frame}) = 2$s before such a decision takes place.
B. Decision Correctness Anticipation Performance
The main goal of DecNet is to provide an actionable metric
for assistance systems to be able to intervene if the occurrence
of a mistake is detected. In this section, we evaluate how
well and how far in advance DecNet is able to anticipate the
correctness of an imminent decision. Without loss of generality,
we assume that the need for a decision has already been detected
by the assistive system, as incoming decision detection is
beyond this paper’s scope.
In Fig. 6, we show the $F_1$-score performance of DecNet as a function of the time available to the classifier before providing the correctness likelihood of the next decision. We assume a $t_\text{frame} = 1$s input. Performance for correctness anticipation on the audio task is fairly stable for all the scenarios considered. However, it is interesting to notice that DecNet appears to be better able to identify incoming mistakes in the time frame between 2s and 3s before the decision. This could suggest that the features of an incoming mistake from the driver are more robust and relevant during the moments that precede a decision. On the other hand, the features that predict a correct incoming decision might be more prominent after the model has had more time to observe the driver since the past decision, i.e., when the driver has had time to switch their attention from the past decision to the next one.
The final goal of DecNet resides in its ability to be implemented and operate in real-time to provide timely assistance to the user. To evaluate this, we have computed the inference time, given an input sequence of 1.5s. The total inference time of DecNet is 8ms, which corresponds to an inference rate of 125Hz. Since the data from the eye and head pose tracker is captured at 60Hz, we conclude that the proposed DecNet is capable of operating in real-time.

Fig. 7. Participant perceived difficulty of the task. Participants were asked to rate the perceived complexity of each of the experiments with a value between 0, i.e., lowest complexity, and 8, i.e., highest complexity.
C. Effect of Stimuli Modality on Performance

The performance in Tables I and II and Fig. 5 indicates that DecNet can better anticipate the correctness of incoming decisions when participants were provided audio stimuli, in comparison to decisions prompted by visual cues. To investigate this, we asked each participant to rate the secondary task's perceived complexity with a value ranging from 0 to 8, with 0 identifying a low-demand task and 8 a highly demanding one.

The responses of the participants are collected in Fig. 7, where we can see that 85% of the participants considered the visual task to be of higher complexity than the audio stimuli task. This shows that DecNet's performance on decision anticipation and the human perceived complexity match in both experimental scenarios, and suggests that single modality stimulation exerts higher levels of cognitive load on drivers, directly affecting their ability to make correct decisions. Overall, participants agreed that the visual scenario is more demanding than the audio one. In fact, in the audio task there are no confounding variables, as numbers are announced every time an obstacle is passed, while in the visual task, participants had to read the numbers displayed on the obstacles. However, there was no evident difference in their driving performance, as no crashes were recorded. It is also interesting to notice that drivers assigned different levels of complexity to each scenario, showing that drivers' perception of cognitive load can differ even on the same task and suggesting that they divide their resources between the two tasks using different strategies. This highlights the ambiguity of a cognitive load-based metric, and further stresses the benefits of a metric based on secondary task decision mistakes, which inherently occur when drivers are overloaded.

Our results confirm the findings of numerous studies in the multitasking theory literature, such as [31], [32]. The multitasking model in [31] assumes that auditory and visual perception use different resources; therefore, if two joint tasks use different modalities, their performance is expected to improve, and to worsen if they require the same modality. Similarly, the working memory theory of [32] showed that participants experience significant performance disruption when two or more concurrent tasks operate on the same visual modality, which is consistent with our results in the visual stimuli scenario.
VII. CONCLUSIONS
In this paper, we introduced DecNet, an end-to-end multi-stage recurrent deep model that anticipates the correctness of an imminent decision from a driver as a proxy for cognitive overload instances. We collected a dataset from a cohort of participants in two separate decision-making scenarios: one where decision stimuli are presented visually and one where they are presented auditorily. We investigated the ability of the proposed model to anticipate the secondary task decisions in both scenarios from non-obtrusive physiological signals only, namely eye gaze and head pose.
Our results showed that DecNet performs well in the task of decision correctness anticipation, achieving 81% precision and 77% recall on the auditory stimuli task, and 67% precision and 65% recall on the visual stimuli task. The proposed model outperforms comparable models in all the scenarios considered. We tested the real-time capabilities of DecNet and showed that the proposed model can reliably infer the correctness likelihood of a decision up to 2s before such a decision takes place.
We have also investigated the effects that different stimuli modalities have on cognitive overload events in the driver, i.e., on their decision accuracy on the secondary task, and therefore more generally on their level of cognitive load. Our analyses indicate that when a single modality is overloaded, as in the visual stimuli task, both drivers and DecNet tend to be less reliable performance-wise. This suggests that, in case of take-over, it would be preferable to use a different modality than the one currently used to perform the task. While DecNet is capable of running online in real-time, given its inference rate of 125Hz, all reported classification performance refers to offline testing. Therefore, it would be interesting to investigate how DecNet performs in a closed-loop setting, where a human driver is interacting with the simulated environment.
Our study has shown that unobtrusive physiological signals are strongly correlated with cognitive overload events in the driver and that DecNet can exploit such correlations to anticipate these events. The proposed model shows solid and reliable performance; however, it represents an initial step towards real-time cognitive overload estimation, and it would be interesting to investigate the application of more advanced machine learning models to this problem. Also, the proposed study focused on simulated scenarios without external distractions, and it would therefore be worth exploring how well our methods generalize to more complex scenarios, such as real-world driving. In our study, we induced a high
such as real-world driving. In our study, we induced a high
level of cognitive load in the driver by means of a “3-back”
task, which is inherently artificial. Although common in driver
AMADORI et al.: PREDICTING SECONDARY TASK PERFORMANCE: A DIRECTLY ACTIONABLE METRIC FOR COGNITIVE OVERLOAD DETECTION 11
cognitive states studies due to their ease of implementation,
these tasks still represent a proxy for real-life driving tasks and
may represent a limitation of the applicability of the proposed
methods to natural driving scenarios.
Pierluigi Vito Amadori (S’14, M’17) received the M.Sc. degree (Hons.) in Telecommunications Engineering from the University of Rome La Sapienza, Rome, Italy, in 2013, and the Ph.D. degree in Electronic Engineering from the Department of Electrical & Electronic Engineering, University College London, London, U.K., in 2017. He currently holds a position as a Postdoctoral Research Associate at the Personal Robotics Laboratory, Imperial College London, London, U.K. His main research interests include driver monitoring, user modeling and driving assistance systems.
Tobias Fischer (M’16) received the B.Sc. degree from Ilmenau University of Technology, Germany, in 2013, the M.Sc. degree in Artificial Intelligence from the University of Edinburgh, U.K., in 2014, and the Ph.D. degree from the Personal Robotics Lab, Imperial College London, London, U.K., in 2018. His research interests include both computer vision and human vision, visual attention and computational cognition. He is interested in applying this knowledge to cognitive robotics. Dr. Fischer was a recipient of the Queen Mary Award for the Best U.K. Robotics PhD Thesis in 2018 and the Eryl Cadwaladr Davies prize for the best departmental thesis in 2017-2018.
Ruohan Wang (S’16) received the B.Sc. degree (Hons.) from the National University of Singapore, Singapore, in 2012, and the Ph.D. degree from the Personal Robotics Lab, Imperial College London, London, U.K., in 2021. He is currently a research scientist with the Institute for Infocomm Research, A*STAR, Singapore. His main research interests include machine learning and its application in assistive robotics. Ruohan was a recipient of the National Science Scholarship, Singapore.
Yiannis Demiris (SM’03) received the B.Sc. (Hons.) degree in artificial intelligence and computer science and the Ph.D. degree in intelligent robotics from the Department of Artificial Intelligence, University of Edinburgh, Edinburgh, U.K., in 1994 and 1999, respectively. He is a Professor with the Department of Electrical and Electronic Engineering, Imperial College London, London, U.K., where he is the Royal Academy of Engineering Chair in Emerging Technologies and the Head of the Personal Robotics Laboratory. His current research interests include human-robot interaction, machine learning, user modeling, and assistive robotics. He has published more than 200 journal and peer-reviewed conference papers in the above areas. Prof. Demiris was a recipient of the Rector’s Award for Teaching Excellence in 2012 and the FoE Award for Excellence in Engineering Education in 2012. He is a Fellow of the IET, BCS, and the Royal Statistical Society.