Analysing the impact of Machine Learning to
model subjective Mental Workload: a case study
in third-level education
Karim Moustafa, Luca Longo
1School of Computer Science, Technological University Dublin,
Dublin, Republic of Ireland
2ADAPT, the global centre of excellence for digital content technology,
Dublin, Republic of Ireland
*luca.longo@dit.ie
Abstract. Mental workload measurement is a complex multidisciplinary
research area that includes both the theoretical and practical develop-
ment of models. These models are aimed at aggregating those factors,
believed to shape mental workload, and their interaction, for the purpose
of human performance prediction. In the literature, models are mainly
theory-driven: their distinct development has been influenced by the be-
liefs and intuitions of individual scholars in the disciplines of Psychology
and Human Factors. This work presents a novel research that aims at
reversing this tendency. Specifically, it employs a selection of learning
techniques, borrowed from machine learning, to induce models of men-
tal workload from data, with no theoretical assumption or hypothesis.
These models are subsequently compared against two well-known subjec-
tive measures of mental workload, namely the NASA Task Load Index
and the Workload Profile. Findings show how these data-driven mod-
els are convergently valid and can explain overall perception of mental
workload with a lower error.
1 Introduction
Assessing human mental workload is fundamental in the disciplines of Human-
Computer Interaction and Ergonomics [13,53]. Through mental workload, human
performance can be predicted and used for designing interactive technologies and
systems aligned with the limits of human mental capabilities [26].
However, despite its theoretical utility, and after decades of research, it is still an
umbrella construct [21,12,30]. In the last 50 years, researchers and scholars have
devoted their effort to the design and development of models of mental workload
that can act as a proxy for assessing human performance [15,9,47,35]. Mental
Workload (MWL) is a complex psychological construct, believed to be multi-
dimensional and composed of several factors. Various approaches have been de-
veloped to measure and to aggregate these factors into an overall index of mental
workload [50,28,22]. The vast majority of these are theory-driven, which means
that they utilise theoretical hypotheses and beliefs for assessing MWL deductively.
Also, even if theoretically sound, these models are rather ad-hoc and they
mainly adopt basic operators for aggregating factors together, with the implicit
assumption of their linearity and often additivity. However, it is argued that
MWL is far from being a linear phenomenon and the application of non-linear
computational approaches can advance its modelling. Additionally, instead of
using theoretical knowledge, it is argued that data-driven approaches are likely
to offer a significant improvement in the development of models of mental
workload [56]. In particular, Machine Learning (ML) is one of these approaches that
has been recently considered in MWL modelling. For example, researchers have
started applying ML techniques using physiological or task performance mea-
sures [51,55]. Other studies employing ML have shown promising results as in
[40,49,38].
This research study aims at investigating the impact of supervised modelling
techniques, borrowed from machine learning, in the creation of models
of MWL by employing subjective self-reporting features from humans. In detail,
this study compares traditional subjective models of MWL, namely the NASA
Task Load Index (NASA-TLX) [14] and the Workload Profile (WP) [50], against
data-driven models produced by a number of ML techniques. Concisely, this pa-
per attempts to answer the research question: Can machine learning techniques
help build data-driven models of mental workload that have a better face validity
than the Nasa Task Load Index and the Workload Profile?
The rest of this paper is organised as follows. Section 2 describes related work
in the field of MWL measurement, with an emphasis on subjective approaches.
It then discusses the gaps in the literature that motivate the need for non-linear
modelling methods for mental workload. Section 3 introduces the design of a
comparative study and it describes the research methodology adopted for build-
ing data-driven models of mental workload. Section 4 presents the findings and
critically evaluates them with a rigorous comparison against the selected MWL
baseline instruments, namely the NASA-TLX and the Workload Profile. This
comparison is performed by computing the convergent and face validity of the
induced MWL models from data. Finally, Section 5 concludes the paper by high-
lighting its contribution and suggesting future work.
2 Related Work
The importance of measuring MWL has arisen from the crucial need of predict-
ing human performance [23,25,26]. In turn, human performance plays a central
role in the design of interactive technologies, interfaces as well as educational
and instructional material [31,29,23,36,24,37,27]. Measuring mental workload is
not a trivial task [48]. Various measures exist, with different advantages and
disadvantages, and they can be clustered in three main classes:
subjective measures - this class refers to the subjective perception of the
operator who is executing a specific task or interacting with an underlying
system. Subjective measures, also referred to as self-reporting measures, rely
on a direct estimation of individual differences such as emotional state, level
of stress, the effort devoted to the task and its demand. The perception of
users usually can be gathered by means of surveys or questionnaires in the
post-task phase [13]. This category includes measures such as the NASA Task
Load Index (NASA-TLX) [14], the Workload Profile (WP) [50] based on the
Multiple Resource Theory [52], and the Subjective Workload Assessment
Technique (SWAT) [42];
task performance measures - this category includes primary and secondary
task measures. These measures focus on quantifying the objective perfor-
mance of humans in relation to a specific task under execution. Example
include the number of errors, the time needed and the resources used to
accomplish a task or the reaction time to a secondary task [34,?];
physiological measures - this class relies on the analysis of the physiological
responses of a human executing a task. Examples include the heart rate,
EEG brain signals, eye movements and skin conductivity [4,35].
Self-reporting subjective measures are based upon the assumption that only
the human involved with a task can provide accurate and precise judgements
about the experienced mental workload. They are often employed post-task and
are easy to administer. For these reasons, they are appealing to many
practitioners and are the focus of this paper. However, they contribute to an
overall description of the mental workload experienced on a task with no infor-
mation about its temporal variation. The category of task performance measures
is based upon the belief that the mental workload experienced by an individual
becomes relevant only if it impacts system performance. Primary task measures
are strongly connected to the concept of performance since they provide objective
and quantifiable measures of error or human success. Secondary task measures
can be gathered during task execution and are more sensitive to mental workload
variation. However, they might influence the execution of the primary task and
in turn influence mental workload. The class of physiological measures considers
responses of the body gathered from the individual interacting with an under-
lying task/system. The assumption is that they are highly correlated to mental
workload. Their utility lies in the interpretation and analysis of psychological
processes and their effect on the state of the body over time, without demand-
ing an explicit response by the human. However, they require specific equipment
and trained operators, minimising their employability in real-world tasks.
2.1 Subjective Measurements Methods
Two out of the several subjective measures of mental workload developed in the
last decades are the NASA Task Load Index (NASA-TLX) [14] and the Workload
Profile (WP) [50]. Since these have been selected as baselines in this research
study, their detailed description follows. NASA-TLX is a mental workload assess-
ment tool developed by the National Aeronautics and Space Administration
agency. It was originally conceived to assess the mental workload of pilots dur-
ing aviation tasks. Subsequently, it was adopted in other fields and used as a
benchmark in many research studies as for instance in [46,43,44,45,27]. The orig-
inal questionnaire behind this instrument can be found in [14]. The NASA-TLX
scale is built upon six dimensions and an additional pair-wise comparison among
these dimensions. This comparison is used to give weights to the six dimensions
as shown in equation 1.
\[
\text{NASA-TLX}_{MWL} = \left( \sum_{i=1}^{6} d_i \times w_i \right) \frac{1}{15} \qquad (1)
\]
The Workload Profile (WP) is based on the Multiple Resource Theory (MRT)
that was introduced by prof. Wickens [52]. The WP index is derived from eight
dimensions: perceptual/central processing, response processing, spatial process-
ing, verbal processing, visual processing, auditory processing, manual responses,
and speech responses. In WP, the operator is asked to report the proportion of
attentional resources elicited during task execution. The final mental workload
score is a sum of the eight factors, as shown in equation 2.
\[
WP_{MWL} = \sum_{i=1}^{8} d_i \qquad (2)
\]
For detailed information about the scales used by the two mental workload
instruments described above, the reader is referred to [22].
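As an illustration (not part of the original instruments), equations 1 and 2 can be sketched in Python, assuming the six NASA-TLX ratings d_i with their pairwise-comparison weights w_i, and the eight WP ratings, are already available as numbers; the example answers below are invented:

```python
def nasa_tlx_score(ratings, weights):
    """NASA-TLX index (eq. 1): weighted mean of the six subscale ratings.

    ratings: the six dimension ratings d_i.
    weights: w_i, how many of the 15 pairwise comparisons each
             dimension won (the weights sum to 15).
    """
    assert len(ratings) == len(weights) == 6 and sum(weights) == 15
    return sum(d * w for d, w in zip(ratings, weights)) / 15


def wp_score(ratings):
    """Workload Profile index (eq. 2): plain sum of the eight ratings."""
    assert len(ratings) == 8
    return sum(ratings)


# Illustrative (made-up) answers:
tlx = nasa_tlx_score([10, 6, 9, 8, 10, 7], [3, 2, 3, 2, 3, 2])  # -> 8.6
wp = wp_score([11, 10, 9, 12, 12, 13, 9, 9])                    # -> 85
```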
2.2 Machine Learning and data-driven methods for mental
workload modeling
Machine learning (ML) is a subfield of Artificial Intelligence that focuses on
creating models from data. It can be seen as a method of data analysis for
automated analytical model building. It focuses on automatic procedures that
can learn from data and identify patterns with minimal human intervention.
ML can be supervised, unsupervised or semi-supervised. On one hand, super-
vised ML aims to build mathematical models from a set of data that contains
both the inputs and the desired output (supervisory data). On the other hand,
unsupervised ML takes only input data and it is aimed at finding structures, pat-
terns, and groups or clusters in it. Semi-supervised ML employs both the above
learning mechanisms and it occurs when not all the inputs have an associated
output. A number of research studies have employed ML for mental workload
modeling. For example, [41,16] analysed physiological brain signals, gathered
by functional Near-Infrared Spectroscopy (fNIRS), with unsupervised ML. [49]
and [40] employed supervised ML respectively using speech data and linguis-
tic/keyboard dynamics of the operators to predict her/his mental workload. [8]
and [32] adopted supervised ML for mental workload assessment using features
extracted from eye movements. Similarly, supervised ML was used to predict lev-
els of cognitive load in driving tasks employing physiological eye movements and
primary task measures such as braking, acceleration, and steering angles [55]. Re-
cently, the multi-modal approach of combining multiple physiological measures
for mental workload assessment has emerged, demonstrating an enhancement over
using individual techniques separately [1,20]. Supervised ML has also been em-
ployed with subjective self-reporting data [38] and compared against well-known
self-reporting measures.
3 Design And Methodology
In order to tackle the research question formalised in section 1, a comparative
research study was designed to evaluate the accuracy of data-driven models,
built with supervised machine learning, against two subjective baseline models of
mental workload, namely the NASA-TLX and the WP, as shown in figure 1. Two
criteria for evaluating MWL models have been selected, in line with other studies
in the literature [43,22]: convergent [5] and face validity [39]. The definitions of
these two forms of validity adopted here are shown in Table 1. Existing data has
been used and the CRISP-DM methodology (Cross-Industry Standard Process
for Data Mining) has been followed for constructing MWL models [7].
Fig. 1: The design of a comparative study aimed at comparing data-driven
models of mental workload, built with supervised machine learning, against
two subjective baseline models.
Table 1: Criteria for comparing mental workload models

Convergent Validity: it aims to determine whether different MWL assessment
measures are theoretically related. Statistical tool: correlation coefficient of
the MWL scores produced by baseline models vs ML models.

Face Validity: it aims to determine the extent to which a measure can actually
grasp the construct of MWL. Statistical tool: error of a MWL model in predicting
a self-reported perception of MWL.
3.1 Dataset, context and participants
The dataset selected for this research study has been formed in an educational
context. More specifically, recruited participants were students who attended
classes of the Research Design and Proposal Writing module, in a master course
in the School of Computing, at Dublin Institute of Technology. Four different
topics (‘Science’, ‘The Scientific Method’, ‘Planning Research’, ‘Literature
Review’) were repeatedly delivered in four consecutive semesters, from 2015 to
2017. These topics were delivered adopting three different instructional formats:
1. The first format focused on the transmission of information with a traditional
direct-instruction method – from lecturer to students – by projecting slides
on a whiteboard and describing them verbally.
2. The second format included the delivery of the same content, as developed
using the first format, as multimedia videos, pre-recorded by the same lec-
turer. Videos were built by employing the principles of the Cognitive Theory
of Multimedia Learning [33]. Further details can be found in [27];
3. The third format included a collaborative activity conducted after the deliv-
ery of the video, as developed in the second format. The goal of this activity
was to improve the social construction of the information through dialogue
among students divided in groups.
The number of classes, their length and the number of students are sum-
marised in Table 2. Students were of 16 nationalities (19-54 years; mean=31.7,
std=7.5). For each class, students were randomly split into two groups, which
respectively received the questionnaires associated with the NASA-TLX and the WP.
In addition to this, students were asked to answer an additional question on
overall perception of MWL, hereinafter referred to as the Overall Perception of
Mental Workload (OP-MWL), on a discrete scale from 0 to 20 (figure 2). Those
students who agreed to participate in the experiment received a consent form,
approved by the ethics committee of the Dublin Institute of Technology, and a
study information sheet. These forms describe the theoretical framework of the
study, the confidentiality of the data, and the anonymisation of their personal
information. Thus, two sub-datasets were formed, one containing the answers
to the NASA-TLX questionnaire and one containing the answers to the WP
questionnaire, respectively with 145 and 139 samples.
Table 2: Number of classes for each format, number of students in each class
and their length in minutes
Lecture Format 1 Format 2 Format 3
classes students mins classes students mins classes students mins
Science 2 14,17 62,60 1 26 18 1 16 60
Scientific Method 1 23 46 2 18,18 28,28 1 18 50
Research Planning 1 20 54 2 22,22 10,10 1 9 79
Literature Review 1 21 55 1 24 19 1 16 77
Fig. 2: Scale of the question for measuring the overall perception of mental
workload (OP-MWL).
3.2 Machine learning for training mental workload models
Supervised machine learning was employed to train models of mental workload
from collected data. The dependent feature is the overall perception of mental
workload provided by students (OP-MWL) while the independent features are
the questions of the NASA-TLX and the WP instruments.
Data Understanding - Three sets of independent features were formed, as
described in the summary table 3. This helped understand the nature of the data
and it allowed the investigation of its characteristics, such as the type of features,
their values and ranges. The table also shows the normality of the distributions
of each feature and its skewness. Figure 3 depicts the distribution of the target
variable (the overall perception of mental workload OP-MWL).
Fig. 3: Distribution of the target variable: the overall perception of mental
workload provided by students (OP-MWL).
Feature Type n Mean SD Median Min Max Range Skew Kurtosis SE
Feature set 1: questions of the NASA-TLX
Mental R 145 10.04 3.42 10 1 20 19 -0.04 -0.34 0.28
Physical R 145 6.31 4.19 6 1 20 19 0.63 -0.22 0.35
Temporal R 145 9.22 3.41 10 1 20 19 -0.01 0.16 0.28
Performance R 145 8.72 3.73 9 2 17 15 0.17 -0.92 0.31
Frustration R 145 7.55 3.93 7 1 19 18 0.43 -0.57 0.33
Effort R 145 9.89 4.02 10 1 20 19 0.13 -0.18 0.33
Feature set 2: pairwise comparisons of the NASA-TLX
Temporal vs frustration C 145 1.19 0.4 1 1 2 1 1.54 0.37 0.03
Performance vs mental C 145 1.48 0.5 1 1 2 1 0.07 -2.01 0.04
Mental vs physical C 145 1.09 0.29 1 1 2 1 2.84 6.13 0.02
Frustration vs performance C 145 1.81 0.4 2 1 2 1 -1.54 0.37 0.03
Temporal vs effort C 145 1.62 0.49 2 1 2 1 -0.49 -1.77 0.04
Physical vs frustration C 145 1.45 0.5 1 1 2 1 0.21 -1.97 0.04
Performance vs temporal C 145 1.41 0.49 1 1 2 1 0.38 -1.87 0.04
Mental vs effort C 145 1.38 0.49 1 1 2 1 0.49 -1.77 0.04
Physical vs temporal C 145 1.8 0.4 2 1 2 1 -1.48 0.21 0.03
Frustration vs effort C 145 1.79 0.41 2 1 2 1 -1.43 0.05 0.03
Physical vs performance C 145 1.91 0.29 2 1 2 1 -2.84 6.13 0.02
Temporal vs mental C 145 1.7 0.46 2 1 2 1 -0.85 -1.29 0.04
Effort vs physical C 145 1.08 0.28 1 1 2 1 3 7.03 0.02
Frustration vs mental C 145 1.81 0.39 2 1 2 1 -1.6 0.55 0.03
Performance vs effort C 145 1.43 0.5 1 1 2 1 0.26 -1.94 0.04
Feature set 3: questions of the Workload Profile
Solving deciding R 139 11.17 3.93 11 2 20 18 -0.18 -0.51 0.33
Response selection R 139 9.92 4.34 10 1 20 19 -0.16 -0.72 0.37
Task space R 139 8.74 4.71 9 1 20 19 0.07 -0.96 0.4
Verbal material R 139 12.48 3.8 13 2 20 18 -0.57 -0.32 0.32
Visual resources R 139 12.24 3.79 13 3 20 17 -0.45 -0.42 0.32
Auditory resources R 139 12.78 3.69 13 4 20 16 -0.3 -0.57 0.31
Manual response R 139 9.46 5.05 10 1 20 19 -0.03 -0.92 0.43
Speech response R 139 8.82 5.03 9 1 20 19 0.14 -0.98 0.43
Dependent features
OP-MWL (NASA group) R 145 10.68 3.19 11 2 17 15 -0.41 -0.39 0.27
OP-MWL (WP group) R 139 10.47 3.37 10 1 18 17 -0.38 -0.19 0.29
Table 3: Summary Table (ST) of the dataset features and targets (R=Range,
C=Categorical)
Data Preparation - The final datasets to be used for training purposes were
subsequently constructed. Two datasets were formed:
dataset NASA-TLX : this includes all the NASA-TLX features, in addition
to the binary preferences which emerged from the pairwise comparison of
the original instrument (Feature sets 1+2 of Table 3).
dataset WP: this includes all the eight features of WP (Feature set 3 of Table
3).
The dataset NASA-TLX had 41 missing values, spotted in 11 records (all in the
pair-wise comparison part); due to the limited amount of available data,
imputation was performed. The K-Nearest Neighbours (KNN) algorithm was applied
to estimate missing values based on the concept of similarity. This algorithm
has demonstrated good performance without affecting the quality of the data [2,17].
K represents the number of nearest instances to be considered while estimating
a missing value.
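As an illustration, this kind of imputation can be sketched with scikit-learn's KNNImputer; the study does not specify the implementation it used, and the toy matrix of binary pairwise-comparison answers below is invented:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix: rows are respondents, columns are binary
# pairwise-comparison answers (1 or 2); np.nan marks a missing answer.
X = np.array([[1, 2, 1, 2],
              [1, 2, np.nan, 2],
              [2, 1, 2, 1],
              [1, 2, 1, 2]], dtype=float)

# K = 2: the two most similar respondents (by distance over the
# observed answers) supply the estimate for the missing entry.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Here rows 0 and 3 agree with row 1 on every observed answer, so the missing entry is filled with their value for that column.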
Data Modelling - This stage is aimed at inducing models of mental workload
by learning from available data rather than making ad-hoc theory-driven models.
An assumption made is that the aggregation of those factors believed to model
mental workload is non-linear. Tackling the complex problem of MWL modelling,
and in the spirit of the No-Free-Lunch theorem [54] – stating that no single
approach always outperforms all others – different supervised
machine learning algorithms for non-linear regression were chosen. Each learning
strategy encodes a distinct set of assumptions, that is, a different inductive
bias. Additionally, a linear method based on probability was also selected for
comparison purposes:
Information-based: Random Forest by Randomization regression (Extra Trees:
Extremely Randomized Trees) [11];
Similarity-based: K-Nearest Neighbours regression [18];
Error-Based: Support Vector Regression (Radial basis function kernel) [3];
Probability-based: Bayesian Generalised Linear Model regression [10].
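A plausible scikit-learn instantiation of these four families is sketched below. The hyperparameter values are placeholders (the study tuned them by random search), the stand-in data is random, and BayesianRidge is used as an approximation of the Bayesian generalised linear model of [10]:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.linear_model import BayesianRidge

models = {
    # Information-based: Extremely Randomized Trees
    "extra_trees": ExtraTreesRegressor(n_estimators=100, random_state=0),
    # Similarity-based: K-Nearest Neighbours
    "knn": KNeighborsRegressor(n_neighbors=5),
    # Error-based: Support Vector Regression with an RBF kernel
    "svr_rbf": SVR(kernel="rbf"),
    # Probability-based linear model (stand-in for the Bayesian GLM)
    "bayes_linear": BayesianRidge(),
}

# Smoke test on random stand-in data (8 features, target in [0, 20])
rng = np.random.default_rng(0)
X, y = rng.uniform(1, 20, (60, 8)), rng.uniform(0, 20, 60)
preds = {name: m.fit(X, y).predict(X[:5]) for name, m in models.items()}
```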
The datasets were randomly split into 5 non-overlapping partitions of equal
size. Four of these were used for training purposes (80% of the data) and the
held-out set for testing purposes (20% of the data). The process was repeated
5 times, and at each time, the held-out set was different. The parameters
employed in each regression technique were automatically tuned through a
random search approach (the number of randomly selected predictors and the
number of random cuts for extra trees, the number of neighbours for KNN, and
sigma and the regularisation term for SVM). Additionally, 5-fold cross validation
was used in each training phase, with the Root Mean Square Error (RMSE) as the
metric for fitting the overall perception of mental workload (OP-MWL). Therefore,
5 surrogate models are expected for each training phase. The best one, that is,
the one with the lowest RMSE, was kept as the final induced model. Since the
process was repeated 5 times, as per figure 1, one is left with
5 induced models for each regression technique.
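The tuning-plus-cross-validation step above can be sketched with scikit-learn's RandomizedSearchCV, shown here for the KNN technique only; the feature matrix and target are random stand-ins, and the search range is an assumption:

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(1, 20, size=(120, 8))   # stand-in for the WP features
y = rng.uniform(0, 20, size=120)        # stand-in for OP-MWL

# Random search over the number of neighbours, scored by RMSE under
# 5-fold cross validation (sklearn maximises, hence the negated metric).
search = RandomizedSearchCV(
    KNeighborsRegressor(),
    param_distributions={"n_neighbors": randint(1, 20)},
    n_iter=10,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
best_rmse = -search.best_score_   # back on the [0, 20] target scale
```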
Model Evaluation - In order to evaluate the final induced models from data,
the following error metrics are evaluated [19,6]:
Mean Squared Error (MSE) (eq. 3). It is the most common metric for the
evaluation of regression-based models: the higher the value, the worse the
model. It is useful when large errors are particularly undesirable, since a
single very bad prediction is amplified by the squaring, skewing the metric
and overestimating the badness of the regression model (range [0, ∞));
Root Mean Squared Error (RMSE) (eq. 4). It is the square root of the MSE
and it expresses the error on the same scale as the target
variable (range [0, ∞); here [0, 20]);
Mean Absolute Error (MAE) (eq. 5). It is a linear score: all the individual
differences between expected and predicted outcomes are weighted equally in
the average. Contrary to MSE, it is not as sensitive to outliers (range
[0, ∞)).
\[
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (3)
\]

\[
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (4)
\]

\[
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (5)
\]

where \(y_i\) is the actual expected value and \(\hat{y}_i\) is the model's prediction.
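The three metrics can be sketched directly from their definitions; the example targets and predictions on the [0, 20] OP-MWL scale are invented:

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error (eq. 3)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error (eq. 4): MSE back on the target scale."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean Absolute Error (eq. 5): linear, less sensitive to outliers."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

# Invented OP-MWL targets and predictions:
y_true, y_pred = [10, 12, 8, 15], [11, 10, 8, 14]
# mse -> 1.5, rmse -> ~1.22, mae -> 1.0
```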
4 Results And Evaluation
4.1 Accuracy of the final induced models
Figure 4 depicts the box-plots containing the RMSE values for training. Accord-
ing to the previous design, each box plot contains 5 points, one for each final
induced model trained with 80% of the data. It can be observed that, in most of
the cases, the final induced models trained with the NASA-TLX features (feature
sets 1 + 2 of Table 3) have a lower RMSE than those models built
upon the WP features (feature set 3 of Table 3), even if this difference is not
significant. This denotes that the selected regression techniques can train a
model similarly and consistently. Also, since the target is on the scale [0, 20],
it denotes a small error in fitting the target feature (OP-MWL). In fact, errors,
on average, lie between 1 and 5 across the selected regression techniques, with
a mean of around 3. It can also be noted that the mean error of the Bayesian
generalised linear models is higher than that of the other, non-linear models,
preliminarily confirming the previous hypothesis of non-linearity of the
independent features. This means that the non-linear models can better learn
the non-linear aggregation of the independent features.
Fig. 4: The distributions of the RMSE of the final induced models, grouped by
features sets (NASA Task Load Index, Workload Profile). Each bar contains 5
values, one for each model grouped by the regression technique.
4.2 Convergent validity of the induced models
The convergent validity of the induced models is assessed by calculating the
Spearman’s correlation between their inferred MWL scores, and the scores pro-
duced by the baseline models (NASA-TLX, WP) using the testing sets. Figure 5
shows these correlation coefficients in box-plots, each containing 5 values corre-
sponding to the 5 trained models tested with the 5 testing sets of 20% each. The
Spearman’s correlation statistic was used because the assumptions behind the
Pearson’s correlation statistic were not met. Generally, moderate/high positive
coefficients have been found (with p < 0.05), indicating that the inferences
of the induced models, built with machine learning, are valid since they correlate
with the baseline models. Also, these results are in line with the recommendation
of [5], whereby convergent validities above ρ = 0.70 are recommended, whereas
those below ρ = 0.50 should be avoided.
Fig. 5: Convergent validity of the final induced models.
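This convergent-validity check can be sketched with scipy's spearmanr; the two score columns below are invented illustrations of an induced model's inferences and a baseline instrument's scores for six test subjects:

```python
from scipy.stats import spearmanr

# Invented MWL scores for six test subjects:
ml_scores = [8.2, 11.5, 9.1, 14.0, 6.3, 12.8]        # induced model
baseline_scores = [7.9, 12.1, 9.5, 13.2, 6.8, 12.0]  # baseline instrument

# Spearman's rank correlation between the two sets of inferences;
# rho above 0.70 would indicate acceptable convergent validity [5].
rho, p_value = spearmanr(ml_scores, baseline_scores)
```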
4.3 Face Validity of Induced Models
Face Validity was computed to measure the extent to which the final induced
models can actually grasp the construct of Mental Workload. This was deter-
mined by computing the error of the final induced models, and the selected
baselines, in predicting the overall perception of mental workload (OP-MWL)
with the testing data, that is, they are evaluated on unseen data. Figures
7, 8 and 9 show the scatterplots of this comparison while figure 6 depicts the
MSE, RMSE and MAE values. As in the previous case, each box-plot contains 5
values, corresponding to the 5 errors obtained with the testing sets (20% each).
(a) Mean Square Errors
(b) Root Mean Square Errors
(c) Mean Absolute Errors
Fig. 6: The distributions of the errors of the final induced models and baseline
models, grouped by features set used (NASA-TLX or Workload Profile). Each
bar contains 5 points, one for each model grouped by the regression technique.
Firstly, the situation is consistent with the training error: slightly higher for the
induced models trained with the WP features. However, the error boundaries
for the testing sets are narrower than those achieved during training. In fact,
the RMSE values, regardless of the regression technique employed, have a mean of
around 3, with shorter box-plots, suggesting a good degree of generalisability of
the induced models. Also it can be seen that the mean of the errors produced by
the baseline models is always higher than those produced by the induced models.
In other words, the baseline models generate indexes of mental workload that
are always more distant to the overall perception of mental workload, reported
by subjects, when compared to the distance of the inferences produced by the
machine learning models.
4.4 Discussion
Findings are promising and show how subjective mental workload can be mod-
elled with a higher degree of accuracy using data-driven techniques, when com-
pared to traditional subjective techniques, namely the NASA Task Load Index
and the Workload Profile, used as baselines. In detail, an analysis of the
convergent validity of the data-driven models, learnt from data by employing
supervised machine learning regression techniques, against the selected baseline
models, shows how these are theoretically related. In other words, if we believe
that the baseline models actually measure mental workload, then we can believe
the same of the data-driven models. With this confidence, a subsequent analysis of
their face validity showed how data-driven models can approximate the perception
of overall mental workload, as reported by subjects, with a higher degree of
precision (lower error) when compared to the selected baselines. This means that
the data-driven models cover the concept they purport to measure, that is,
Mental Workload, with higher precision. Findings are indeed restricted to the
dataset under consideration, but they motivate further research in this space.
5 Conclusion
This work presents an assessment of the ability of machine learning techniques
to model mental workload. The motivation behind this work was to shift from
state-of-the-art MWL modelling techniques – mainly theory-driven – to auto-
mated learning techniques able to induce MWL models from data. Specifically,
a number of learning regression techniques have been selected to induce models
of mental workload employing features gathered from users subjectively. These
features included the answers to the questionnaires of the NASA Task Load
Index and the Workload Profile, two baseline mental workload self-reporting
measures chosen for comparative purposes. The induced models were compared
against the two selected baselines through an assessment of their convergent and
face validity. Convergent validity was aimed at determining whether the induced
models were theoretically related to the selected baselines, known to model the
construct of mental workload. Face validity was aimed at determining whether
the induced models could actually cover the concept they purport to measure, that
is, Mental Workload. The former validity was assessed through a correlation
analysis of the mental workload scores produced by the induced models and the
selected baselines. The latter validity was assessed by investigating the error of
the machine learning models and the baselines to predict an overall perception
of mental workload subjectively reported by subjects, after the completion of
experimental tasks in third level education.
The findings of this experiment confirm that supervised machine learning
algorithms are potential alternatives to traditional theory-driven techniques for
modeling mental workload. Machine learning offers a promising mechanism for
furthering the understanding of the construct of mental workload, the
relationships among its factors and their impact on task performance. A
viable direction for future work would be to extend the current experiment with
an in-depth evaluation of the importance of each feature for predicting the overall
perception of mental workload. Subsequently, simpler mental workload models
could be created containing the most important features. This can increase the
understanding of the complex but fascinating construct of mental workload and
contribute towards the ultimate goal of building a highly generalisable model
that can be employed across fields, disciplines and experimental contexts.
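As a sketch of the feature-importance analysis proposed as future work, the fragment below applies permutation importance to a toy regression problem: one feature is shuffled at a time and the resulting increase in prediction error is recorded. The data, the linear stand-in model and the coefficient values are all hypothetical; only the mechanic is of interest here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for questionnaire features: six 0-100 rating dimensions
n, d = 200, 6
X = rng.uniform(0, 100, size=(n, d))
# Hypothetical ground truth: overall MWL driven mainly by features 0 and 3
y = 0.6 * X[:, 0] + 0.3 * X[:, 3] + rng.normal(0, 2.0, size=n)

# Ordinary least squares as a stand-in for any induced regressor
Xb = np.column_stack([X, np.ones(n)])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
predict = lambda M: np.column_stack([M, np.ones(len(M))]) @ w

def mse(a, b):
    return float(np.mean((a - b) ** 2))

base_err = mse(predict(X), y)

# Permutation importance: error increase when a single feature is shuffled
importance = []
for j in range(d):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(mse(predict(Xp), y) - base_err)
```

Features whose shuffling barely degrades the error are candidates for removal, which is how the simpler mental workload models mentioned above could be obtained.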
Appendix
Fig. 7: Scatterplots of the overall perception of mental workload reported by
subjects (OP-MWL) (x-axis) and the predictions of the induced models (y-axis)
for the NASA-TLX (left) and the Workload Profile (right), grouped by fold.
Fig. 8: Scatterplots of the overall perception of mental workload (x-axis), as
reported by subjects, and the predictions of the induced models (y-axis) for the
5 models produced by each of the regression algorithms (Extra Trees: col. 1;
KNN: col. 2; SVR: col. 3; NB: col. 4), employing the features of the NASA Task
Load Index.
Fig. 9: Scatterplots of the overall perception of mental workload (x-axis), as
reported by subjects, and the predictions of the induced models (y-axis) for the
5 models produced by each of the regression algorithms (Extra Trees: col. 1;
KNN: col. 2; SVR: col. 3; NB: col. 4), employing the features of the Workload
Profile.
... Self-report measures, often referred to as subjective measures, involve a participant or "subject" who usually provides qualitative and/or quantitative reports concerning his/her personal experience while performing either a primary, or secondary task or both (Moray, 1982;Vidulich, 1988;Nygren, 1991;DiDomenico and Nussbaum, 2008;Moustafa and Longo, 2018). In many self-report measures, a user is asked to answer a pre and/or a post-task questionnaire. ...
Article
Full-text available
Human mental workload is arguably the most invoked multidimensional construct in Human Factors and Ergonomics, getting momentum also in Neuroscience and Neuroergonomics. Uncertainties exist in its characterization, motivating the design and development of computational models, thus recently and actively receiving support from the discipline of Computer Science. However, its role in human performance prediction is assured. This work is aimed at providing a synthesis of the current state of the art in human mental workload assessment through considerations, definitions, measurement techniques as well as applications, Findings suggest that, despite an increasing number of associated research works, a single, reliable and generally applicable framework for mental workload research does not yet appear fully established. One reason for this gap is the existence of a wide swath of operational definitions, built upon different theoretical assumptions which are rarely examined collectively. A second reason is that the three main classes of measures, which are self-report, task performance, and physiological indices, have been used in isolation or in pairs, but more rarely in conjunction all together. Multiple definitions complement each another and we propose a novel inclusive definition of mental workload to support the next generation of empirical-based research. Similarly, by comprehensively employing physiological, task-performance, and self-report measures, more robust assessments of mental workload can be achieved.
... Mental workload has been regarded as an essential factor that substantially influences task performance (Young et al., 2015;Galy, 2018;Longo, 2018a). As a construct, it has been widely applied in the design and evaluation of complex human-machine systems and environments such as in aircraft operation (Hu and Lodewijks, 2020;Yu et al., 2021), train and vehicle operation (Li et al., 2020;Wang et al., 2021), nuclear power plants (Gan et al., 2020;Wu et al., 2020), various human-computer and braincomputer interfaces (Longo, 2012;Asgher et al., 2020;Putze et al., 2020;Bagheri and Power, 2021) and in educational contexts (Moustafa and Longo, 2019;Orru and Longo, 2019;Longo and Orr, 2020;Longo and Rajendran, 2021), to name a few. Mental workload research has accumulated momentum over the last two decades, given the fact that numerous technologies have emerged that engage users in multiple cognitive levels and requirements for different task activities operating in diverse environmental conditions. ...
Article
Full-text available
Many research works indicate that EEG bands, specifically the alpha and theta bands, have been potentially helpful cognitive load indicators. However, minimal research exists to validate this claim. This study aims to assess and analyze the impact of the alpha-to-theta and the theta-to-alpha band ratios on supporting the creation of models capable of discriminating self-reported perceptions of mental workload. A dataset of raw EEG data was utilized in which 48 subjects performed a resting activity and an induced task demanding exercise in the form of a multitasking SIMKAP test. Band ratios were devised from frontal and parietal electrode clusters. Building and model testing was done with high-level independent features from the frequency and temporal domains extracted from the computed ratios over time. Target features for model training were extracted from the subjective ratings collected after resting and task demand activities. Models were built by employing Logistic Regression, Support Vector Machines and Decision Trees and were evaluated with performance measures including accuracy, recall, precision and f1-score. The results indicate high classification accuracy of those models trained with the high-level features extracted from the alpha-to-theta ratios and theta-to-alpha ratios. Preliminary results also show that models trained with logistic regression and support vector machines can accurately classify self-reported perceptions of mental workload. This research contributes to the body of knowledge by demonstrating the richness of the information in the temporal, spectral and statistical domains extracted from the alpha-to-theta and theta-to-alpha EEG band ratios for the discrimination of self-reported perceptions of mental workload.
... Subjective measures, also referred to as self-reported measures, rely on the individual's perceived experience of the interaction with a learning task. They are based on the assumption that only the individual involved in the task can provide an accurate and precise judgement about the experienced load, as employed in a number of studies [13,30]. The perception of the individual can be gathered through means of a survey or questionnaire. ...
Chapter
Instructional efficiency within education is a measurable concept and models have been proposed to assess it. The main assumption behind these models is that efficiency is the capacity to achieve established goals at the minimal expense of resources. This article challenges this assumption by contributing to the body of Knowledge with a novel model that is grounded on ideal mental workload and performance, namely the parabolic model of instructional efficiency. A comparative empirical investigation has been constructed to demonstrate the potential of this model for instructional design evaluation. Evidence demonstrated that this model achieved a good concurrent validity with the well-known likelihood model of instructional efficiency, treated as baseline, but a better discriminant validity for the evaluation of the training and learning phases. Additionally, the inferences produced by this novel model have led to a superior information gain when compared to the baseline.
... Subjective measures, also referred to as self-reported measures, rely on the individual's perceived experience of the interaction with a learning task. They are based on the assumption that only the individual involved in the task can provide an accurate and precise judgement about the experienced load, as employed in a number of studies [13,30]. The perception of the individual can be gathered through means of a survey or questionnaire. ...
Preprint
Full-text available
Instructional efficiency within education is a measurable concept and models have been proposed to assess it. The main assumption behind these models is that efficiency is the capacity to achieve established goals at the minimal expense of resources. This article challenges this assumption by contributing to the body of Knowledge with a novel model that is grounded on ideal mental workload and performance, namely the parabolic model of instructional efficiency. A comparative empirical investigation has been constructed to demonstrate the potential of this model for instructional design evaluation. Evidence demonstrated that this model achieved a good concurrent validity with the well-known likelihood model of instructional efficiency, treated as baseline, but a better discriminant validity for the evaluation of the training and learning phases. Additionally, the inferences produced by this novel model have led to a superior information gain when compared to the baseline.
... Modeling the construct of Mental Workload is not easy. Many approaches to aggregate factors believed to influence mental workload have been proposed, both deductive as in Rizzo et al. (2016), Rizzo and Longo (2017) and Rizzo and Longo (2018), and inductive as in Moustafa, Luz, and Longo (2017) and Moustafa and Longo (2019). Regardless of the aggregation strategies three main classes of measures, as described in the following sections, exist: self-reporting, task-performance measures and physiological measures. ...
Article
Full-text available
Cognitive cognitive load theory (CLT) has been conceived for improving instructional design practices. Although researched for many years, one open problem is a clear definition of its cognitive load types and their aggregation towards an index of overall cognitive load. In Ergonomics, the situation is different with plenty of research devoted to the development of robust constructs of mental workload (MWL). By drawing a parallel between CLT and MWL, as well as by integrating relevant theories and measurement techniques from these two fields, this paper is aimed at investigating the reliability, validity and sensitivity of three existing self-reporting mental workload measures when applied to long learning sessions, namely, the NASA Task Load index, the Workload Profile and the Rating Scale Mental Effort, in a typical university classroom. These measures were aimed at serving for the evaluation of two instructional conditions. Evidence suggests these selected measures are reliable and their moderate validity is in line with results obtained within Ergonomics. Additionally, an analysis of their sensitivity by employing the descriptive Harrell-Davis estimator suggests that the Workload Profile is more sensitive than the Nasa Task Load Index and the Rating Scale Mental Effort for long learning sessions.
... Despite 50 years of research, unfortunately it is still an ill-defined construct, with many definitions, discipline-specific and with many application-dependent models [40]. Similarly, many formalisms exist to represent and assess mental workload [22,39,88,89]. Along with the various ad hoc models of mental workload, the many definitions and context-specific models, a number of criteria for evaluating these have been proposed [90], including reliability, validity, sensitivity and diagnosticity. ...
Article
Full-text available
Knowledge-representation and reasoning methods have been extensively researched within Artificial Intelligence. Among these, argumentation has emerged as an ideal paradigm for inference under uncertainty with conflicting knowledge. Its value has been predominantly demonstrated via analyses of the topological structure of graphs of arguments and its formal properties. However, limited research exists on the examination and comparison of its inferential capacity in real-world modelling tasks and against other knowledge-representation and non-monotonic reasoning methods. This study is focused on a novel comparison between defeasible argumentation and non-monotonic fuzzy reasoning when applied to the representation of the ill-defined construct of human mental workload and its assessment. Different argument-based and non-monotonic fuzzy reasoning models have been designed considering knowledge-bases of incremental complexity containing uncertain and conflicting information provided by a human reasoner. Findings showed how their inferences have a moderate convergent and face validity when compared respectively to those of an existing baseline instrument for mental workload assessment, and to a perception of mental workload self-reported by human participants. This confirmed how these models also reasonably represent the construct under consideration. Furthermore, argument-based models had on average a lower mean squared error against the self-reported perception of mental workload when compared to fuzzy-reasoning models and the baseline instrument. The contribution of this research is to provide scholars, interested in formalisms on knowledge-representation and non-monotonic reasoning, with a novel approach for empirically comparing their inferential capacity.
... Multiple models that predict Mental workload exist for numerous domains. According to research by Moustafa and Longo (2018), the current mental workload models are very complex. These models ignore the in-depth evaluation of each feature leading to intricate models. ...
Thesis
Full-text available
Human Mental Workload is an intervening variable and a fundamental concept in the discipline of Ergonomics. It is deduced from variations in performance. High or low mental workload leads to hampering of performance. Mental workload in an educa- tional setting has been extensively researched. It is applied in instructional design but it is obscure as to which factors are majorly driving mental workload in learners. This dissertation investigates the importance of the features used in the the NASA-Task Load Index mental workload assessment instrument and their impact on the perfor- mance of learners as assessed by multiple-choice tests conducted in classrooms of an MSc programme in a university. Model training is performed on these attributes using machine learning approaches including decision tree regression and linear regression. Montecarlo sampling was used in the training phase to ensure model stability. The identification of the importance of selected features is carried on using the permutation feature technique since it is adaptable and applicable across a variety of supervised learning methods. Empirical evidence emphasises the absence of more important fea- tures over the others tentatively suggesting their applicability in a multi-dimensional model.
... According to Cain (2007), the main reason for measuring MWL is to quantify the mental cost of performing a certain task, in order to predict user and system performance. It has been recently employed in the fields of education (Orru and Longo, 2019b,a;Orru et al., 2018;Moustafa and Longo, 2019) and health-care (Longo, 2015b;Byrne, 2017). Mainly, it is used in the areas of psychology and ergonomics, with applications in aviation and automobile industries (Paxion et al., 2014) and in interface and web design (Tracy et al., 2006;Longo, 2011;Junior et al., 2019). ...
Thesis
Full-text available
Limited work exists for the comparison across distinct knowledge-based approaches in Artificial Intelligence (AI) for non-monotonic reasoning, and in particular for the examination of their inferential and explanatory capacity. Non-monotonicity, or defeasibility, allows the retraction of a conclusion in the light of new information. It is a similar pattern to human reasoning, which draws conclusions in the absence of information, but allows them to be corrected once new pieces of evidence arise. Thus, this thesis focuses on a comparison of three approaches in AI for implementation of non-monotonic reasoning models of inference, namely: expert systems, fuzzy reasoning and defeasible argumentation. Three applications from the fields of decision-making in healthcare and knowledge representation and reasoning were selected from real-world contexts for evaluation: human mental workload modelling, computational trust modelling, and mortality occurrence modelling with biomarkers. The link between these applications comes from their presumptively non-monotonic nature. They present incomplete, ambiguous and retractable pieces of evidence. Hence, reasoning applied to them is likely suitable for being modelled by non-monotonic reasoning systems. An experiment was performed by exploiting six deductive knowledge bases produced with the aid of domain experts. These were coded into models built upon the selected reasoning approaches and were subsequently elicited with real-world data. The numerical inferences produced by these models were analysed according to common metrics of evaluation for each field of application. For the examination of explanatory capacity, properties such as understandability, extensibility, and post-hoc interpretability were meticulously described and qualitatively compared. Findings suggest that the variance of the inferences produced by expert systems and fuzzy reasoning models was higher, highlighting poor stability. 
In contrast, the variance of argument-based models was lower, showing a superior stability of its inferences across different system configurations. In addition, when compared in a context with large amounts of conflicting information, defeasible argumentation exhibited a stronger potential for conflict resolution, while presenting robust inferences. An in-depth discussion of the explanatory capacity showed how defeasible argumentation can lead to the construction of non-monotonic models with appealing properties of explainability, compared to those built with expert systems and fuzzy reasoning. The originality of this research lies in the quantification of the impact of defeasible argumentation. It illustrates the construction of an extensive number of non-monotonic reasoning models through a modular design. In addition, it exemplifies how these models can be exploited for performing non-monotonic reasoning and producing quantitative inferences in real-world applications. It contributes to the field of non-monotonic reasoning by situating defeasible argumentation among similar approaches through a novel empirical comparison.
Preprint
Full-text available
Many research works indicate that EEG bands, specifically the alpha and theta bands, have been potentially helpful cognitive load indicators. However, minimal research exists to validate this claim. This study aims to assess and analyze the impact of the alpha-to-theta and the theta-to-alpha band ratios on supporting the creation of models capable of discriminating self-reported perceptions of mental workload. A dataset of raw EEG data was utilized in which 48 subjects performed a resting activity and an induced task demanding exercise in the form of a multitasking SIMKAP test. Band ratios were devised from frontal and parietal electrode clusters. Building and model testing was done with high-level independent features from the frequency and temporal domains extracted from the computed ratios over time. Target features for model training were extracted from the subjective ratings collected after resting and task demand activities. Models were built by employing Logistic Regression, Support Vector Machines and Decision Trees and were evaluated with performance measures including accuracy, recall, precision and f1-score. The results indicate high classification accuracy of those models trained with the high-level features extracted from the alpha-to-theta ratios and theta-to-alpha ratios. Preliminary results also show that models trained with logistic regression and support vector machines can accurately classify self-reported perceptions of mental workload. This research contributes to the body of knowledge by demonstrating the richness of the information in the temporal, spectral and statistical domains extracted from the alpha-to-theta and theta-to-alpha EEG band ratios for the discrimination of self-reported perceptions of mental workload.
Book
This book constitutes the refereed proceedings of the 5th International Symposium on Human Mental Workload: Models and Applications, H-WORKLOAD 2021, held virtually in November 2021. The volume presents 9 revised full papers, which were carefully reviewed and selected from 16 submissions. The papers are organized in two topical sections on models and applications.
Conference Paper
Full-text available
Inferences through knowledge driven approaches have been researched extensively in the field of Artificial Intelligence. Among such approaches argumentation theory has recently shown appealing properties for inference under uncertainty and conflicting evidence. Nonetheless , there is a lack of studies which examine its inferential capacity over other quantitative theories of reasoning under uncertainty with real-world knowledge-bases. This study is focused on a comparison between argumentation theory and non-monotonic fuzzy reasoning when applied to modeling the construct of human mental workload (MWL). Different argument-based and non-monotonic fuzzy reasoning models, aimed at inferring the MWL imposed by a selection of learning tasks, in a third-level context, have been designed. These models are built upon knowledge-bases that contain uncertain and conflicting evidence provided by human experts. An analysis of the convergent and face validity of such models has been performed. Results suggest a superior inferential capacity of argument-based models over fuzzy reasoning-based models.
Article
Full-text available
Past research in HCI has generated a number of procedures for assessing the usability of interacting systems. In these procedures there is a tendency to omit characteristics of the users, aspects of the context and peculiarities of the tasks. Building a cohesive model that incorporates these features is not obvious. A construct greatly invoked in Human Factors is human Mental Workload. Its assessment is fundamental for predicting human performance. Despite the several uses of Usability and Mental Workload, not much has been done to explore their relationship. This empirical research focused on I) the investigation of such a relationship and II) the investigation of the impact of the two constructs on human performance. A user study was carried out with participants executing a set of information-seeking tasks over three popular web-sites. A deep correlation analysis of usability and mental workload, by task, by user and by classes of objective task performance was done (I). A number of Supervised Machine Learning techniques based upon different learning strategies were employed for building models aimed at predicting classes of task performance (II). Findings strongly suggest that usability and mental workload are two non-overlapping constructs and they can be jointly employed to greatly improve the prediction of human performance.
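The idea of jointly using usability and workload scores as predictors of performance classes can be illustrated with a toy sketch. The data, the generating rule and the choice of a random-forest learner here are all illustrative assumptions, not the study's actual design.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 150
sus = rng.uniform(0, 100, n)   # per-task System Usability Scale score
tlx = rng.uniform(0, 100, n)   # per-task NASA-TLX mental workload score

# toy rule: high usability and moderate (neither extreme) workload
# tend to yield better objective performance, plus noise
score = 0.6 * sus - 0.4 * np.abs(tlx - 50) + rng.normal(0, 10, n)
perf_class = (score > np.median(score)).astype(int)  # binary performance class

X = np.column_stack([sus, tlx])
acc = cross_val_score(RandomForestClassifier(random_state=0),
                      X, perf_class, cv=5).mean()
print(round(acc, 2))
```

Because the two predictors carry complementary information in this toy setup, the classifier recovers the performance classes well above chance, mirroring the paper's point that the two constructs jointly improve prediction.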
Conference Paper
Full-text available
The NASA Task Load Index (NASA-TLX) and the Workload Profile (WP) are likely the most employed instruments for subjective mental workload (MWL) measurement. Numerous areas have made use of these methods for assessing human performance and thus improving the design of systems and tasks. Unfortunately, MWL is still a vague concept, with different definitions and no universal measure. This research investigates the use of defeasible reasoning to represent and assess MWL. Reasoning is defeasible when a conclusion, supported by a set of premises, can be retracted in the light of new information. In this empirical study, this type of reasoning is considered for modelling MWL, given the intrinsic uncertainty involved in assessing it. In particular, it is shown how the NASA-TLX and the WP can be translated into defeasible structures whose inferences can achieve similar validity to the original instruments, even when less information is available. It is also discussed how these structures can have a higher extensibility and how their inferences are more self-explanatory than the ones produced by the NASA-TLX and WP.
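For reference, the weighted NASA-TLX score that such defeasible structures are compared against is a simple weighted aggregation: each of the six 0-100 subscale ratings is multiplied by the number of times its dimension was preferred in the 15 pairwise comparisons, and the sum is divided by 15. The sketch below implements this standard procedure; the example ratings and weights are invented.

```python
DIMENSIONS = ["mental", "physical", "temporal",
              "performance", "effort", "frustration"]

def nasa_tlx(ratings, weights):
    """Weighted NASA-TLX score: 0-100 subscale ratings weighted by the
    number of wins in the 15 pairwise dimension comparisons (6 choose 2)."""
    assert sum(weights.values()) == 15
    return sum(ratings[d] * weights[d] for d in DIMENSIONS) / 15

ratings = {"mental": 70, "physical": 10, "temporal": 55,
           "performance": 40, "effort": 60, "frustration": 35}
weights = {"mental": 5, "physical": 0, "temporal": 3,
           "performance": 2, "effort": 4, "frustration": 1}
print(nasa_tlx(ratings, weights))  # → 58.0
```

The Workload Profile, by contrast, is typically an unweighted mean over its eight rated dimensions.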
Conference Paper
Full-text available
Self-reporting procedures and inspection methods have been largely employed in the fields of interaction and web-design for assessing the usability of interfaces. However, there seems to be a propensity to ignore features related to end-users or the context of application during the usability assessment procedure. This research proposes the adoption of the construct of mental workload as an additional aid to inform interaction and web-design. A user study has been performed in the context of human-web interaction. The main objective was to explore the relationship between the perception of usability of the interfaces of three popular web-sites and the mental workload imposed on end-users by a set of typical tasks executed over them. Usability scores computed employing the System Usability Scale were compared and related to the mental workload scores obtained employing the NASA Task Load Index and the Workload Profile self-reporting assessment procedures. Findings suggest that perception of usability and subjective assessment of mental workload are two independent, not fully overlapping constructs. They measure two different aspects of the human-system interaction. This distinction enabled the demonstration of how these two constructs can be jointly employed to better explain objective performance of end-users, a dimension of user experience, and inform interaction and web-design.
Article
Full-text available
An accurate measure of mental workload level has diverse neuroergonomic applications ranging from brain-computer interfacing to improving the efficiency of human operators. In this study, we integrated electroencephalogram (EEG), functional near-infrared spectroscopy (fNIRS), and physiological measures for the classification of three workload levels in an n-back working memory task. Significantly better-than-chance classification was achieved by EEG-alone, fNIRS-alone, physiological-alone, and EEG+fNIRS based approaches. The results confirmed our previous finding that integrating EEG and fNIRS significantly improved workload classification compared to using EEG alone or fNIRS alone. The inclusion of physiological measures, however, does not significantly improve EEG-based or fNIRS-based workload classification. A major limitation of currently available mental workload assessment approaches is the requirement to record lengthy calibration data from the target subject to train workload classifiers. We show that by learning from the data of other subjects, workload classification accuracy can be improved, especially when the amount of data from the target subject is small.
Article
Full-text available
We studied the capability of a hybrid functional neuroimaging technique to quantify human mental workload (MWL). We have used electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) as imaging modalities with 17 healthy subjects performing the letter n-back task, a standard experimental paradigm related to working memory (WM). The level of MWL was parametrically changed by variation of n from 0 to 3. Nineteen EEG channels were covering the whole head and 19 fNIRS channels were located on the forehead to cover the most dominant brain region involved in WM. Grand block averaging of recorded signals revealed specific behaviors of oxygenated-hemoglobin level during changes in the level of MWL. A machine learning approach has been utilized for detection of the level of MWL. We extracted different features from EEG, fNIRS, and EEG+fNIRS signals as the biomarkers of MWL and fed them to a linear support vector machine (SVM) as train and test sets. These features were selected based on their sensitivity to the changes in the level of MWL according to the literature. We introduced a new category of features within fNIRS and EEG+fNIRS systems. In addition, the performance level of each feature category was systematically assessed. We also assessed the effect of number of features and window size on classification performance. An SVM classifier was used to discriminate between different combinations of cognitive states in binary- and multi-class settings. In addition to the cross-validated performance level of the classifier, other metrics such as sensitivity, specificity, and predictive values were calculated for a comprehensive assessment of the classification system. The hybrid (EEG+fNIRS) system had an accuracy that was significantly higher than that of either EEG or fNIRS. Our results suggest that EEG+fNIRS features combined with a classifier are capable of robustly discriminating among various levels of MWL.
Results suggest that EEG+fNIRS should be preferred to only EEG or fNIRS, in developing passive BCIs and other applications which need to monitor users' MWL.
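The feature-fusion idea behind the hybrid EEG+fNIRS result can be illustrated with a toy sketch: concatenate the per-modality feature vectors and feed them to a linear SVM. The synthetic features below are invented stand-ins, not real EEG or fNIRS biomarkers, and the dimensionalities are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 120
y = rng.integers(0, 2, n)  # two workload levels (e.g. low vs high n-back)

# synthetic modality features: each carries noisy, partial class information
eeg_feats = y[:, None] * 0.8 + rng.standard_normal((n, 4))
fnirs_feats = y[:, None] * 0.8 + rng.standard_normal((n, 3))

def cv_accuracy(X):
    # mean 5-fold cross-validated accuracy of a linear SVM
    return cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()

print("EEG-only :", round(cv_accuracy(eeg_feats), 2))
print("EEG+fNIRS:", round(cv_accuracy(np.hstack([eeg_feats, fnirs_feats])), 2))
```

Because each modality contributes complementary (here, independent) evidence about the class, the concatenated feature set typically yields higher cross-validated accuracy than either modality alone, which is the qualitative pattern the study reports.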
Book
This book constitutes the refereed proceedings of the First International Symposium on Human Mental Workload: Models and Applications, H-WORKLOAD 2017, held in Dublin, Ireland, in June 2017. The 15 revised full papers presented together with two keynotes were carefully reviewed and selected from 35 submissions. The papers are organized in two topical sections on models and applications.
Conference Paper
I present a number of looming barriers to a smooth path of progress for cognitive workload assessment. The first of these is the AIDs of workload (i.e., association, indifference, and dissociation) between its various reflections (i.e., subjective, physiological, and performance measures). The second is the manner in which the time-varying change in imposed task demand links to the workload response, and what specific characteristics of the former drive the latter. The third is the persistent but largely unaddressed issue of the meaningfulness of the work undertaken. Thus, does interesting and involving work result in lower workload and vice-versa? If these foregoing and predominantly methodological concerns can be overcome, then the utility of the workload construct can continue to grow. If they cannot be resolved, then workload assessment threatens to be ineffective in a world which desperately requires a valid and reliable way to index cognitive achievement.
Conference Paper
I describe below the manner in which workload measurement can be used to validate models that predict workload. These models in turn can be employed to predict the decisions that are made, which select a course of action that is of lower effort or workload, but may also be of lower expected value (or higher expected cost). We then elaborate on four different contexts in which these decisions are made, with non-trivial consequences to performance and learning: switching attention, accessing information, studying material, and behaving safely. Each of these four is illustrated by a series of examples.