Affective Human-Robot Interaction with
Multimodal Explanations
Hongbo Zhu, Chuang Yu, and Angelo Cangelosi
The University of Manchester, UK
{hongbo.zhu,chuang.yu,angelo.cangelosi}@manchester.ac.uk
Abstract. Facial expressions are one of the most practical and straightforward ways to communicate emotions. Facial Expression Recognition has been used in many fields, such as human behaviour understanding and health monitoring. Deep learning models can achieve excellent performance in facial expression recognition tasks, but because these deep neural networks have very complex nonlinear structures, it is not easy for human users to understand the basis of a model's prediction. Specifically, we do not know how much each facial unit contributes to the classification. Developing affective computing models with more explainable and transparent feedback for human interactors is essential for trustworthy human-robot interaction. Compared to "white-box" approaches, "black-box" approaches using deep neural networks have advantages in terms of overall accuracy but lack reliability and explainability. In this work, we introduce a multimodal affective human-robot interaction framework with visual-based and verbal-based explanations, generated by Layer-Wise Relevance Propagation (LRP) and Local Interpretable Model-Agnostic Explanations (LIME). The proposed framework has been tested on the KDEF dataset and in human-robot interaction experiments with the Pepper robot. This experimental evaluation shows the benefits of linking deep learning emotion recognition systems with explainable strategies.
Keywords: Explainable robotics · Facial Expression Recognition (FER) · eXplainable Artificial Intelligence (XAI) · Human-Robot Interaction (HRI)
1 Introduction
Facial expression is a critical non-verbal communication strategy: human emotions can be expressed through facial expressions, which can be read and interpreted by emotional AI technology [7, 26]. Facial expression detection is significant for patients with specific diseases or congenital disabilities [13], especially when they cannot express their thoughts through words and actions. In such cases, real-time facial emotion detection needs to be performed so that corresponding medical measures can be taken for the patient.
The advancement of AI poses challenges for humans to trace model results,
especially in the field of deep learning. It is difficult for data scientists and
even engineers who write AI algorithms to explain what is happening inside the
models and how these AI models come to specific results [1]. XAI, a set of methods and processes that enable users to understand the output of AI models [2], is proposed to address this dilemma. AI developers need a comprehensive understanding and awareness of a model's working mechanism in order to monitor whether its working process complies with regulations, thereby reducing legal and security risks and gaining user trust.
In this work, we explored how explainable methods (namely LRP and LIME)
could make facial emotion recognition more transparent and trustworthy with
visual and verbal explanations. In the visual interpretation extraction part, LRP
was utilized to provide a visual explanation on a CNN-based emotion classifier.
For the verbal interpretation extraction part, Openface [4] was used to recognise
face action units and calculate the related intensity. Then LIME was employed
to analyse the contribution of each Action Unit (AU) to the model prediction.
The pipeline of our model is shown in Figure 1. Firstly, the Pepper robot
predicts the facial emotion states of the interactor during HRI. Then, Pepper
verbalises the predicted emotion as linguistic feedback and shows the heatmap
generated from the LRP model as explainable visual feedback. In addition, the
robot can give more detailed emotion recognition feedback to increase interaction
transparency. This multimodal explanation feedback can help the human interactor understand the robot's internal emotion recognition process, facilitating human-robot trust.
Fig. 1. The proposed multimodal explanation framework
In summary, the contributions of this paper are as follows:
– We retrained a VGG16 deep learning model to perform the emotion recognition task on the KDEF dataset, and LRP was utilized to highlight the crucial pixel features of the input image and generate a heatmap-based explanation.
– We made use of Openface to detect AUs and calculated the corresponding intensities; then a random forest was used to perform emotion prediction. Finally, an AU-based explanation was generated by LIME.
– The proposed multimodal explainable method was tested on the Pepper robot: the generated heatmap is shown on the robot's chest screen, while a verbal explanation based on the AUs identified by LIME is given at the same time. In this way, trust can be built, and the feedback from the interactor can be used to improve the facial expression recognition (FER) model.
2 Related Works and Background
Deep learning-based models and Facial Action Coding System (FACS)-based models are the two mainstream approaches to facial emotion recognition [25].
Compared with traditional ML methods, deep learning-based black-box meth-
ods have higher accuracy but usually lack reliability and interpretability due
to the complex network structure. Explainable AI is proposed to solve this
challenge. Common explainable methods are backpropagation-based Layer-Wise
Relevance Propagation (LRP) [3] and perturbation-based Local Interpretable
Model-agnostic Explanations (LIME) [17]. The main goal of these methods is to
find activation regions in the DL model and highlight the parts of the input image
that have a decisive influence on the classifier’s decision. While these methods
account for the contribution of the input image at the pixel level, they do not
give an explanation at the facial action unit level. The Facial Action Coding System (FACS) [6] is the standard used by most FER models for estimating and recognising AUs. It is based on the activation of facial muscles during facial expressions; these activations are represented by AUs. Action units (AUs) [22] are mostly used as features, feeding classifiers that recognize emotions.
Numerous interpretable techniques have been deployed to explain the dynamic processes of AI models. We explored backpropagation-based and perturbation-based explainable methods and used them to develop our multimodal explanation architecture for FER in human-robot interaction.
2.1 Backpropagation-based Explanation
Backpropagation is an internal algorithm common across neural network archi-
tectures. It is used to calculate the gradient of the loss function with respect to the weights of the connections between the layers of the network, and to understand the correlation between the input and output of a network [15]. In backpropagation-based explainable methods, attributions are calculated by backpropagating one or more times through the network.
Layer-Wise Relevance Propagation is a backpropagation-based interpretable
method [14]. It calculates importance scores in a layer-by-layer approximation of
backpropagation, which does not interact with the training process of the net-
work and can be easily applied to an already trained DNN model. LRP provides
an intuitive human-readable heatmap of input images at the pixel level. It uses
the network weights and the neural activation created by the forward-pass to
propagate the output back through the network from the predicted output to the
input layer [16]. The heatmap is used to visualize the contribution of each pixel
to the prediction. The contribution of each intermediate neuron or pixel is quantified as a relevance value R, representing how important the given pixel is to a particular prediction.
2.2 Perturbation-based Explanation
The perturbation-based XAI method modifies the input of the model to in-
vestigate which parts of the input elements are more critical for the model’s
predictions [8]. Specifically, a disturbance is generated by occluding some pixels or replacing some words in a sentence, and the changes in the output are then observed. If the disturbed input significantly changes the output, the perturbed element is considered to be highly significant. The
perturbation-based interpretable methods are generally applicable to the vast
majority of deep learning models [18].
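As a concrete illustration of this idea (our own minimal sketch, not code from the paper), the function below slides a grey patch over an image and records how much the predicted class probability drops; `model` is assumed to be any callable returning an (N, num_classes) array of class probabilities.

```python
import numpy as np

def occlusion_map(model, image, target_class, patch=16, stride=8, fill=0.5):
    """Perturbation-based saliency: occlude patches of the input and measure
    the drop in the predicted probability of the target class."""
    h, w, _ = image.shape
    base_prob = model(image[None])[0, target_class]       # unperturbed prediction
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heatmap = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            y, x = i * stride, j * stride
            occluded[y:y + patch, x:x + patch, :] = fill   # grey patch
            prob = model(occluded[None])[0, target_class]
            heatmap[i, j] = base_prob - prob               # large drop = important region
    return heatmap
```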
Local Interpretable Model-agnostic Explanations is a commonly used post-
hoc perturbation-based explainable model [12]. It can generate instance-based
explanations for the model predictions. For a given input sample Y, LIME gen-
erates perturbed data near Y. The weights of the perturbed data are calculated
according to how close they are to the sample Y. LIME then trains an inter-
pretable sparse linear model on the perturbed dataset as a local approximate
classifier. In contrast to most backpropagation-based algorithms that need to
use the internal information of the classification model to generate explanations,
LIME generates explanations without accessing the model's internals [8].
2.3 Emotion Recognition for HRI
Unlike FER in human-computer interaction, where the position of the face relative to the camera is relatively fixed, robotic emotion recognition is affected by environmental factors, making it an extremely challenging task to enable a robot to understand emotions. Emotional robots have many real-life applications, and
studies have shown that, in the treatment of children with autism, the children are more inclined to interact with robots than with humans [21]. Therefore, building a robotic system
with emotional intelligence will help detect the emotional state of autistic chil-
dren in real-time, thereby providing more efficient treatment. By dynamically
interacting with the external environment, emotional robots can also learn better
adaptability and flexibility.
3 Methodology
In this paper, explainable emotion recognition in HRI was explored through a backpropagation-based model for visual explanation and a perturbation-based model for verbal explanation, as introduced below.
3.1 Visual-based Explanation
As for visual-based explanation, we explored explainable facial emotion recogni-
tion with LRP. This can explain inputs’ relevance for a certain prediction, typi-
cally for image processing. Using LRP, the robot can extract the facial heatmap
that highlights the facial pixels most responsible for the emotion prediction task. A VGG16-based model is used for face emotion recognition in this paper. VGG16
[5] is a simple and widely used convolutional neural network model for image
classification tasks. In this work, we reused the pre-trained image classification VGG16 model and fine-tuned it for our facial emotion recognition task.
The input images of this model have a fixed size of 224 × 224. The image is passed through a series of convolutional layers, followed by three fully-connected layers with different depths. As our task has seven emotions, the final fully-
connected layer was changed to seven dimensions, as shown in Figure 2. The
whole pre-trained model is further trained on the facial emotional dataset for
facial emotion recognition. During this new training, the previous layers of the
pre-trained VGG16 are kept fixed.
Fig. 2. The architecture of modified VGG16.
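As a concrete sketch of this modification (our own illustration in PyTorch, not the authors' code), the convolutional layers of an ImageNet-pretrained VGG16 are frozen and the final fully-connected layer is replaced by a seven-way emotion head:

```python
import torch.nn as nn
from torchvision import models

def build_emotion_vgg16(num_emotions=7):
    """ImageNet-pretrained VGG16 adapted for seven-class facial emotion recognition."""
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    for param in model.features.parameters():
        param.requires_grad = False          # keep the convolutional layers fixed
    # replace the final 1000-way classifier layer with a 7-dimensional emotion head
    model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_emotions)
    return model
```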
Layer-Wise Relevance Propagation (LRP) can explain the relevance of inputs
for a certain prediction, typically for image processing. So we can see which
part of the input images, or more precisely which pixels, most contribute to
a specific prediction. As a model-specific method, LRP generally assumes that
the classifier can be decomposed into several layers of computation [24]. In the
forward pass process, the image goes through a convolutional neural network
for feature extraction from the image. Then, these extracted features are input
to a classifier with a fully connected neural network and a softmax layer which
gives the final prediction. At this point, we are interested in why the model arrives at that prediction. LRP goes in reverse order over the layers visited in the forward pass and calculates relevance scores for each of the neurons in each layer until it arrives back at the input. We can then calculate the relevance for each pixel of the input image. Positive relevance scores indicate how much
the pixels contribute to the model prediction, while negative values mean these pixels speak against it; this leads to the heatmap result.
When the LRP model is applied to the trained neural network, it propagates the classification function f(x) backward through the network, from the output layer to the input layer, through pre-defined propagation rules. Let j and k be neurons at two consecutive layers. Propagating the relevance scores R_k^{(l+1)} at a given layer l+1 onto the neurons of the lower layer l is achieved by applying the following rule [20]:

R_j^{(l)} = \sum_k \frac{Z_{jk}}{\sum_j Z_{jk}} R_k^{(l+1)}    (1)

The quantity Z_{jk} models how much importance neuron j has contributed to making neuron k relevant:

Z_{jk} = x_j^{(l)} w_{jk}^{(l,l+1)}    (2)
The relevance of a neuron is calculated according to Formula 1, which computes the relevance R for a neuron j in layer l. So our current layer is l, and the output layer becomes l+1. The calculation for neuron j works as follows. For each neuron j in layer l, we calculate the activation associated with neuron j according to Z_{jk}: it simply multiplies the input of neuron j in the current layer with the weight that goes into neuron k in the next layer. This input x comes from passing the pixel values through the previous layers and shows how strong the activation between these neurons is. Intuitively, a high value means that the neuron was very important for the output. We therefore interpret the fraction in Formula 1 as the relative activation of a specific neuron, compared to all activations in that layer. Finally, we multiply the relevance score of the neuron in the next layer with this relative value to propagate the relevance of the next layer backwards. The propagation procedure does not stop until it reaches the input layer.
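For fully-connected layers, the propagation rule of Equations (1) and (2) can be written compactly as a backward pass over the stored forward activations. The sketch below is our own NumPy illustration (with a small epsilon added to the denominator for numerical stability, a common variant), assuming `weights` and `activations` come from an already trained network:

```python
import numpy as np

def lrp_dense(weights, activations, relevance_out, eps=1e-9):
    """Propagate relevance from the output layer back to the input pixels.

    weights:       list of weight matrices W[l] with shape (n_l, n_{l+1})
    activations:   list of forward-pass activations x[l]; x[0] is the flattened input
    relevance_out: relevance of the output layer, e.g. the predicted class score f(x)
    """
    relevance = relevance_out
    for W, x in zip(reversed(weights), reversed(activations[:-1])):
        Z = x[:, None] * W                   # Z_jk = x_j^(l) * w_jk^(l,l+1)  (Eq. 2)
        denom = Z.sum(axis=0) + eps          # sum over j of Z_jk, stabilised
        relevance = (Z / denom) @ relevance  # Eq. 1, applied to the whole layer at once
    return relevance                         # pixel-level relevance of the input
```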
3.2 Verbal-based Explanation
According to the facial action units (AUs) that make up an expression, FACS
divides the face into upper and lower parts and subdivides the facial action units
into different AUs to encode facial emotions, as shown in Figure 3. As an open-source software package, Openface can detect action units and identify the corresponding intensity of each activated AU. To explain the relationship between action units and
emotion, LIME was used to calculate and visualize the contribution of each AU
to the predicted emotion in our work. AUs are defined as subtle facial muscle
movements. According to the physiological distribution of facial muscles and
related characteristics, the movements of different facial muscles can be classified
into different AUs [23]. Each AU represents a facial behaviour generated with
an anatomically distinct facial muscle group [10]. The combination of AUs can
produce most facial expressions, and the goal of facial AU recognition is to detect
AU and calculate AU intensity for each input face expression. Here Openface
was used in our work to detect and estimate the intensity of AUs from input
images, which is shown in Figure 4.
Fig. 3. The illustration of AUs. Fig. 4. The workflow of Openface
Local Interpretable Model-agnostic Explanations (LIME) aims to explain
any black-box model by creating a local approximation, which can approximate
the original model in the vicinity of an individual instance. It works on almost
any input format, such as text, tabular data, images or even graphs.
\xi(x) = \underset{g \in G}{\mathrm{argmin}} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)    (3)
The idea behind LIME is quite intuitive. Suppose we know the properties of an input data point x in a tabular format. In the optimization formula above, the complex model is denoted by f and the simple, local model by g, where g comes from a set of interpretable models denoted by G; here G is a family of sparse linear models, such as linear regression.
The first term, the loss L, tries to find an approximation of the complex model f by the simple model g in the neighbourhood of our data point x; in other words, we want a good approximation in the local neighbourhood. The third argument, π_x, defines the local neighbourhood of that data point and is a kind of proximity measure.
The second term, Ω(g), regularizes the complexity of the simple surrogate model. For linear regression, for instance, a desirable condition could be to have many zero-weighted input features, so that ignoring most features and including only a few makes the explanation simpler; for a decision tree, a relatively small depth keeps it comprehensible for humans. Overall, Ω is a complexity measure, and since this is a minimization problem, we are also trying to minimize Ω(g). In summary, this objective says that we look for a simple model g.
To minimize these two terms, g should approximate the complex model in that local area while staying as simple as possible. In the first step, we simply generate some new data points in the neighbourhood of our input data point.
More specifically, we randomly generate data points everywhere, but they will
be weighted according to their distance to our input data point, as we are just interested in the local area around our input. These data points are generated by
perturbations. This can be achieved by sampling from a normal distribution with
the mean and standard deviation for each feature. Then we get the prediction
for these data points using our complex model f.
\mathcal{L}(f, g, \pi_x) = \sum_{z, z' \in \mathcal{Z}} \pi_x(z) \left( f(z) - g(z') \right)^2    (4)
We minimize the first loss term by getting the highest accuracy on that new data set using a simple linear model, for instance by minimizing the sum of squared distances between the predictions and the ground truth. The loss function used to optimize the linear model is basically the sum of squared distances between the label, which comes from the complex model f, and the prediction of the simple model g [9]. Additionally, the proximity π_x is used to weight the loss according to how close each data point is. An exponential kernel is used as the distance metric, so we can think of this like a heatmap: the points that are close to our input data point are weighted the most. That is how we ensure that the model is locally faithful. The second loss term makes sure that our model stays simple. In LIME, a sparse linear model is used; in practice, this can be achieved with a regularization technique. This way we ensure a simple explanation with only a few relevant variables. In summary, LIME fits an interpretable linear model in that local area, which is a local approximation of the complex model.
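To make Equations (3) and (4) concrete, the sketch below is our own minimal LIME-style surrogate for a tabular instance (for example, a vector of AU intensities); `predict_proba` is assumed to be the black-box classifier and `feature_scales` the per-feature standard deviations used for sampling. In practice, the `lime` Python package can be used instead of this hand-rolled version.

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(predict_proba, x, target_class, feature_scales,
                    num_samples=5000, kernel_width=0.75):
    """Fit a weighted linear surrogate g around the instance x."""
    # 1. perturb: sample points around x from a normal distribution per feature
    Z = x + np.random.normal(scale=feature_scales, size=(num_samples, x.shape[0]))
    # 2. label the perturbed points with the complex model f
    f_z = predict_proba(Z)[:, target_class]
    # 3. proximity weights pi_x(z): exponential kernel on the distance to x
    dist = np.linalg.norm(Z - x, axis=1)
    pi = np.exp(-(dist ** 2) / (kernel_width ** 2))
    # 4. minimise the weighted squared loss of Eq. (4) with an L2 penalty (Omega)
    g = Ridge(alpha=1.0)
    g.fit(Z, f_z, sample_weight=pi)
    return g.coef_                            # per-feature contribution near x
```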
4 Model Evaluation and Results
4.1 KDEF Datasets and Pre-Processing
The Karolinska Directed Emotional Faces (KDEF) [11] dataset consists of 4900 facial expression photos of 70 individuals (half male and half female, aged 20 to 30). Each person imitates seven different facial emotions, and each facial expression is recorded from five camera views. In this paper, we only use the front-face photos in our experiment, as our robot mainly interacts with a human user in a frontal view. Some examples are shown in Figure 5. That means we used one-fifth of the dataset, 980 pictures in total, so each emotion subset contains 140 front-view images. The face images were rescaled to a standard 224 × 224 pixels with three colour channels, to fit the input format of the classification model, and we randomly split the front-face dataset into training, validation and testing parts with a ratio of 700:140:140.
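A possible preprocessing pipeline for this step is sketched below (our own illustration; the directory layout `kdef_front/<emotion>/` and the absence of extra augmentation are assumptions), resizing the frontal images to 224 × 224 and applying the 700/140/140 split:

```python
import torch
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # rescale to the VGG16 input size
    transforms.ToTensor(),           # 3-channel tensor in [0, 1]
])

# assumed layout: kdef_front/<emotion>/<image>.jpg, frontal views only
dataset = datasets.ImageFolder("kdef_front", transform=preprocess)

# 700 training, 140 validation and 140 testing images
train_set, val_set, test_set = torch.utils.data.random_split(
    dataset, [700, 140, 140], generator=torch.Generator().manual_seed(0))

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```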
4.2 Multimodal Explanation
In affective HRI, the robot not only recognizes the human interactor's emotion but also provides multimodal explainable feedback: visual feedback, an explainable heatmap extracted from the LRP model that illustrates the pixel contributions to the emotion recognition, and verbal feedback, understandable robot speech that explains the facial AU activations behind the recognized emotion.
Fig. 5. Sample of the KDEF dataset. Fig. 6. Visual-based explanation
Based on the pre-trained VGG16 model described in the visual explanation part, our face emotion recognition model is further trained on an Nvidia RTX 2080Ti graphics card. We set the batch size to 32 and used Adam as the optimization algorithm, with a learning rate of 0.00001. After 250 epochs of training, the model achieves a classification testing accuracy of 91.4% on the KDEF dataset. The predicted result and model parameters were fed to LRP, and the pixel-wise contribution was then calculated and shown as a heatmap for explanation. For example, the comparison of the two heatmap images in Figure 6 shows that the robot uses similar feature pixels in two different faces. This means that when VGG16 classifies a face as the 'surprise' emotion, the robot relies more on feature pixels near the eyes, nostrils and lips to make its prediction, which is in line with theories of human emotion perception and cognition [19].
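One possible way to reproduce this step is sketched below using the LRP implementation of the Captum library (our assumption; the paper does not state which LRP implementation was used). Here `model` is the fine-tuned VGG16 and `image` a normalized 1 × 3 × 224 × 224 tensor:

```python
import torch
from captum.attr import LRP

def lrp_heatmap(model, image):
    """Pixel-level LRP relevance map for the predicted emotion class."""
    model.eval()
    with torch.no_grad():
        predicted = model(image).argmax(dim=1)     # predicted emotion class
    attribution = LRP(model).attribute(image, target=predicted)
    # sum relevance over the colour channels -> one 224 x 224 relevance map
    return attribution.sum(dim=1).squeeze(0).detach().cpu().numpy()
```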
In the verbal explanation part, we use Openface to extract the activations of 16 AUs, which are used for emotion recognition with a random forest. The AU-based explanation chart was then generated by LIME, as shown in Figure 7. The blue bars indicate positive contributions while the orange bars indicate negative contributions to the surprise prediction. According to the histogram, AU26 (Jaw Drop) and AU05 (Upper Lid Raiser) make the biggest positive contribution to the prediction. The blanks of a predefined text template were then filled with the names of the AUs that make the most significant contributions. Finally, Text-to-Speech (TTS) generates the voice explanation for the robot's speech.
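A sketch of this AU-based pipeline is shown below (our own illustration; the arrays `X`, `y` and `x_test` holding the Openface AU intensities and emotion labels, as well as the exact AU list, are assumptions). It trains the random forest and queries the `lime` package for a per-AU explanation of one prediction:

```python
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# assumed inputs: X (n_samples, 16) Openface AU intensities, y emotion labels,
# x_test a single AU-intensity vector to be explained
AU_NAMES = ["AU01", "AU02", "AU04", "AU05", "AU06", "AU07", "AU09", "AU10",
            "AU12", "AU14", "AU15", "AU17", "AU20", "AU23", "AU25", "AU26"]
EMOTIONS = ["afraid", "angry", "disgusted", "happy", "neutral", "sad", "surprised"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

explainer = LimeTabularExplainer(X, feature_names=AU_NAMES,
                                 class_names=EMOTIONS, mode="classification")
exp = explainer.explain_instance(x_test, clf.predict_proba,
                                 num_features=5, top_labels=1)
label = exp.available_labels()[0]          # the predicted emotion
# list of (AU condition, weight) pairs, e.g. ('AU26 > 2.1', 0.31), used to
# fill the blanks of the verbal explanation template
au_contributions = exp.as_list(label=label)
```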
4.3 Test on Pepper Robot
In this work, we have tested our multimodal explanation methods on the Pepper robot. During the experiments, a person interacts with the Pepper robot, which can simultaneously recognise their facial expressions and verbalise the predicted human face emotion as verbal feedback in HRI. The related speech is generated based on the emotion recognition results. For example, if the emotion recognition result is happy, the explainable speech sentence will be "As I noticed your Cheek Raiser and Lip Corner Puller, I think you are happy." The speech voice is synthesized through the Text-To-Speech (TTS) tool of the Naoqi SDK of the Pepper robot. Based on the LRP model, the robot can also extract the heatmap images as a pixel-level explanation for the interactor. The original face and the heatmap face are shown on the Pepper chest screen as interpretable visual feedback. Through verbal and visual feedback, this explainable system has the benefit of supporting trustworthy human-robot interaction.
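A sketch of how the verbal and visual feedback could be delivered through the Naoqi SDK is given below; the template sentence follows the example above, while the proxy usage, the robot address and the heatmap URL are assumptions about the deployment rather than code from the paper.

```python
from naoqi import ALProxy

PEPPER_IP, PORT = "192.168.1.10", 9559          # assumed robot address

def explain_on_pepper(emotion, top_aus, heatmap_url):
    """Speak the AU-based explanation and show the LRP heatmap on the chest screen."""
    tts = ALProxy("ALTextToSpeech", PEPPER_IP, PORT)
    tablet = ALProxy("ALTabletService", PEPPER_IP, PORT)
    # fill the predefined template with the two most contributing AU names
    sentence = "As I noticed your {} and {}, I think you are {}.".format(
        top_aus[0], top_aus[1], emotion)
    tts.say(sentence)                            # verbal explanation
    tablet.showImage(heatmap_url)                # visual explanation on the tablet

# e.g. explain_on_pepper("happy", ["Cheek Raiser", "Lip Corner Puller"],
#                        "http://<server>/heatmap.png")
```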
Fig. 7. An example of AU-based explanation for the surprise emotion
5 Conclusion and Future Work
Robotic systems may become more commonplace, but at the same time, more complex. When robots fail to express their intentions, people not only feel uncomfortable but also find the robots untrustworthy. People need to know how a robot recognizes human emotion in order to assess when such systems can be trusted, even if the robots follow a reasonable decision-making process.
In conclusion, this paper integrates two explainable methods in emotion
recognition for trustworthy HRI. Using the explainable method LRP, the robot
can extract the facial heatmap that highlights significant parts of the facial
pixels most responsible for the emotion prediction task. The visualized atten-
tion heatmap and verbal feedback, can help the user understand the perceptual
mechanism the robot uses to recognise emotions. Thus the explainable method
provides essential insights into the natural features of the prediction model.
In this work, we only completed the essential human facial emotion recognition and the related explanation, and have not yet conducted much work on trust validation of our affective HRI model with XAI in human-robot interaction scenes. As future work, more tests with human participants will be conducted for trust evaluation, to explore the effectiveness of our multimodal explanation, and we will explore how to use this feedback for human-in-the-loop robot learning to improve the robot's emotional perception ability in dynamic HRI scenes.
References
1. Adadi, A., Berrada, M.: Peeking inside the black-box: a survey on explainable
artificial intelligence (xai). IEEE access 6, 52138–52160 (2018)
2. Arrieta, A.B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., García, S., Gil-López, S., Molina, D., Benjamins, R., et al.: Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58, 82–115 (2020)
3. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7), e0130140 (2015)
4. Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.P.: Openface 2.0: Facial behavior
analysis toolkit. In: 2018 13th IEEE international conference on automatic face &
gesture recognition (FG 2018). pp. 59–66. IEEE (2018)
5. Dubey, A.K., Jain, V.: Automatic facial recognition using vgg16 based transfer
learning model. Journal of Information and Optimization Sciences 41(7), 1589–
1596 (2020)
6. Ekman, P., Friesen, W.V.: Facial action coding system. Environmental Psychology
& Nonverbal Behavior (1978)
7. Ekman, P., Friesen, W.V., Ellsworth, P.: Emotion in the human face: Guidelines
for research and an integration of findings, vol. 11. Elsevier (2013)
8. Ivanovs, M., Kadikis, R., Ozols, K.: Perturbation-based methods for explaining
deep neural networks: A survey. Pattern Recognition Letters 150, 228–234 (2021)
9. Kavila, S.D., Bandaru, R., Gali, T.V.M.B., Shafi, J.: Analysis of cardiovascular dis-
ease prediction using model-agnostic explainable artificial intelligence techniques.
In: Principles and Methods of Explainable Artificial Intelligence in Healthcare, pp.
27–54. IGI Global (2022)
10. Lien, J.J., Kanade, T., Cohn, J.F., Li, C.C.: Automated facial expression recogni-
tion based on facs action units. In: Proceedings third IEEE international conference
on automatic face and gesture recognition. pp. 390–395. IEEE (1998)
11. Lundqvist, D., Flykt, A., Öhman, A.: Karolinska directed emotional faces. Cognition and Emotion (1998)
12. Malik, S., Kumar, P., Raman, B.: Towards interpretable facial emotion recognition.
In: Proceedings of the Twelfth Indian Conference on Computer Vision, Graphics
and Image Processing. pp. 1–9 (2021)
13. Martinez, M., Multani, N., Anor, C.J., Misquitta, K., Tang-Wai, D.F., Keren, R.,
Fox, S., Lang, A.E., Marras, C., Tartaglia, M.C.: Emotion detection deficits and
decreased empathy in patients with alzheimer’s disease and parkinson’s disease
affect caregiver mood and burden. Frontiers in Aging Neuroscience 10, 120 (2018)
14. Montavon, G., Binder, A., Lapuschkin, S., Samek, W., Müller, K.R.: Layer-wise relevance propagation: an overview. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 193–209 (2019)
15. Nie, W., Zhang, Y., Patel, A.: A theoretical explanation for perplexing behaviors
of backpropagation-based visualizations. In: International Conference on Machine
Learning. pp. 3809–3818. PMLR (2018)
16. Rathod, J., Joshi, C., Khochare, J., Kazi, F.: Interpreting a black-box model used
for scada attack detection in gas pipelines control system. In: 2020 IEEE 17th India
Council International Conference (INDICON). pp. 1–7. IEEE (2020)
17. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1135–1144 (2016)
18. Robnik-Šikonja, M., Bohanec, M.: Perturbation-based explanations of prediction models. In: Human and Machine Learning, pp. 159–175. Springer (2018)
19. Rosenberg, E.L., Ekman, P.: What the face reveals: Basic and applied studies of
spontaneous expression using the Facial Action Coding System (FACS). Oxford
University Press (2020)
20. Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K.R.: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, vol. 11700. Springer Nature (2019)
21. Taheri, A., Meghdari, A., Alemi, M., Pouretemad, H.: Human–robot interaction in
autism treatment: a case study on three pairs of autistic children as twins, siblings,
and classmates. International Journal of Social Robotics 10(1), 93–113 (2018)
22. Tian, Y.I., Kanade, T., Cohn, J.F.: Recognizing action units for facial expression
analysis. IEEE Transactions on pattern analysis and machine intelligence 23(2),
97–115 (2001)
23. Yao, L., Wan, Y., Ni, H., Xu, B.: Action unit classification for facial expression
recognition using active learning and svm. Multimedia Tools and Applications
80(16), 24287–24301 (2021)
24. Yin, P., Huang, L., Lee, S., Qiao, M., Asthana, S., Nakamura, T.: Diagnosis of
neural network via backward deduction. In: 2019 IEEE International Conference
on Big Data (Big Data). pp. 260–267. IEEE (2019)
25. Yu, C., Tapus, A.: Interactive robot learning for multimodal emotion recognition.
In: International Conference on Social Robotics. pp. 633–642. Springer (2019)
26. Yu, C., Tapus, A.: Multimodal emotion recognition with thermal and rgb-d cameras
for human-robot interaction. In: Companion of the 2020 ACM/IEEE International
Conference on Human-Robot Interaction. pp. 532–534 (2020)